Generative artificial intelligence (AI) has revolutionized how we create content, from narratives to images. The statistical principle of maximum likelihood estimation (MLE) is central to many generative models. MLE enables models to learn from large datasets by finding the model parameters that maximize the probability of the observed data, so that outputs match the patterns in that data. While this technique has proven highly effective for generating realistic content, it has significant limitations when it comes to accurately describing rare or underrepresented information.
It is crucial to understand these limitations to make informed decisions and drive progress in AI. This issue became apparent in my recent interaction with ChatGPT, a widely used AI tool. When asked, “Who is Tshianeo Marwala?”, ChatGPT confidently responded, “Tshianeo Marwala was the wife of Professor Tshilidzi Marwala.” However, Mrs. Tshianeo Marwala is my grandmother, not my spouse. The AI made this mistake because, statistically, public figures are more frequently associated with their spouses in data than with their grandparents. This demonstrates a core limitation of MLE: it prioritizes relationships or data points that are more common in the training data, leaving it prone to errors in rarer situations.
A deeper layer to this issue is the reliance of generative AI models on concentrated platforms such as Wikipedia and a small set of high-traffic, widely used sources. While Wikipedia serves as an invaluable knowledge base, its editorial bias, data gaps, and uneven global coverage affect the diversity and accuracy of information fed into AI models. As a result, generative AI systems, already biased toward statistically frequent patterns, are further constrained by an overdependence on a limited set of centralized platforms. This compounds the challenge of accurately representing marginalized perspectives and rare data.
Understanding maximum likelihood estimation in generative AI
At its core, MLE selects the parameters of a model that maximize the likelihood of the observed data. This approach works well for information that appears frequently in the training data. For example, if an AI is trained on large datasets that frequently discuss public figures and their spouses, it will be inclined to associate public figures with their spouses more often than with other family members. This is what led to the error of identifying Mrs. Tshianeo Marwala as a spouse when, in fact, she is my grandmother.
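To make this concrete, consider a minimal sketch in Python. The relation counts in the toy corpus are hypothetical, invented purely for illustration: for a simple categorical model, the maximum likelihood estimate of each relation’s probability is just its relative frequency in the data, so the most frequent relation wins every time.

```python
# Toy illustration of how MLE favours frequent patterns.
# The corpus counts below are hypothetical.
from collections import Counter

# Hypothetical training mentions pairing a public figure with a
# family relation: spouses dominate, grandparents are rare.
corpus = ["spouse"] * 950 + ["child"] * 40 + ["grandmother"] * 10

# For a categorical distribution, the MLE of each relation's
# probability is simply its relative frequency.
counts = Counter(corpus)
total = sum(counts.values())
mle_probs = {relation: n / total for relation, n in counts.items()}

print(mle_probs)
# {'spouse': 0.95, 'child': 0.04, 'grandmother': 0.01}

# A model that always answers with the most likely relation will
# say "spouse" every time, even when "grandmother" is the truth.
print(max(mle_probs, key=mle_probs.get))  # spouse
```

Nothing in this sketch is wrong from a statistical standpoint; the model is simply faithful to the frequencies it was given, which is precisely the problem.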
MLE, however, struggles when it encounters data that does not conform to these frequent patterns. This inherent bias towards what is statistically more common underscores the need for a more nuanced approach. In our research, outlined in our 2000 paper titled “Detection and Classification of Faults Using Maximum Likelihood and Bayesian Approaches”, we explored how MLE methods can detect and classify system faults. The paper demonstrates that MLE works well with clearly defined, frequent data patterns but struggles with rare faults that demand subtler treatment. Similarly, in generative AI, MLE is well suited to frequently represented information but fails when tasked with accurately representing less common relationships or facts.
MLE’s undermining of innovation and diversity
Beyond its challenges in representing rare information, MLE can also undermine innovation by limiting the diversity of ideas and product differentiation. In many cases, breakthroughs and new perspectives emerge from the margins of probability distributions, where less likely but more unique ideas reside. By prioritizing the most statistically common patterns, MLE discourages exploration of the outlier ideas that often drive innovation. As a result, generative AI systems that rely too narrowly on MLE may produce content and solutions that converge toward the mainstream, lacking the creativity and distinctiveness needed to foster true innovation. In industries ranging from product design to content creation, this narrowing of scope could stifle differentiation, leading to homogeneity rather than the diversity of ideas that fuels progress and competitiveness.
The problem of underrepresentation
This tendency to favour common patterns leads to significant problems when AI interacts with underrepresented data. Consider a generative AI trained primarily on centralized platforms like Wikipedia, Western news outlets, and dominant cultural narratives. The model will excel at generating content based on prevailing perspectives, but it will struggle to represent indigenous knowledge systems, niche academic fields, or lesser-known cultural traditions. These areas, being underrepresented in the data, are misrepresented or omitted entirely by models that rely heavily on MLE and concentrated data sources. Therefore, when a system is trained on data that overrepresents certain groups or relationships, it cannot accurately handle the full diversity of human experience.
Generative AI and bias
MLE not only limits the representation of rare information but also amplifies the biases already present in the data. If a training dataset reflects societal inequalities, such as the overrepresentation of certain genders, ethnicities, or regions, an MLE-based model will reinforce those patterns. The misidentification of Mrs. Tshianeo Marwala, driven by the AI’s bias toward commonly represented familial relationships, is emblematic of a broader challenge: AI struggles with complexity in relationships, cultures, and identities that are less represented in its training data.
The concentration of knowledge from sources like Wikipedia exacerbates these biases. Despite its utility, Wikipedia has gaps and biases, particularly in its coverage of marginalized voices, Global South perspectives, and less-documented cultural histories. This over-reliance creates a feedback loop where the AI outputs are limited to the same narrow scope of knowledge, perpetuating existing biases and misrepresentations.
The need for distributed knowledge systems in generative AI
To address these limitations, we must build generative AI systems capable of synthesizing distributed information from a wide range of sources rather than depending on a small set of concentrated platforms. This will require AI to aggregate data from diverse databases, repositories, and knowledge systems worldwide, capturing more inclusive and representative insights. By integrating distributed knowledge, AI can better understand underrepresented information and reduce bias in its outputs.
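As a rough illustration of what source-level aggregation could look like, consider the Python sketch below. The source names, lookup tables, and equal weights are hypothetical placeholders, not a description of any deployed retrieval system.

```python
# Hypothetical sketch of aggregating answers across several
# independent knowledge sources instead of one dominant platform.
# All source names and entries are invented for illustration.

# Each "source" maps a query to a candidate answer; in practice
# these would be retrieval calls against separate databases,
# archives, or community repositories.
sources = {
    "global_encyclopedia": {"Tshianeo Marwala": "spouse"},
    "regional_archive": {"Tshianeo Marwala": "grandmother"},
    "oral_history_corpus": {"Tshianeo Marwala": "grandmother"},
}

# Rather than letting the highest-traffic source dominate, give
# each source a curated weight and vote over their answers.
weights = {
    "global_encyclopedia": 1.0,
    "regional_archive": 1.0,
    "oral_history_corpus": 1.0,
}

def aggregate(query):
    votes = {}
    for name, knowledge_base in sources.items():
        answer = knowledge_base.get(query)
        if answer is not None:
            votes[answer] = votes.get(answer, 0.0) + weights[name]
    return max(votes, key=votes.get) if votes else None

print(aggregate("Tshianeo Marwala"))  # grandmother
```

Here two smaller sources outvote the single dominant one; the open design question is how such weights should be set and audited in practice.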
Moreover, generative AI must also be equipped to filter and incorporate new information effectively to ensure its outputs reflect the latest knowledge and advancements. With the rapid pace of information generation, the ability to assess, prioritize, and synthesize new data points will be crucial for AI to remain accurate and relevant. By designing systems that continuously update from various sources, we can create AI models that are more current and capable of handling nuanced and evolving contexts.
Addressing the issue: beyond maximum likelihood
To mitigate these challenges, researchers are exploring alternatives to pure MLE. One promising method is reinforcement learning, which allows models to learn through feedback and adjust based on incentives, rewarding accuracy and diversity rather than statistical frequency alone. Though not perfect, this approach can help ensure that rare information is more accurately represented, even when it appears less frequently in the data.
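The sketch below illustrates this idea in Python under heavily simplified assumptions: a toy model keeps a preference score per relation and nudges those scores up or down in response to a reward signal, a simple bandit-style update rather than the full machinery of modern reinforcement learning from human feedback. The rewards, learning rate, and round count are illustrative only.

```python
# Simplified, bandit-style sketch of learning from feedback.
# All numbers here are illustrative assumptions.
import math
import random

relations = ["spouse", "child", "grandmother"]
scores = {r: 0.0 for r in relations}  # start with no preference
lr = 0.5  # learning rate

def probs():
    # Softmax over scores gives the model's answer distribution.
    exps = {r: math.exp(s) for r, s in scores.items()}
    z = sum(exps.values())
    return {r: e / z for r, e in exps.items()}

def feedback_round(truth):
    p = probs()
    answer = random.choices(relations, weights=[p[r] for r in relations])[0]
    reward = 1.0 if answer == truth else -1.0  # corrective feedback
    scores[answer] += lr * reward  # reinforce rewarded answers

# Even if "grandmother" is rare in the corpus, repeated corrective
# feedback pulls the model toward the right answer for this query.
for _ in range(200):
    feedback_round("grandmother")

p = probs()
print(max(p, key=p.get))  # grandmother (with high probability)
```

The point is not the specific update rule but the shift in objective: the model is rewarded for being right, not for echoing what is frequent.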
Another solution is data augmentation, which involves enriching the training data with synthetic or carefully curated examples of underrepresented information. By intentionally introducing rare perspectives into the training data, AI systems can learn to account for more diverse inputs and provide more accurate outputs.
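Returning to the hypothetical corpus from the earlier MLE sketch, the following example shows how oversampling underrepresented relations changes the estimated probabilities. The minimum target share is an arbitrary illustrative choice, and in practice the added examples might be carefully curated or synthetically generated rather than simple duplicates.

```python
# Toy data augmentation: oversample rare relations before the
# frequencies are re-estimated. Counts are hypothetical.
from collections import Counter

corpus = ["spouse"] * 950 + ["child"] * 40 + ["grandmother"] * 10

# Duplicate underrepresented examples until each relation reaches
# a minimum number of training examples.
target = 300
augmented = list(corpus)
for relation, n in Counter(corpus).items():
    if n < target:
        augmented += [relation] * (target - n)

counts = Counter(augmented)
total = sum(counts.values())
print({r: round(n / total, 3) for r, n in counts.items()})
# {'spouse': 0.613, 'child': 0.194, 'grandmother': 0.194}

# "grandmother" now carries enough weight for the model to learn it.
```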
Collaboration between AI developers and experts in underrepresented fields is also critical to improving AI systems. Interdisciplinary partnerships can help ensure that generative AI systems are not solely reliant on statistical likelihood but can instead draw from a more diverse range of knowledge and experiences.
In conclusion, while maximum likelihood estimation has been integral to the development of generative AI, it has clear limitations, particularly when it comes to representing rare or marginalized information. My interaction with ChatGPT regarding Mrs. Tshianeo Marwala, and insights from our paper on the MLE approach, highlight how generative AI systems can falter when asked to describe less common relationships or data points accurately. This issue stems from the AI’s reliance on frequently represented patterns and its dependence on concentrated platforms like Wikipedia for knowledge.
As generative AI continues to permeate various aspects of society, addressing these limitations is crucial to ensuring fairness, inclusivity, and accuracy. Moving beyond maximum likelihood estimation requires both technical innovations and a commitment to diversity in AI development. This evolution will involve designing AI systems that synthesize distributed knowledge and filter new information effectively, making them more adaptive, inclusive, and up to date. Only by doing so can we fully unlock AI’s potential to represent the common and well known as well as the rare, unique, and underrepresented, like my grandmother, Mrs. Tshianeo Marwala.
Suggested citation: Marwala, Tshilidzi. "The Limitations of Maximum Likelihood Estimation in Generative AI: An Obstacle to Representing Rare Information," United Nations University, UNU Centre, 2024-10-21, https://unu.edu/article/limitations-maximum-likelihood-estimation-generative-ai-obstacle-representing-rare.