I recently read about how artificial intelligence (AI) makes its own data to train itself because it is running out of data to train from.
There was a story in the Financial Times about how several top companies use AI to produce data on which the same AI systems are then trained, for example, large language models (LLMs) such as ChatGPT. Another article discusses AI systems trained using data generated by other AI systems.
I have to say from the outset that synthetic data is never better than data from the physical world. For instance, if someone wants to make an AI system that can tell the difference between cancerous cells and normal cells, the best way is to give the AI system images of cancerous and normal cells taken from actual cells, not synthetic cells.
Anything else, like synthetic data of cancerous and healthy cells, makes the AI detection system less reliable. Despite all this, researchers are generating synthetic data.
AI has changed many things about our lives, including synthesizing and using data. One of the most exciting uses of AI is making data that does not exist. Synthetic data is made in a computer instead of coming from actual events. In this regard, synthetic data is not the real deal, but fake!
And fake data produces impaired AI systems. Therefore, synthetic data should be the option of last resort and not the first option when training AI systems. When synthetic data is used to train AI systems, it must be used cautiously.
Synthetic data is artificially generated information that resembles actual data in terms of essential characteristics and statistical properties but does not correspond to actual events. It is frequently employed when actual data is limited, sensitive or costly to collect. When actual data is unavailable or unusable, synthetic data can be utilized for model training, testing and validation.
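To make the definition concrete, here is a minimal sketch of the simplest kind of statistical synthesis: fitting the mean and covariance of a (here simulated, stand-in) "real" dataset and sampling new records from the fitted distribution. The feature names and numbers are purely illustrative assumptions, not from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "real" dataset: 500 records with two correlated features
# (think age and blood pressure). In practice this would be actual data.
real = rng.multivariate_normal(mean=[50, 120], cov=[[100, 40], [40, 80]], size=500)

# Estimate the statistical properties of the real data...
mu = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# ...and draw new records from the fitted distribution. These rows match
# the real data's mean and covariance, but no row corresponds to an
# actual person or event.
synthetic = rng.multivariate_normal(mean=mu, cov=cov, size=500)

print(np.round(mu, 1), np.round(synthetic.mean(axis=0), 1))
```

The synthetic rows "resemble" the real data only as far as the fitted model captures it; anything the model misses, the synthetic data misses too.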
AI, specifically machine learning, is crucial in generating synthetic data.
Generative models such as generative adversarial networks are frequently used to create synthetic data. AI can also generate synthetic data using data augmentation techniques, which create new data by modifying existing data.
But the existing data must be representative: using unrepresentative data to generate new data cannot make the result representative, and that is the dilemma. In the case of image data, possible techniques include rotation, scaling, flipping and cropping, but then again, the representation dilemma also applies.
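The image transformations mentioned above can be sketched in a few lines. This is a minimal illustration on a stand-in array rather than a real photograph; real pipelines would use a library such as a deep-learning framework's augmentation utilities.

```python
import numpy as np

rng = np.random.default_rng(1)

# A stand-in 8x8 grayscale "image"; real augmentation would operate on
# actual photographs, e.g. face images in a training set.
image = rng.integers(0, 256, size=(8, 8))

rotated = np.rot90(image)               # rotation by 90 degrees
flipped = np.fliplr(image)              # horizontal flip
cropped = image[2:6, 2:6]               # central crop
scaled = np.repeat(np.repeat(image, 2, axis=0), 2, axis=1)  # 2x nearest-neighbour upscale

# Each variant counts as a "new" training example derived from the
# original -- but every variant inherits whatever biases the original carries.
print(rotated.shape, flipped.shape, cropped.shape, scaled.shape)
```

Note the last comment: augmentation multiplies examples, not information, which is exactly the representation dilemma the article describes.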
Algorithm bias difficulties
Some studies estimate that as much as 60% of the data used to train AI will be synthetic by 2024. One of the reasons advanced for using synthetic data is to deal with issues of algorithmic bias.
For example, more data is gathered in Europe than in Africa, even though Africa has a larger population than Europe. As a result, algorithms trained using this data for facial recognition, for example, will perform better for European faces than for African faces.
The technological solutions to augment the African dataset with synthetic data so that the AI algorithms understand African faces as well as they understand European faces are fraught with difficulties. Again, the representation dilemma is at play here.
It is tough to use an underrepresented African dataset to create the synthetic African data that is supposed to make that same dataset representative.
The only way this will work is if the original African database, even though limited, has all the classes of people available in the African population, which is not always the case.
Class representation, the distribution of the various categories or classes within the AI training data, is therefore key to unlocking this dilemma: it determines an AI system's fairness and inclusivity.
For instance, in a binary classification problem, the two classes could be “positive” and “negative”. The training data should ideally have an equal or at least adequate representation of all classes to ensure that the model learns to predict all classes accurately.
In practice, however, many datasets used to train AI models are unbalanced, with some classes overrepresented (e.g. European faces in face recognition) and others underrepresented (e.g. African faces). This imbalance can result in skewed AI models that perform well for overrepresented classes (European faces) but unfavourably for underrepresented classes (African faces).
This imbalance in class representation directly impacts the impartiality of AI systems.
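A minimal sketch of how imbalance is measured and naively "fixed" by oversampling. The labels `groupA` and `groupB` and the 9:1 split are hypothetical, chosen only to mirror the overrepresented/underrepresented classes discussed above.

```python
import random
from collections import Counter

random.seed(0)

# Hypothetical unbalanced label list for a face-recognition training set:
# 900 examples of one group, only 100 of another.
labels = ["groupA"] * 900 + ["groupB"] * 100

counts = Counter(labels)
print(counts)  # the majority class dominates 9:1

# Naive fix: randomly duplicate minority-class examples until counts match.
# Note the article's dilemma: duplicating scarce examples adds no new
# information about the underrepresented group.
target = max(counts.values())
balanced = list(labels)
for cls, n in counts.items():
    pool = [x for x in labels if x == cls]
    balanced += random.choices(pool, k=target - n)

print(Counter(balanced))
```

After oversampling, the class counts are equal, but the minority class is still described by the same 100 underlying examples, which is why balancing the numbers alone does not resolve the underlying representation problem.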
A study in 2019 demonstrated that biased training data could result in discriminatory AI systems. For instance, a healthcare AI system trained predominantly on data from one gender may not perform as well for the other gender. This inequality in AI systems can have severe consequences, including exclusion and discrimination.
A study by Buolamwini and Gebru found that commercial gender classification systems had higher error rates for darker-skinned and female individuals due to a lack of training data for these groups. This exclusion can exacerbate existing social disparities and create a digital divide.
One strategy is to reduce the negative impact of class imbalance to ensure equity and inclusion. In addition, AI systems can be made more transparent by disclosing the characteristics of the training data and the system's performance across the various classes.
Ensuring diverse and proportionate class representation in training data is essential when developing inclusive AI systems.
Furthermore, Silicon Valley, the centre of high technology, creativity and social media worldwide, must become more inclusive. Silicon Valley and other similar centres must have people from different backgrounds. Most people working in Silicon Valley are men, mostly white or Asian. There need to be more women and more Black, Latino and Indigenous people.
This lack of diversity affects how AI is designed and used and leads to biased algorithms. Hiring programmes should focus on diversity training to deal with unconscious bias and mentorship of underrepresented groups.
We need to tackle the economic problems that led to the overconcentration of resources in one area to the exclusion of others. The African continent is very much part of the technology value chain. For example, much of the raw materials used in technology are from Africa.
It is, therefore, essential to reform the global financial architecture to ensure that we create a digitally just world. We need to fix these problems so that data poverty that leads to the need to generate synthetic data is minimized, especially in the developing world.
This article was first published by Daily Maverick. Read the original article on the Daily Maverick website.
Suggested citation: Marwala Tshilidzi. "Algorithm Bias — Synthetic Data Should Be Option of Last Resort When Training AI Systems," United Nations University, UNU Centre, 2023-07-28, https://unu.edu/article/algorithm-bias-synthetic-data-should-be-option-last-resort-when-training-ai-systems.