The Use of Synthetic Data to Train AI Models: Opportunities and Risks for Sustainable Development

Tshilidzi Marwala; Serge Stinckwich

This technology brief explores the potential of synthetic data to accelerate the attainment of the SDGs through AI in the Global South.

Publication Date: 4 Sep 2023

Authors: Tshilidzi Marwala; Eleonore Fournier-Tombs; Serge Stinckwich

Using synthetic, or artificially generated, data to train AI algorithms is a burgeoning practice with significant potential. It can address data scarcity, privacy, and bias issues, but it also raises concerns about data quality, security, and ethical implications. These issues are heightened in the Global South, where data scarcity is much more severe than in the Global North. Synthetic data can therefore help address the problem of missing data, leading, in the best case, to better representation of populations in datasets and more equitable outcomes. However, we cannot consider synthetic data to be better than, or even equivalent to, actual data from the physical world. In fact, there are many risks to using synthetic data, including cybersecurity risks, bias propagation, and increased model error. This technology brief proposes recommendations for the responsible use of synthetic data in AI training, along with associated guidelines to regulate that use.