Using synthetic, or artificially generated, data to train AI algorithms is a burgeoning practice with significant potential. It can address data scarcity, privacy, and bias issues, but it also raises concerns about data quality, security, and ethical implications. These stakes are heightened in the Global South, where data scarcity is much more severe than in the Global North. Synthetic data can therefore address the problem of missing data, leading, in the best case, to better representation of populations in datasets and more equitable outcomes. However, we cannot consider synthetic data to be better than, or even equivalent to, actual data from the physical world. In fact, using synthetic data carries many risks, including cybersecurity risks, bias propagation, and increased model error. This technology brief proposes recommendations for the responsible use of synthetic data in AI training, along with associated guidelines to regulate that use.
Suggested citation: Marwala Tshilidzi, Fournier-Tombs Eleonore and Stinckwich Serge. The Use of Synthetic Data to Train AI Models: Opportunities and Risks for Sustainable Development. UNU Centre, UNU-CPR, UNU Macau, 2023.