Jeremy Corbello

The Self-Consuming Spiral of AI: How AI-Generated Data May Corrupt Future Models


The proliferation of generative artificial intelligence (AI) has led to an explosion of AI-generated content across the internet. While this has opened new avenues for creativity and productivity, it also presents a potential pitfall for future AI models. This post summarizes an article by Rahul Rao, originally published in Scientific American, which explores how AI models may inadvertently introduce errors into future generations of models by ingesting AI-generated data.


The Poisonous Cycle:

Generative AI, capable of producing text, computer code, images, and music, is increasingly being used to populate the internet with content. As AI developers scrape the internet for data to train new models, however, they may inadvertently include AI-generated content, introducing errors that accumulate over successive generations of models. This phenomenon, known as "model collapse," has been likened to the contamination of newly made steel by radioactive fallout from 20th-century nuclear testing.


The Evidence and Implications:

A growing body of evidence suggests that even a small amount of AI-generated text in the training data can eventually become "poisonous" to the model being trained. This effect is observed even in relatively modest models, with errors building atop one another with each iteration. The research indicates that a model will suffer most at the "tails" of its data—the data elements that are less frequently represented in a model’s training set. This could lead to a loss of diversity in AI's output and exacerbate existing biases against marginalized groups.
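To make this dynamic concrete, here is a minimal sketch of a self-consuming training loop. It is not the experiment from the research the article describes; it assumes a toy world in which the "model" is simply a Gaussian fit to a small sample, and each generation trains only on the previous generation's output:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0 trains on "real" data drawn from the true distribution N(0, 1).
samples = rng.normal(loc=0.0, scale=1.0, size=20)

for gen in range(1, 51):
    # "Training": fit a Gaussian to whatever data this generation sees.
    mu, sigma = samples.mean(), samples.std()
    # The next generation trains only on the previous model's output,
    # so estimation error compounds instead of averaging out.
    samples = rng.normal(loc=mu, scale=sigma, size=20)
    if gen % 10 == 0:
        print(f"generation {gen:2d}: mu = {mu:+.3f}, sigma = {sigma:.3f}")
```

Run long enough, the fitted sigma tends to drift toward zero: rare, extreme values are the first to stop being sampled, so the tails of the distribution vanish before the center does, mirroring the "tails suffer most" pattern the research describes.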


The Current Scenario:

AI-generated content is already beginning to infiltrate areas that machine-learning engineers rely on for training data. Mainstream news outlets have started publishing AI-generated articles, and some Wikipedia editors are considering using language models to produce content for the site. Moreover, crowd-work platforms, such as Amazon’s Mechanical Turk, which are often used to annotate models’ training data or review output, are also seeing an influx of AI-generated content.


Potential Solutions:

To counter the threat of model collapse, researchers are considering the use of data known to be free from generative AI’s influence. This could involve the use of "standardized" image data sets curated by humans. However, discerning human-generated data from synthetic content and filtering out the latter is far from a straightforward task. The challenge lies not only in developing the technology for this but also in defining what constitutes AI-generated content in a world where tools like Adobe Photoshop allow users to edit images with generative AI.
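For illustration only, here is what such a filtering step might look like in code. The `synthetic_score` function is a hypothetical stand-in for a provenance detector; the article's point is precisely that no reliable version of it exists today:

```python
from typing import Callable, Iterable

def filter_training_corpus(
    documents: Iterable[str],
    synthetic_score: Callable[[str], float],  # hypothetical detector: 0.0 = human, 1.0 = AI
    threshold: float = 0.5,
) -> list[str]:
    """Keep only documents the detector judges likely to be human-written.

    An imperfect detector means every threshold choice trades recall
    (human text wrongly discarded) against contamination (AI text admitted).
    """
    return [doc for doc in documents if synthetic_score(doc) < threshold]
```

Even this sketch assumes a clean binary split between human and AI content, a distinction that, as the Photoshop example above suggests, may not exist in practice.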


Conclusion:

The rise of generative AI and the subsequent influx of AI-generated content present a unique challenge for the development of future AI models. While the technology offers immense potential, it also underscores the need for careful consideration of the data used to train these models. As we navigate this new landscape, the focus must be on ensuring the integrity of training data to prevent the self-consuming spiral of AI.


 