Experts say ChatGPT has already polluted the internet so badly that it’s eroding AI development

When OpenAI released ChatGPT in late 2022, the world crossed a digital threshold. For the first time, a conversational AI could generate coherent, persuasive text at scale — and the implications rippled across industries, from education to law to software development. But beneath the excitement, a quieter concern has been growing among researchers and ethicists: what happens when AI systems begin to learn not from us, but from themselves?
The issue at the heart of this debate is not just about misinformation or intellectual property. It’s about the integrity of the data that underpins every modern AI system. As large language models generate more content — articles, code, forum replies, even academic-style essays — that content increasingly makes its way back into the very datasets used to train newer models. This recursive loop is prompting some experts to warn of a creeping phenomenon known as AI model collapse: the idea that if synthetic data continues to mix unchecked with human-created information, the next generation of AI may be less reliable, less creative, and less grounded in the real world than the last.
To understand this emerging risk, some researchers have turned to a surprisingly apt historical analogy: the search for “low-background steel.” Just as scientists once needed uncontaminated metals to build sensitive radiation detectors after the nuclear age began, today’s AI developers may need access to “clean” pre-2022 data to train systems that remain trustworthy and useful in an increasingly synthetic digital landscape.

The “Low-Background Steel” of Data — Why Clean Data Matters for AI Integrity
In the aftermath of World War II and the dawn of the atomic age, scientists faced an unexpected technical hurdle: steel produced after the detonation of the first nuclear bombs had become subtly contaminated with radioactive particles. These particles, present in the atmosphere due to above-ground nuclear testing, embedded themselves in newly manufactured metals, making them unsuitable for sensitive instruments like Geiger counters or certain types of medical scanners. To overcome this, researchers turned to an unlikely solution — salvaging steel from sunken naval ships that had been submerged before the nuclear era. These relics, such as vessels scuttled in 1919 by the German fleet at Scapa Flow, had been shielded from radioactive fallout by the ocean itself. Known as “low-background steel,” this uncontaminated metal became essential for applications where even the slightest radiation would skew results.
Today, artificial intelligence researchers find themselves facing a disturbingly similar dilemma — not with radioactive particles, but with data. Since the launch of OpenAI’s ChatGPT in late 2022, large language models have poured synthetic content into the digital ecosystem at an unprecedented rate. Articles, essays, code snippets, emails, forum posts, and even reviews are increasingly being generated — or subtly rewritten — by AI. This wave of artificial content is beginning to intermingle with the very training data that fuels future AI development. As a result, experts warn that we may be corrupting the integrity of future models by training them on the outputs of other models, rather than on clean, diverse, and authentically human-created information. The risk is a phenomenon known as AI model collapse, where systems become self-referential and progressively detached from factual accuracy, contextual understanding, and linguistic diversity.
To illustrate this concern, John Graham-Cumming, then CTO of Cloudflare, gave the low-background steel metaphor a digital counterpart. In March 2023, he registered the domain lowbackgroundsteel.ai and began curating discussions of pre-2022 data repositories that remain free of generative AI influence. One such repository is the Arctic Code Vault, a 2020 GitHub initiative designed to preserve open-source code in a long-term archive. Graham-Cumming’s concern was not merely nostalgic; he saw the increasing difficulty of separating authentic human expression from algorithmically generated text as a risk to the epistemic foundations of AI. The analogy is direct: just as atmospheric fallout contaminated newly produced steel, AI pollution could irreversibly compromise the quality of training data, degrading the performance of models over time.
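To make the idea concrete, the crudest form of “low-background” curation is a date cutoff. The sketch below is my own illustration, not anything from Graham-Cumming’s project: it keeps only documents whose creation timestamps predate ChatGPT’s public release, and the field names (“text”, “created_at”) and the cutoff heuristic are invented for the example. Real curation would also need provenance checks, since timestamps can be missing, wrong, or attached to older text that was later rewritten by AI.

```python
# Hypothetical "low-background data" filter: keep documents created before
# ChatGPT's public launch, on the crude assumption that earlier text is
# unlikely to be LLM-generated. Field names are invented for illustration.
from datetime import datetime, timezone

CHATGPT_LAUNCH = datetime(2022, 11, 30, tzinfo=timezone.utc)

def is_low_background(doc):
    """True if the document's creation date predates the generative-AI era."""
    created = datetime.fromisoformat(doc["created_at"])
    if created.tzinfo is None:
        created = created.replace(tzinfo=timezone.utc)
    return created < CHATGPT_LAUNCH

corpus = [
    {"text": "Forum post about sourdough starters", "created_at": "2019-06-01T12:00:00+00:00"},
    {"text": "Blog article, possibly AI-assisted", "created_at": "2024-02-10T09:30:00+00:00"},
]

clean = [doc for doc in corpus if is_low_background(doc)]
print(f"{len(clean)} of {len(corpus)} documents predate the cutoff")
```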
This growing anxiety is not limited to niche circles. Academic researchers and policy experts alike are beginning to echo these concerns. Ilia Shumailov and colleagues have warned of “model collapse” from recursive training, a phenomenon other researchers have dubbed Model Autophagy Disorder (MAD): models “eating” their own output across successive generations. The fear is that as the internet becomes increasingly saturated with synthetic content, each new generation of models becomes more detached from genuine human communication, creativity, and error, the very traits that give human language its depth and meaning. Maurice Chiodo, a mathematician and research associate at Cambridge’s Centre for the Study of Existential Risk, argues that this contamination affects not only truthfulness, but usability. “You can build a very usable model that lies,” he points out, highlighting the tension between usability and truthfulness that sits at the heart of the problem.
Model Collapse — A Real Risk or Technical Overreaction?
As concerns about data contamination in AI training datasets grow, so too does discussion of a looming phenomenon known as model collapse: a scenario in which AI systems, trained repeatedly on the outputs of earlier AI models, begin to degrade in quality, creativity, and factual accuracy. While the term may sound dramatic, it describes a subtle but serious risk: a gradual narrowing of what AI systems can understand and express, and a growing blindness to what they misrepresent or miss altogether. Unlike a software bug or system failure, model collapse doesn’t announce itself with a crash. Instead, it seeps into outputs slowly, through increasing redundancy, false fluency, and a subtle detachment from the real world.
The mechanics behind this degradation are relatively straightforward in theory. As generative AI produces more content, that content is scraped, repackaged, and re-ingested into the training pipelines of newer models. Without careful filtration, these models increasingly rely on synthetic outputs rather than original human input. The result is a kind of digital inbreeding: models that echo prior outputs rather than learn from diverse, grounded, and novel human expression. Academic researchers have described this as Model Autophagy Disorder (MAD) — a metaphor borrowed from cellular biology to describe systems that consume themselves over time. In technical terms, this can lead to a loss of “distributional diversity” — the range of linguistic styles, concepts, and patterns that allow models to generalize effectively.
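The statistical intuition behind this degradation can be demonstrated with a toy experiment. The sketch below is a deliberately simplified illustration of my own, not code from the model-collapse or MAD papers: each “generation” fits a one-dimensional Gaussian to samples drawn from the previous generation’s fitted model. The small sample size and large number of generations are chosen purely to make the effect visible quickly.

```python
# Toy recursive self-training: generation n is "trained" (a mean/std fit)
# only on samples produced by generation n-1. Estimation error compounds.
import numpy as np

rng = np.random.default_rng(0)

def fit(samples):
    """Estimate (mean, std) from data: our stand-in for training a model."""
    return samples.mean(), samples.std()

# Generation 0 learns from "human" data with genuine diversity.
mu, sigma = fit(rng.normal(loc=0.0, scale=1.0, size=20))

for generation in range(1, 201):
    # Every later generation sees only synthetic output of its predecessor.
    synthetic = rng.normal(mu, sigma, size=20)
    mu, sigma = fit(synthetic)
    if generation % 40 == 0:
        print(f"generation {generation:3d}: mean={mu:+.3f}  std={sigma:.3f}")

# Over many rounds the fitted std tends to shrink toward zero: the tails of
# the original distribution, its "diversity", are progressively forgotten.
```

Real language models are vastly more complex, but the reported symptom is analogous: rare styles, dialects, and facts in the tails of the data distribution are the first to fade.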
Despite the clarity of the concept, the actual evidence for model collapse is far from settled. Some researchers argue that the risks are overstated, or at least containable with improved filtering, labeling, and better model design. For example, recent Apple research analyzed several high-performing reasoning models, including offerings from OpenAI and Anthropic’s Claude, and suggested that reasoning performance deteriorates under certain conditions, especially as tasks grow more complex and outputs run up against token limits. However, these findings were met with strong counterarguments. Alex Lawsen of Open Philanthropy, working with assistance from Claude Opus, challenged the methodology, noting that forcing models past their token limits can misrepresent their true capabilities. His critique suggests that, in some cases, what appears to be collapse may simply be poor experimental design rather than an actual decline in model intelligence.
Nonetheless, the volume of synthetic content is growing exponentially, and with it, the difficulty of maintaining clean, diverse datasets. A December 2024 paper from academics at institutions including the University of Cambridge and Heinrich Heine University Düsseldorf reiterated the core concern: that the widespread ingestion of AI-generated content threatens not only the epistemic stability of AI systems but also competition in the AI industry itself. If early market leaders are the only ones with access to pre-2022, human-generated datasets — the “low-background steel” of digital content — then future entrants will be disadvantaged, creating a structural imbalance. Their models may start from a position of contamination, limiting both performance and credibility from the outset.

The New Data Divide — Clean Datasets as a Competitive Advantage
As the internet becomes increasingly saturated with AI-generated content, access to clean, human-created data is emerging as one of the most valuable and unevenly distributed assets in the artificial intelligence race. According to a group of academics from the University of Cambridge, Heinrich Heine University Düsseldorf, and other institutions, the real long-term risk isn’t just that AI systems will degrade in performance — it’s that only a small group of early movers will retain the ability to build reliable, high-functioning models, simply because they still possess uncontaminated training data. This dynamic could create a self-reinforcing loop of dominance, where only the largest, most established players can maintain quality, while newcomers are left with a degraded pool of synthetic content that weakens their models from the outset.
This concern was outlined in a December 2024 academic paper titled “Legal Aspects of Access to Human-Generated Data and Other Essential Inputs for AI Training.” The authors argue that training AI models on post-2022 internet data — now increasingly laced with chatbot responses, AI-written blogs, and synthetic code — is analogous to building sensitive instruments out of radioactive steel. Models trained on such content risk becoming distorted or shallow, not because of intentional manipulation, but because their source material is itself the product of machines with incomplete or biased understanding. In such an environment, control over “low-background data” could become a form of competitive moat — not unlike proprietary chip designs or cloud infrastructure.
Maurice Chiodo, one of the paper’s co-authors, framed the issue in stark terms: if all available datasets are irreversibly contaminated, there may be no going back. The stakes, then, are not just about performance today, but about the future possibility of building trustworthy models at all. “If you’ve completely contaminated all your datasets, all the data environments… it’s very hard to undo,” he cautioned. The threat is not hypothetical — it’s already shaping who has the tools to lead in AI innovation and who may be locked out entirely.

Why Cleaning Up AI Contamination May Be Harder Than We Think
If the proliferation of AI-generated content is polluting the digital environment, then a natural question follows: can we clean it up? The short answer, according to experts, is that it’s extremely difficult — and the long answer reveals a complex web of technical, legal, and jurisdictional obstacles that make straightforward solutions nearly impossible to implement at scale.
One of the most commonly proposed interventions is the mandatory labeling or watermarking of AI-generated content. In theory, if every image, paragraph, or line of code produced by a language model carried a detectable marker, then downstream systems could identify and filter synthetic material before retraining on it. However, as researchers like Maurice Chiodo have pointed out, this approach is far less effective in practice than it sounds. For one, textual watermarking is easily stripped, especially when content passes through multiple platforms or is edited — even slightly — by humans. In a global internet where content is continuously reposted, summarized, and transformed, even the most robust watermarking schemes degrade rapidly.
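To see why even modest editing defeats detection, consider a toy version of the “green list” watermarking approach described in the research literature. The sketch below is my own simplified construction, not any deployed scheme: the vocabulary, hash trick, sampling procedure, and editing rate are all invented for illustration. Detection counts how many consecutive token pairs fall into a pseudorandom “green” set keyed on the preceding token; replacing a fraction of tokens dilutes that statistical signal.

```python
# Toy "green list" watermark: generation favours tokens that hash into a
# green set keyed on the previous token; detection measures how far the
# observed green-token rate sits above the 50% chance baseline.
import hashlib
import math
import random

VOCAB = [f"w{i}" for i in range(1000)]
GREEN_FRACTION = 0.5

def is_green(prev_token, token):
    digest = hashlib.sha256(f"{prev_token}|{token}".encode()).digest()
    return digest[0] < 256 * GREEN_FRACTION

def z_score(tokens):
    """How many standard deviations the green count sits above chance."""
    hits = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    mean, var = n * GREEN_FRACTION, n * GREEN_FRACTION * (1 - GREEN_FRACTION)
    return (hits - mean) / math.sqrt(var)

rng = random.Random(0)

# "Generate" 200 watermarked tokens by biasing each choice toward green.
text = [rng.choice(VOCAB)]
for _ in range(200):
    candidates = rng.sample(VOCAB, 20)
    greens = [t for t in candidates if is_green(text[-1], t)]
    text.append(rng.choice(greens) if greens else rng.choice(candidates))

print("watermarked z-score:", round(z_score(text), 1))

# Light "human editing": replace roughly 30% of tokens with arbitrary words.
edited = [rng.choice(VOCAB) if rng.random() < 0.3 else t for t in text]
print("after editing z-score:", round(z_score(edited), 1))
```

In a typical run the watermarked sequence scores far above chance while the lightly edited copy scores substantially lower; heavier paraphrasing, translation, or summarization pushes the score toward the unwatermarked baseline, which is the fragility Chiodo describes.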
The problem compounds when we consider visual content or audio, which may require jurisdiction-specific labeling standards. What counts as sufficient disclosure in one country might not meet the bar in another. And because web content is scraped globally, the weakest link in enforcement — whether it’s a lenient platform or an under-regulated country — becomes a loophole that undermines the whole system. “Anyone can deploy data anywhere on the internet,” Chiodo explains, “and so because of this scraping of data, it’s very hard to force all operating LLMs to always watermark output that they have.”
Another proposed remedy is to limit or control which data is available for training — for instance, by centralizing access to verified, human-generated datasets. However, this raises its own ethical and political dilemmas. Who decides what data is “clean”? Who governs access, and under what rules? If governments build centralized repositories of trusted datasets, they may inadvertently introduce vulnerabilities around privacy, censorship, and geopolitical misuse. What happens if a previously trustworthy government changes hands? Or if such a dataset becomes a target for state or corporate actors seeking to manipulate future AI systems? As Chiodo noted, even well-intentioned data governance schemes must grapple with serious questions of security, legitimacy, and long-term political stability.
Federated learning, which allows AI models to be trained on local datasets without exporting the data itself, offers a partial solution. This method could allow entities with access to untainted data — such as hospitals, universities, or private archives — to contribute to model training without giving up control. Yet federated learning comes with steep technical requirements, and it demands a high degree of trust and coordination among stakeholders who may not have aligned incentives. Furthermore, this doesn’t solve the core problem for open-domain language models, which are built on the scale and variety of publicly available internet data — the very data most vulnerable to AI contamination.
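As a sketch of the mechanics, the example below implements a bare-bones federated-averaging loop. It is my own illustration of the general idea rather than any framework a hospital or university would actually deploy, and the linear model, client count, and learning rate are arbitrary choices: five simulated data holders each run a few gradient steps on data that stays local, and only the resulting model weights are shared and averaged.

```python
# Minimal federated averaging: clients train locally on private data and
# share only model weights; the coordinator averages them into a global model.
import numpy as np

rng = np.random.default_rng(1)
true_w = np.array([2.0, -1.0])

def make_local_dataset(n):
    """Simulate one institution's private dataset for a linear model."""
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    return X, y

def local_update(w, X, y, lr=0.1, steps=20):
    """A few gradient-descent steps on local data; raw data never leaves."""
    for _ in range(steps):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

clients = [make_local_dataset(200) for _ in range(5)]  # five data holders
w_global = np.zeros(2)

for _ in range(10):  # ten federation rounds
    local_weights = [local_update(w_global, X, y) for X, y in clients]
    w_global = np.mean(local_weights, axis=0)  # only weights are exchanged

print("federated estimate:", np.round(w_global, 3), "true weights:", true_w)
```

The point that matters for the article’s argument is the data flow: nothing but fitted parameters crosses institutional boundaries, which preserves control over clean datasets but, as noted above, demands coordination, aligned incentives, and real engineering effort from every participant.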

A Narrowing Window — Why Safeguarding AI’s Foundations Demands Urgent Action
Artificial intelligence may be built on complex algorithms and neural architectures, but its true foundation is far simpler — it’s data. Specifically, the vast, intricate, and often messy record of human communication, thought, and creativity. As this foundation grows increasingly polluted by AI-generated content, experts warn we may be sleepwalking toward a future where our most powerful technologies are disconnected from the human experience they’re meant to reflect. And unlike past tech crises that could be patched or reversed, this contamination may be uniquely difficult to undo.

The concern isn’t only about degraded chatbot performance or unreliable search results. It’s about the long-term epistemic stability of digital knowledge — what future AI systems “know,” how they reason, and whether they can still learn anything new about the world. If training data becomes a closed loop of synthetic mimicry, innovation will slow, models will stagnate, and trust in AI systems could erode in sectors where accuracy, fairness, and transparency are non-negotiable: medicine, law, science, journalism, and governance. The very systems we’re investing in to augment decision-making could become illusions of intelligence, trapped in feedback loops of their own design.
Maurice Chiodo and his co-authors are clear about what’s at stake: if we fail to preserve access to clean, diverse, human-originated data, then the future of AI becomes less about progress and more about containment. “If the government cares about long-term good, productive, competitive development of AI… then it should care very much about model collapse,” Chiodo warns. That means building legal, infrastructural, and ethical safeguards now — not when the signs of collapse become irreversible. But policy has been slow to act. While Europe’s AI Act signals some movement toward meaningful oversight, countries like the U.S. and U.K. continue to take a largely hands-off approach, concerned that regulation could stifle innovation. The risk, of course, is the opposite: that innovation itself will be stifled by a lack of foresight.