Reddit Limits Internet Archive in Effort to Stop AI Data Scraping

Reddit has announced that it will limit the Internet Archive’s Wayback Machine from accessing most of its site, in a move it says is aimed at stopping AI companies from scraping user content and protecting Redditor privacy.

The Wayback Machine, run by the nonprofit Internet Archive, is a well-known tool for viewing snapshots of web pages as they appeared on specific dates. Its mission is to preserve a digital record of the internet and other cultural artifacts.

However, Reddit now believes that not all of its content should be archived in this way, particularly posts and comments that have been deleted or removed for privacy, safety, or moderation reasons.

“Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content), we’re limiting some of their access to Reddit data to protect Redditors,” said Reddit spokesperson Tim Rathschmidt.

Why Reddit is Blocking Internet Archive Access

According to Reddit, the change comes after evidence emerged that AI companies had been using the Wayback Machine to scrape historical Reddit content, bypassing the company’s own restrictions and licensing agreements.

Reddit says it has previously raised concerns with the Internet Archive about this loophole.

Starting today, the new restrictions will “ramp up” gradually. The Wayback Machine will still be able to index Reddit’s homepage, but access to individual subreddits, threads, comments, and user profiles will be blocked.

Reddit says it reached out to the Internet Archive “in advance” to inform them before the limits took effect. The Internet Archive has not yet issued a public response, though its director of the Wayback Machine, Mark Graham, confirmed that discussions are ongoing.

This decision is part of a strategy by Reddit to control how its vast trove of user-generated content is used, especially in the age of AI:

In 2023, Reddit introduced controversial API changes that forced popular third-party apps to shut down, sparking widespread user protests. The company justified the move as necessary to stop AI model training without permission.
In 2024, it began blocking major search engines from crawling its site unless they paid for access.
It also struck licensing deals, including one with Google, for access to Reddit data for both search improvements and AI training.

Critics argue that limiting the Internet Archive will make it harder to preserve online history, removing an important tool for researchers, journalists, and the public.

Supporters say the move protects users from having deleted content resurfaced indefinitely and from having their words fed into commercial AI models without consent.

With Reddit’s IPO behind it and licensing revenue on the rise, the platform is signaling that its data is now a carefully guarded asset.