AI could spell the end of the Wayback Machine as news sites increasingly block it to prevent content scraping

A growing number of major news sites are blocking the Wayback Machine
Reportedly, 23 organizations now prevent their content from appearing in the archive
This is due to fears of the Wayback Machine being exploited for AI content scraping

The Wayback Machine is under serious threat (and not for the first time) as a growing number of major news sites appear to be blocking the archiving system.

If you’re not familiar with the Wayback Machine, it’s run by the non-profit Internet Archive and is essentially a time machine that preserves a history of the web (and more). This can be crucial when it comes to historical research, for example, or monitoring changes to websites.

As Wired reports (via 9to5Mac), there’s a growing trend for online news outlets to block the web crawler that the Internet Archive uses to gather its snapshots. According to Originality AI (which specializes in AI detection), 23 major news sites are now doing so.

That includes the New York Times (based on a Nieman Lab report) and USA Today, with Wired highlighting that the latter recently published a report on how US Immigration and Customs Enforcement delayed the release of key information about the impact of detention policies. This was a piece that used the Wayback Machine extensively in its research.

The irony of USA Today relying on this data while blocking the Wayback Machine from accessing its own content (an archive that could help keep the news site itself honest in the future) is not lost on Wayback Machine director Mark Graham.

Graham told Wired: “They are able to collect their historical research because the Wayback Machine exists. At the same time, they are blocking access.”

Of course, if more and more organizations start blocking the Wayback Machine, then its ability to keep a historical record of online content will be increasingly eroded.

Analysis: Blame AI (again)

So why is this happening? If you assumed it’s about readers using the Wayback Machine to bypass paywalls, it isn’t. Would it surprise you to learn that it’s actually about AI, in a roundabout way? It wouldn’t, of course, and predictably the Internet Archive seems to be caught up in the broad backlash against AI here.

What these news organizations say they object to is not a historical record of their content being maintained, but the fact that this archive can be used by third-party AI firms to train their models (LLMs).

As Wired points out, New York Times spokesman Graham James said, “The problem is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us.”

In short, the concern for these companies is that even if they block AI scrapers on their own sites, the scraping can still happen behind their backs via the Wayback Machine. It’s not just major news outlets that have these concerns either: social media platforms, notably Reddit, have blocked the Wayback Machine’s web crawler for exactly the same reason.

While there are other possible sources and ways to indirectly scrape news content, the Wayback Machine is the most obvious target for rogue AI operators since it maintains such an extensive library of web history.

So this is a complex issue, tangled up with AI scraping and a whole lot of legal gray areas. What’s clearly worrying is the effect on an important resource for keeping track of governments and media giants, and for holding them accountable for what was said in the past or, in some cases, what has been deleted from the web entirely.

Graham argues that: “There is no doubt that the general shutdown of more and more of the public web is affecting society’s ability to understand what is going on in our world.”

A petition entitled ‘Journalists applaud Internet Archive’s role in preserving the public record’ has been put together and sent off with over 100 signatures from working journalists. In the meantime, a dialogue is still ongoing between the Internet Archive and said news publishers, so the hope of finding a workable solution here is not yet lost.
