News Sites Crack Down on AI Crawlers

In recent months, generative artificial intelligence (AI) has exploded as more companies develop consumer-facing tools that help automate tasks, write documents, research topics, or even do basic coding.

OpenAI’s ChatGPT took the world by storm, reaching an estimated 100 million users within two months of its launch. Arguably as a result of that success, more consumers and companies now use generative AI models and tools in their daily tasks.

More generative AI tools are now available from key tech players as well as smaller startups, including Google’s conversational chatbot Bard, Microsoft’s Bing Chat, and OpenAI’s text-to-image generator DALL·E 2.

However, the rise of large language models (LLMs) and generative AI has ushered in new challenges, bringing copyright issues to light. This has resulted in pushback from news sites, publishers, and intellectual property holders who see their data being collected by AI crawlers. With no clear regulatory rules controlling AI’s use of copyrighted material yet, some of the world’s largest news websites have taken matters into their own hands.

According to recently presented data, more than a quarter (28%) of the world’s top 50 news sites have blocked AI crawlers from accessing their content, and the number continues to rise.

Notably, CNN, the New York Times, the Daily Mail, Reuters, and Bloomberg have all blocked at least one AI crawler.

Crackdown on AI crawlers

AI companies send crawlers to collect data to train their models and provide information for chatbots.

However, because their content is one of their core advantages, many of the world’s largest news websites have become extremely cautious, especially since there is generally no upside to handing that data over to AI crawlers, according to AltIndex.

Last month, OpenAI launched its GPTBot crawler to collect data to enhance its language models. This escalated the situation further, despite assurances that paywalled content would be excluded from collection. Several high-profile news sites, including CNN, Reuters, and the New York Times, have since blocked GPTBot.
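The article does not say how these sites block GPTBot. The usual mechanism, and one OpenAI documents for GPTBot, is a rule in the site's robots.txt file. The sketch below assumes that mechanism and uses Python's standard urllib.robotparser to check which crawlers a robots.txt file turns away. The file contents, the news.example address, and the is_blocked helper are hypothetical and are not taken from any site named here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary news site (not copied from a real site).
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://news.example/") -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Googlebot is included only as an example of a crawler the file does not name.
    for agent in ("GPTBot", "CCBot", "Googlebot"):
        status = "blocked" if is_blocked(ROBOTS_TXT, agent) else "allowed"
        print(f"{agent}: {status}")
```

Keep in mind that robots.txt is a voluntary convention: it only works if the crawler chooses to honor it, so a rule like this expresses a site's wishes rather than enforcing them.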

According to a Kirwan Digital Marketing Agency survey, 28% of the top 50 news sites worldwide had blocked at least one AI crawler by the end of last month.

The study reveals that OpenAI’s GPTBot is blocked by 22% of the top 50 news sites, with Bloomberg, Reuters, Business Insider, the Washington Post, the New York Times, and CNN as the top names on this list.

CCBot, Common Crawl’s crawler, is blocked about half as often as GPTBot, by 10% of the top 50 news sites. The survey further shows that ChatGPT is blocked only by the Washington Post, and AnthropicAI by only one website, NewsNow.
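To put the percentages in concrete terms, here is a minimal sketch that converts the reported shares into site counts out of the 50 sites surveyed. The percentages come from the survey figures quoted above; the share_to_count helper and the dictionary labels are just illustrative names.

```python
TOP_SITES = 50  # size of the sample cited in the survey

# Percentages as reported in the article.
reported_share = {
    "any AI crawler": 28,
    "GPTBot": 22,
    "CCBot": 10,
}


def share_to_count(percent: int, total: int = TOP_SITES) -> int:
    """Convert a whole-number percentage of `total` sites into a site count."""
    return round(percent * total / 100)


site_counts = {name: share_to_count(pct) for name, pct in reported_share.items()}

if __name__ == "__main__":
    for name, count in site_counts.items():
        print(f"{name}: {count} of {TOP_SITES} sites")
```

On these numbers, 14 of the 50 sites block at least one AI crawler, 11 block GPTBot, and 5 block CCBot.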

Overall, the New York Times, Washington Post, Reuters, and UK’s NewsNow lead in blocking AI crawlers from accessing their content, with each news site blocking two AI bots.
