News Companies Strengthen Efforts Against AI Scrapers

Africa-Press – Seychelles. SOME news publishers have embraced the rapid rise of AI by retooling strategies to chase citations instead of clicks and even reaching licensing agreements with AI companies to use their content.

However, others have maintained their conviction that AI Web scraping is a violation of copyright and a threat to journalism.

With new weapons emerging to defend their content from AI, some publishers are choosing to fight back.

The current state of AI Web scraping

Despite licensing content from willing publishers, AI companies continue to “scrape” other Web content without permission.

Since the 1990s, that permission has been relayed by a Web site’s robots.txt file — the gatekeeper informing hungry Web site crawlers what content is fair game or off-limits.

But the robots.txt file is more of a courteous suggestion than an enforceable boundary.

Bryan Becker of Human Security offered further explanation to Press Gazette: “Robots.txt has no enforcement mechanism. It’s a sign that says ‘please do not come in if you’re one of these things’ and there’s nothing there to stop you. It’s just always been a standard of the Internet to respect it.

“Companies the size of Google, they respect it because they have the eyes of the world on them, but if you’re just building a scraper, it’d almost be more work for you to respect it than to ignore it because you’d have to make extra code to check it.”
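
To illustrate Becker’s point, here is roughly what that “extra code” looks like for a polite crawler, using Python’s standard urllib.robotparser. The site address and crawler name below are placeholders, and nothing technically stops a scraper from simply skipping this check:

```python
# A minimal sketch of the "extra code" a polite crawler needs, using only
# Python's standard library. A scraper that skips this check faces no
# technical barrier at all -- robots.txt is honoured purely by convention.
from urllib import robotparser
import urllib.request

SITE = "https://example.com"          # placeholder site
USER_AGENT = "ExampleNewsBot/1.0"     # hypothetical crawler name

rp = robotparser.RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()                             # fetch and parse the site's robots.txt

url = f"{SITE}/articles/some-story"   # placeholder page
if rp.can_fetch(USER_AGENT, url):
    # Only fetch pages the file marks as fair game for this user agent.
    with urllib.request.urlopen(url) as resp:
        html = resp.read()
else:
    print(f"robots.txt asks {USER_AGENT} not to fetch {url}; skipping.")
```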

The rise of third-party Web scrapers

Publishers opting to block AI companies from visiting their Web sites altogether have only given rise to third-party content-scrapers, “which openly boast about how they can get through paywalls, effectively steal content to order, allowing AI companies to answer ‘live’ news queries with stolen information from publishers,” the Press Gazette article noted.

Press Gazette cites ample evidence of third-party bots scraping reputable sources — such as AI search engine Perplexity successfully replicating a Wired article despite the site’s robots.txt file disallowing crawlers (Perplexity later updated its policies to respect robots.txt).

Press Gazette itself was able to use third-party scrapers to access paywalled content on the Financial Times Web site.

Additionally, “in off-the-record conversations with major newspaper publishers in the UK, experts confirmed that third-party scrapers are an increasing issue,” according to the article.

How AI Web scraping hurts publishers

The toll AI is taking on publishers is significant — and measurable.

For some, it’s a matter of declining Web traffic. Toshit Panigrahi, CEO of Tollbit, told Press Gazette that a popular sports Web site had “13 million crawlers from AI companies” that resulted in “just 600 site visits”.

For others, it’s rising bandwidth. ComputerWorld reported Wikipedia experienced “a 50% increase in the bandwidth consumed since January 2024,” a jump the Wikimedia Foundation attributes to “automated programmes that scrape the Wikimedia Commons image catalogue of openly licensed images to feed images to AI models.”

This sizable increase in bandwidth has forced Wikipedia’s Site Reliability team into a state of perpetual war against AI scrapers.

The mounting resistance against AI scrapers

The Internet Engineering Task Force (IETF)’s AI Preferences Working Group (AIPREF) is one of the largest and most influential allies in the AI resistance.

As reported by ComputerWorld, the principal objective of AIPREF is to “contain AI scrapers” through two interrelated mechanisms:

Clear preferences: First, AIPREF seeks to establish “a common vocabulary to express authors’ and publishers’ preferences regarding use of their content for AI training and related tasks.”

New and improved boundaries: Then, AIPREF “will develop a ‘means of attaching that vocabulary to content on the Internet, either by embedding it in the content or by formats similar to robots.txt, and a standard mechanism to reconcile multiple expressions of preferences’.”

The ultimate idea here is to transform the “please don’t” of current robots.txt files into a “this is forbidden” hard line, giving publishers a clear say in what AI can and cannot mine for content.

However, without real regulation, legal repercussions, or a way of enforcing these restrictions, the best AIPREF can do is help publishers state their preferences clearly and hope AI companies respect those explicit wishes.

Gloves off

But for those on the frontlines of the fight, the promise of new protocols and a hope for AI compliance is not enough. Increasingly, publishers are fighting back with emerging countermeasures.

AI tarpits: A cybersecurity tactic known as tarpitting has been drafted into the fight against AI by tenacious developers.

One such developer explained to Ars Technica how his AI tarpit, Nepenthes, works by “trapping AI crawlers and sending them down an ‘infinite maze’ of static files with no exit links, where they ‘get stuck’ and ‘thrash around’ for months.” Satisfying though this tactic may be, ComputerWorld warns that sophisticated AI scrapers can successfully avoid tarpits and, worse, “even when they work, tarpits also risk consuming host processor resources.”
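
Nepenthes’ own code isn’t shown in the article, but the general idea can be sketched in a few lines: every URL returns a slow page whose only links lead deeper into the same maze. Below is a minimal illustration using only Python’s standard library; the port, delay and routing are placeholder assumptions, not Nepenthes itself:

```python
# Rough sketch of the general tarpit idea (not Nepenthes itself): every URL
# returns a slow page whose only links lead deeper into the same maze, so a
# crawler that follows them never reaches real content or an exit.
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        time.sleep(5)  # drip-feed responses to waste the crawler's time
        # Generate a handful of links pointing further into the maze.
        links = "".join(
            f'<a href="/maze/{random.getrandbits(64):x}">more</a> '
            for _ in range(5)
        )
        body = f"<html><body><p>Nothing to see here.</p>{links}</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    # Assumes the real site routes only suspected bot traffic to this server.
    HTTPServer(("0.0.0.0", 8080), TarpitHandler).serve_forever()
```

As the ComputerWorld warning above suggests, even this trivial version ties up a connection and host resources for every trapped crawler, which is why the tactic is a trade-off rather than a free win.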

Poisoning: If you do have the spare processing power to set up a successful tarpit, it can afford a rare opportunity to go on the offensive.

As explained in the Ars Technica article, trapped AI scrapers “can be fed gibberish data, aka Markov babble, which is designed to poison AI models.”
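
The article doesn’t detail how Nepenthes generates its babble, but “Markov babble” generally means text sampled from a simple Markov chain: statistically plausible word sequences that carry no real meaning. A minimal illustration in Python (the seed text is an arbitrary placeholder):

```python
# Minimal illustration of "Markov babble": a word-level Markov chain built
# from seed text, then sampled to produce plausible-looking nonsense that
# carries no real information for a model trained on it.
import random
from collections import defaultdict

def build_chain(text: str) -> dict:
    """Map each word to the list of words observed to follow it."""
    words = text.split()
    chain = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        chain[current].append(nxt)
    return chain

def babble(chain: dict, length: int = 50) -> str:
    """Random-walk the chain to emit `length` words of gibberish."""
    word = random.choice(list(chain))
    out = [word]
    for _ in range(length - 1):
        followers = chain.get(word)
        if not followers:                  # dead end: restart anywhere
            word = random.choice(list(chain))
        else:
            word = random.choice(followers)
        out.append(word)
    return " ".join(out)

seed = "the quick brown fox jumps over the lazy dog while the dog sleeps"
print(babble(build_chain(seed)))
```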

Proof of work: Another emerging defence against AI is proof-of-work challenges like Anubis, described by The Register as “a sort of CAPTCHA test, but flipped: Instead of checking visitors are human, it aims to make Web crawling prohibitively expensive for companies trying to feed their hungry LLM (large language model) bots.”

A single human visitor only briefly sees the Anubis mascot while their browser completes a cryptographic proof-of-work challenge.

For AI companies deploying hordes of content scraping bots, these computations can require “… a whole datacentre spinning up to full power. In theory, when scanning a site is so intensive, the spider backs off.”
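
Anubis’s actual implementation isn’t described here, but the underlying cost asymmetry can be sketched with a hashcash-style puzzle: the server issues a random challenge, the visitor must find a nonce whose SHA-256 hash falls below a difficulty target, and the server verifies the answer with a single hash. A rough Python sketch, with an arbitrary difficulty value chosen for illustration:

```python
# Hashcash-style sketch of the proof-of-work idea behind tools like Anubis
# (not Anubis's actual code): the server hands out a random challenge, and the
# visitor must find a nonce whose SHA-256 hash has a set number of leading
# zero bits before being served the page. Cheap once per human visitor,
# costly millions of times over for a fleet of scraping bots.
import hashlib
import os

DIFFICULTY_BITS = 20  # arbitrary; tune to change the cost per request

def issue_challenge() -> str:
    return os.urandom(16).hex()

def solve(challenge: str) -> int:
    """Client side: brute-force a nonce meeting the difficulty target."""
    target = 1 << (256 - DIFFICULTY_BITS)
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce
        nonce += 1

def verify(challenge: str, nonce: int) -> bool:
    """Server side: a single hash confirms the work was done."""
    digest = hashlib.sha256(f"{challenge}:{nonce}".encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - DIFFICULTY_BITS))

challenge = issue_challenge()
nonce = solve(challenge)   # in Anubis-style tools this runs in the browser
assert verify(challenge, nonce)
print(f"solved challenge {challenge} with nonce {nonce}")
```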

Cloudflare strikes back

One of the most recent and significant blows against AI scrapers has been landed by Cloudflare, one of the Internet’s leading infrastructure providers.

Inundated with clients struggling to protect their Web sites from AI scrapers, Cloudflare has reversed its original “opt-out” model and now automatically blocks AI bots.

Press Gazette reported Cloudflare’s decision was “backed by more than a dozen major news and media publishers including the Associated Press, The Atlantic, BuzzFeed, Condé Nast, DMGT, Dotdash Meredith, Fortune, Gannett, The Independent, Sky News, Time and Ziff Davis.”
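
Cloudflare’s managed blocking is proprietary and far more sophisticated, but its simplest layer can be sketched as a user-agent check against crawlers that identify themselves, such as GPTBot, CCBot and ClaudeBot; bots that disguise their user agent need the behavioural detection layered on top. A toy illustration (the handler and block list below are assumptions for the sketch, not Cloudflare’s actual rules):

```python
# Rough sketch of the simplest layer of AI-bot blocking (real services are far
# more sophisticated): refuse any request whose User-Agent names a known AI
# crawler. Bots that disguise their user agent require behavioural detection
# on top of a check like this.
AI_CRAWLER_TOKENS = (
    "GPTBot", "CCBot", "ClaudeBot", "PerplexityBot", "Bytespider",
)  # self-identifying crawlers; real block lists are longer and change often

def is_ai_crawler(user_agent: str) -> bool:
    """Return True if the request's User-Agent names a known AI crawler."""
    return any(token.lower() in user_agent.lower() for token in AI_CRAWLER_TOKENS)

def handle_request(user_agent: str, page_html: str) -> tuple[int, str]:
    """Toy request handler: 403 for AI crawlers, 200 for everyone else."""
    if is_ai_crawler(user_agent):
        return 403, "AI crawling is not permitted on this site."
    return 200, page_html

# Example: OpenAI's crawler announces itself in its User-Agent string.
status, body = handle_request(
    "Mozilla/5.0 AppleWebKit/537.36 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
    "<html><body>Regular article content.</body></html>",
)
print(status)  # 403
```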

Cloudflare is also offering a more aggressive approach called AI Labyrinth, a tarpit/poisoning-inspired tool designed to ensnare AI scrapers.

Citing a Cloudflare blog post, The Verge explained that, when Labyrinth “detects ‘inappropriate bot behaviour’, the free, opt-in tool lures crawlers down a path of links to AI-generated decoy pages that ‘slow down, confuse, and waste the resources’ of those acting in bad faith.”

Can AI Web scrapers really be stopped?

The age of publishers watching passively as AI bots scrape their content is over.

Some, like The Guardian and The Wall Street Journal, are striking deals and throwing open the gates to AI.

Others are communicating firm boundaries, setting technical traps, and collaborating with like-minded leaders to develop effective defences.

Whether AI Web scrapers can be brought to heel remains to be seen, but for the publishers resisting the AI takeover, it’s clear the fight must continue if they want to retain control over their content.
