
Latest news with #AILabyrinth

A new, 'diabolical' way to thwart Big Tech's data-sucking AI bots: Feed them gibberish

Business Insider

01-05-2025

• Bots now generate more internet traffic than humans, according to cybersecurity firm Thales.
• This is being driven by web crawlers from tech giants that harvest data for AI model training.
• Cloudflare's AI Labyrinth misleads and exhausts bots with fake content.

A data point caught my eye recently. Bots generate more internet traffic to websites than humans now, according to cybersecurity company Thales. This is being driven by a swarm of web crawlers unleashed by Big Tech companies and AI labs, including Google, OpenAI, and Anthropic, that slurp up copyrighted content for free.

I've warned about these automated scrapers before. They're increasingly sophisticated and persistent in their quest to harvest information to feed the insatiable demand for AI training datasets. Not only do these bots take data without permission or payment, but they're also causing traffic surges in some parts of the internet, increasing costs for website owners and content creators.

Thankfully, there's a new way to thwart this bot swarm. If you're struggling to block them entirely, you can send them down digital rabbit holes where they ingest content garbage. One software developer recently called this "diabolical" — in a good way.

"Absolutely diabolical Cloudflare feature. love to see it" — hibakod (@hibakod), April 25, 2025

It's called AI Labyrinth, and it's a tool from Cloudflare. Described as a "new mitigation approach," AI Labyrinth uses generative AI not to inform, but to mislead. When Cloudflare detects unauthorized activity, typically from bots ignoring "no crawl" directives, it deploys a trap: a maze of convincingly real but irrelevant AI-generated content designed to waste bots' time and chew through AI companies' computing power.

Cloudflare pledged in a recent announcement that this is only the first iteration of using generative AI to thwart bots.

Digital gibberish

Unlike traditional honeypots, AI Labyrinth creates entire networks of linked pages that are invisible to humans but highly attractive to bots. These decoy pages don't affect search engine optimization and aren't indexed by search engines. They are specifically tailored to bots, which get ensnared in a meaningless loop of digital gibberish.

When bots follow the maze deeper, they inadvertently reveal their behavior, allowing Cloudflare to fingerprint and catalog them. These data points feed directly into Cloudflare's evolving machine learning models, strengthening future detection for customers.
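Cloudflare hasn't published AI Labyrinth's internals, but the behavior described above (machine-generated decoy pages, links that lead ever deeper, noindex markers, and logging whichever clients follow them) can be sketched in a few lines. The sketch below is a hypothetical illustration using only Python's standard library, not Cloudflare's code: the word list, URL scheme, and trap-everyone handler are all invented for demonstration.

```python
# Hypothetical sketch of a decoy-maze server; not Cloudflare's implementation.
import hashlib
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

WORDS = ("lattice", "archive", "vector", "kernel", "orchard",
         "meridian", "placid", "quorum", "sable", "tundra")

PAGE = """<!doctype html>
<html><head><meta name="robots" content="noindex, nofollow"></head>
<body><p>{text}</p>{links}</body></html>"""

def decoy_page(path: str) -> str:
    # Seed a PRNG from the URL so every path yields a stable page on
    # demand: the maze is effectively endless but nothing is stored.
    rng = random.Random(hashlib.sha256(path.encode()).digest())
    text = " ".join(rng.choice(WORDS) for _ in range(120))
    # Every page links to five deeper decoy URLs, so a crawler that
    # follows links keeps descending instead of reaching real content.
    links = " ".join(
        f'<a href="{path.rstrip("/")}/{rng.randrange(10**6)}">more</a>'
        for _ in range(5)
    )
    return PAGE.format(text=text, links=links)

class MazeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A real deployment would first decide whether the client is a
        # misbehaving bot; this demo traps every visitor. Logging which
        # clients crawl the maze is the raw material for fingerprinting.
        print(f"maze hit: {self.path} by {self.headers.get('User-Agent')}")
        body = decoy_page(self.path).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), MazeHandler).serve_forever()
```

Because each page is seeded from its own URL, any path a bot requests yields a stable page of gibberish with five more links to follow, with no server-side storage at all.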
Will Allen, VP of Product at Cloudflare, told me that more than 800,000 domains have fired up the company's general AI bot-blocking tool. AI Labyrinth is the next weapon to wield when sneaky AI companies get around blockers. Cloudflare hasn't released data on how many customers use AI Labyrinth, which suggests it's too early for major adoption. "It's still very new, so we haven't released that particular data point yet," Allen said.

I asked him why AI bots are still so active if most of the internet's data has already been scraped for model training. "New content," Allen replied. "If I search for 'what are the best restaurants in San Francisco,' showing high-quality content from the past week is much better than information from a year or two prior that might be out of date."

Turning AI against itself

Bots aren't just scraping old blog posts; they're hungry for the freshest data to keep AI outputs relevant. Cloudflare's strategy flips this demand on its head. Instead of serving up valuable new content to unauthorized scrapers, it offers them an endless buffet of synthetic articles, each more irrelevant than the last.

As AI scrapers become more common, innovative defenses like AI Labyrinth are becoming essential. By turning AI against itself, Cloudflare has introduced a clever layer of defense that doesn't just block bad actors but exhausts them. For web admins, enabling AI Labyrinth is as easy as toggling a switch in the Cloudflare dashboard. It's a small step that could make a big difference in protecting original content from unauthorized exploitation in the age of AI.
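The article only describes the dashboard toggle. For admins who script their configuration, a hedged sketch of what the equivalent API call might look like is below. Cloudflare's v4 API does expose a zones/{zone_id}/bot_management endpoint, but the ai_labyrinth field name used here is an assumption, not something the article or Cloudflare's docs confirm, so check the current API reference before relying on it.

```python
# Hypothetical sketch of enabling AI Labyrinth via Cloudflare's v4 API.
# The bot_management endpoint exists, but the "ai_labyrinth" field name
# below is an assumption; verify against Cloudflare's API reference.
import os
import requests

zone_id = os.environ["CLOUDFLARE_ZONE_ID"]
token = os.environ["CLOUDFLARE_API_TOKEN"]

resp = requests.put(
    f"https://api.cloudflare.com/client/v4/zones/{zone_id}/bot_management",
    headers={"Authorization": f"Bearer {token}"},
    json={"ai_labyrinth": True},  # assumed field name
    timeout=10,
)
resp.raise_for_status()
print(resp.json())
```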

AI crawlers cause Wikimedia Commons bandwidth demands to surge 50%

Yahoo

02-04-2025

The Wikimedia Foundation, the umbrella organization of Wikipedia and a dozen or so other crowdsourced knowledge projects, said on Wednesday that bandwidth consumption for multimedia downloads from Wikimedia Commons has surged by 50% since January 2024. The reason, the outfit wrote in a blog post Tuesday, isn't growing demand from knowledge-thirsty humans, but automated, data-hungry scrapers looking to train AI models.

"Our infrastructure is built to sustain sudden traffic spikes from humans during high-interest events, but the amount of traffic generated by scraper bots is unprecedented and presents growing risks and costs," the post reads.

Wikimedia Commons is a freely accessible repository of images, videos, and audio files that are available under open licenses or are otherwise in the public domain.

Digging down, Wikimedia says that almost two-thirds (65%) of the most "expensive" traffic, meaning the most resource-intensive in terms of the kind of content consumed, was from bots. However, just 35% of overall pageviews come from these bots. The reason for this disparity, according to Wikimedia, is that frequently accessed content stays closer to the user in its cache, while less frequently accessed content is stored further away in the "core data center," which is more expensive to serve content from. This is the kind of content that bots typically go looking for.

"While human readers tend to focus on specific – often similar – topics, crawler bots tend to 'bulk read' larger numbers of pages and visit also the less popular pages," Wikimedia writes. "This means these types of requests are more likely to get forwarded to the core datacenter, which makes it much more expensive in terms of consumption of our resources."

The long and short of all this is that the Wikimedia Foundation's site reliability team has to spend a lot of time and resources blocking crawlers to avert disruption for regular users, and all of this before considering the cloud costs the Foundation faces.

In truth, this represents part of a fast-growing trend that threatens the very existence of the open internet. Last month, software engineer and open source advocate Drew DeVault bemoaned the fact that AI crawlers ignore "robots.txt" files that are designed to ward off automated traffic. And "pragmatic engineer" Gergely Orosz also complained last week that AI scrapers from companies such as Meta have driven up bandwidth demands for his own projects.

While open source infrastructure, in particular, is in the firing line, developers are fighting back with "cleverness and vengeance," as TechCrunch wrote last week. Some tech companies are doing their bit to address the issue, too: Cloudflare, for example, recently launched AI Labyrinth, which uses AI-generated content to slow crawlers down. However, it's very much a cat-and-mouse game that could ultimately force many publishers to duck for cover behind logins and paywalls, to the detriment of everyone who uses the web today.
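Wikimedia's explanation of why bots cost so much more than their pageview share suggests is easy to demonstrate. The toy simulation below invents all of its numbers and distributions purely for illustration: it compares a human-like request pattern concentrated on popular pages against a crawler that bulk-reads the long tail, and reports how often each hits a small LRU cache sitting in front of an expensive backend.

```python
# Toy illustration of the cache argument above: a small "edge cache"
# serves popular pages cheaply; misses fall through to the costly
# "core data center". All parameters here are made up for demonstration.
import random
from collections import OrderedDict

CATALOG = 100_000   # distinct pages on the site
CACHE_SIZE = 1_000  # the cache only holds the hottest 1%

def hit_rate(requests):
    cache = OrderedDict()
    hits = 0
    for page in requests:
        if page in cache:
            hits += 1
            cache.move_to_end(page)   # LRU: refresh on hit
        else:
            cache[page] = True        # miss: fetch from the core data center
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)
    return hits / len(requests)

rng = random.Random(42)
# Humans: interest concentrated on a small set of popular pages
# (heavy-tailed Pareto draws cluster on low page IDs).
humans = [int(rng.paretovariate(1.2)) % CATALOG for _ in range(50_000)]
# Crawler: "bulk reads" pages uniformly, popular or not.
crawler = [rng.randrange(CATALOG) for _ in range(50_000)]

print(f"human cache hit rate:   {hit_rate(humans):.0%}")
print(f"crawler cache hit rate: {hit_rate(crawler):.0%}")
```

With these invented parameters, the human-like pattern hits the cache most of the time while the crawler almost always falls through to the backend, mirroring the disparity Wikimedia describes: a minority of pageviews producing a majority of the expensive traffic.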
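For context on DeVault's complaint: robots.txt is a plain text file served at a site's root that asks crawlers to keep out. A typical file aimed at AI crawlers looks like the snippet below; the user-agent tokens shown are ones the respective companies publicly document (GPTBot for OpenAI, ClaudeBot for Anthropic, Google-Extended for Google's AI training, CCBot for Common Crawl). The catch, and the reason DeVault was bemoaning it, is that compliance is entirely voluntary.

```
# robots.txt served at https://example.com/robots.txt
# Compliance with these directives is voluntary on the crawler's part.
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /
```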
