Latest news with #PerplexityBot

Business Insider
2 days ago
- Business
- Business Insider
An AI data trap catches Perplexity impersonating Google
If you want to succeed in AI, a good hack would be to impersonate Google. You just can't get caught.
This is what just happened to Perplexity, a startup that competes with ChatGPT, Google's Gemini, and other generative AI services.
Quality data is crucial for success in AI, but tech companies don't want to pay for it, so they crawl the web and scrape information for free, often without permission. This has sparked a backlash from some content creators and others interested in preserving the incentives that built the web.
Cloudflare and its CEO, Matthew Prince, have stormed into this battle with new features that help websites block unwanted AI bot crawlers. Cloudflare is an infrastructure, security, and software company that helps run about 20% of the internet. It thrives when the web does well, hence its interest in helping sites get paid for content.
Some Cloudflare customers recently complained to the company that Perplexity was evading these blocks and continuing to scrape and collect data without permission. So Cloudflare set a digital trap and caught the startup red-handed, according to a Monday blog post describing the escapade.
"Some supposedly 'reputable' AI companies act more like North Korean hackers," Prince wrote on X on Monday. "Time to name, shame, and hard block them."
Perplexity didn't respond to a request for comment.
The bait: Honeytrap domains and locked doors
Cloudflare created entirely new, unpublished websites and configured them with robots.txt files that explicitly blocked all crawlers, including Perplexity's declared bots, PerplexityBot and Perplexity-User. These test sites had no public links, search engine entries, or metadata that would normally make them discoverable.
Yet when Cloudflare queried Perplexity's AI with questions about these specific sites, the startup's service responded with detailed information that could only have come from those restricted pages. The conclusion? Perplexity had accessed the content despite being clearly told not to.
The cloak: How Perplexity masked its crawl
Perplexity initially crawled these sites using its official user-agent string, complying with standard protocols. However, Cloudflare said it discovered that once blocked, Perplexity resorted to stealth tactics.
Cloudflare found that Perplexity began deploying undeclared crawlers disguised as normal web browsers and sending requests from unknown or rotated IP addresses and unofficial autonomous system numbers (ASNs), which are crucial identifiers that help route internet traffic efficiently.
When its official crawlers were blocked, Perplexity also used a generic web browser designed to impersonate Google's Chrome browser on Apple Mac computers. (Business Insider asked Google whether it has told Perplexity to stop impersonating Chrome. Google did not respond.)
According to Cloudflare, Perplexity has been making millions of such "stealth" requests daily across tens of thousands of web domains. This behavior not only violates web standards but also betrays the fundamental trust that underpins the functioning of the open web, Cloudflare explained.
The comparison: How OpenAI gets it right
To emphasize what good bot behavior looks like, Cloudflare compared Perplexity's conduct to that of OpenAI's crawlers, which scrape data for developing ChatGPT and giant AI models such as the upcoming GPT-5.
When OpenAI's bots encountered a robots.txt file or a similar block, they simply backed off. No circumvention. No masking. No backdoor crawling, according to Cloudflare's tests.
The fallout: De-verification and blocking
As a result of these findings, Cloudflare has de-listed Perplexity as a verified bot and rolled out new detection and blocking techniques across its network.
Cloudflare's takedown serves as a cautionary tale in the AI arms race. As the web shifts toward stronger control over data access and usage, actors who flout these evolving norms may find themselves not just blocked, but publicly called out.
In an era where AI systems are hungry for training data, Cloudflare's sting operation is a signal to startups and established players alike: Respect the rules of the web, or risk being exposed.

Engadget
3 days ago
- Engadget
Perplexity is allegedly scraping websites it's not supposed to, again
Web crawlers deployed by Perplexity to scrape websites are allegedly skirting restrictions, according to a new report from Cloudflare. Specifically, the report claims that the company's bots appear to be "stealth crawling" sites by disguising their identity to get around robots.txt files and firewalls.
A robots.txt file is a simple file websites host that lets web crawlers know whether they can scrape a website's content. Perplexity's official web crawling bots are "PerplexityBot" and "Perplexity-User." In Cloudflare's tests, Perplexity was still able to display the content of a new, unindexed website, even when those specific bots were blocked by robots.txt. The behavior extended to websites with specific Web Application Firewall (WAF) rules that restricted web crawlers, as well.
Cloudflare believes that Perplexity is getting around those obstacles by using "a generic browser intended to impersonate Google Chrome on macOS" when robots.txt prohibits its normal bots. In Cloudflare's tests, the company's undeclared crawler could also rotate through IP addresses not listed in Perplexity's official IP range to get through firewalls. Cloudflare says that Perplexity appears to be doing the same thing with autonomous system numbers (ASNs), identifiers for groups of IP addresses operated by the same business, writing that it spotted the crawler switching ASNs "across tens of thousands of domains and millions of requests per day."
Engadget has reached out to Perplexity for comment on Cloudflare's report. We'll update this article if we hear back.
Up-to-date information from websites is vital to companies training AI models, especially as services like Perplexity are used as replacements for search engines. Perplexity has also been caught in the past circumventing the rules to stay up to date. Multiple websites reported in 2024 that Perplexity was still accessing their content despite them forbidding it in robots.txt, something the company blamed on the third-party web crawlers it was using at the time. Perplexity later partnered with multiple publishers to share revenue earned from ads displayed alongside their content, seemingly as a make-good for its past behavior.
Stopping companies from scraping content from the web will likely remain a game of whack-a-mole. In the meantime, Cloudflare has removed Perplexity's bots from its list of verified bots and implemented a way to identify and block Perplexity's stealth crawler from accessing its customers' content.
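For readers unfamiliar with how a robots.txt block works in practice, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the `is_allowed` helper, and the example.com URL are illustrative assumptions, not the files Cloudflare used in its tests. The sketch shows that the rules are keyed to the name a crawler declares for itself, so they only bind crawlers that identify themselves honestly.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks Perplexity's two declared crawlers
# and allows everyone else. Not taken from Cloudflare's test sites.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative browser-style user agent (Chrome on macOS).
CHROME_ON_MACOS = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/article"
    for agent in ("PerplexityBot", "Perplexity-User", CHROME_ON_MACOS):
        print(agent[:20], is_allowed(ROBOTS_TXT, agent, page))
```

A well-behaved crawler runs a check like this before each fetch and backs off when the answer is no. Nothing in the file itself enforces the answer, which is why sites also rely on firewall rules.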


The Verge
3 days ago
- Business
- The Verge
Cloudflare says Perplexity's AI bots are 'stealth crawling' blocked sites
The AI search startup Perplexity is allegedly skirting restrictions meant to stop its AI web crawlers from accessing certain websites, according to a report from Cloudflare. In the report, Cloudflare claims that when Perplexity encounters a block, the startup will conceal its crawling identity 'in an attempt to circumvent the website's preferences.'
The report only adds to concerns about Perplexity vacuuming up content without permission, as the company got caught barging past paywalls and ignoring sites' robots.txt files last year. At the time, Perplexity CEO Aravind Srinivas blamed the activity on third-party crawlers used by the site.
Now, Cloudflare, one of the world's biggest internet architecture providers, says it received complaints from customers who claimed that Perplexity's bots still had access to their websites even after they stated their preferences in their robots.txt files and created Web Application Firewall (WAF) rules to restrict access to the startup's AI bots.
To test this, Cloudflare says it created new domains with similar restrictions against Perplexity's AI scrapers. It found that the startup will first attempt to access the sites by identifying itself with the names of its crawlers: 'PerplexityBot' or 'Perplexity-User.' But if the website has restrictions against AI scraping, Cloudflare claims Perplexity will change its user agent (the bit of information that tells a website what kind of browser and device you're using, or if the visitor is a bot) to 'impersonate Google Chrome on macOS.'
Cloudflare says this 'undeclared crawler' uses 'rotating' IP addresses that the company doesn't include on the list of IP addresses used by its bots. Additionally, Cloudflare claims that Perplexity changes its autonomous system number (ASN), a number used to identify groups of IP networks controlled by a single operator, to get around blocks as well. 'This activity was observed across tens of thousands of domains and millions of requests per day,' Cloudflare writes.
In a statement to The Verge, Perplexity spokesperson Jesse Dwyer called Cloudflare's report a 'publicity stunt,' adding that 'there are a lot of misunderstandings in the blog post.'
Cloudflare has since de-listed Perplexity as a verified bot and has rolled out methods to block Perplexity's 'stealth crawling.' Cloudflare CEO Matthew Prince has been outspoken about AI's 'existential threat' to publishers. Last month, the company started letting websites ask AI companies to pay to crawl their content, and began blocking AI crawlers by default.
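To illustrate why a changed user agent combined with rotating IP addresses is hard to block, here is a rough sketch of the kind of naive check a site might run. It is not Cloudflare's detection method. The `classify_request` helper, its labels, and the address range are hypothetical; the range is a reserved documentation block, not Perplexity's published list.

```python
import ipaddress

# Names of the crawlers Perplexity declares, as reported in the article.
DECLARED_TOKENS = ("PerplexityBot", "Perplexity-User")

# Placeholder range from the reserved documentation blocks.
# An assumption for illustration, NOT a real published crawler range.
PUBLISHED_RANGES = [ipaddress.ip_network("192.0.2.0/24")]


def classify_request(user_agent: str, client_ip: str) -> str:
    """Label a request using only its user agent and source address."""
    ip = ipaddress.ip_address(client_ip)
    declares = any(t.lower() in user_agent.lower() for t in DECLARED_TOKENS)
    in_range = any(ip in net for net in PUBLISHED_RANGES)
    if declares and in_range:
        return "declared-crawler"   # name and address both match
    if declares:
        return "unverified-claim"   # name matches, address does not
    return "unidentified"           # looks like any other visitor


if __name__ == "__main__":
    chrome = (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    )
    print(classify_request("PerplexityBot/1.0", "192.0.2.10"))
    print(classify_request("PerplexityBot/1.0", "198.51.100.7"))
    print(classify_request(chrome, "198.51.100.7"))
```

A request that presents a browser user agent from an address outside the published range falls into the last bucket, alongside ordinary human visitors. A rule that blocks only declared names and published addresses would let it through.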