Perplexity is allegedly scraping websites it's not supposed to, again

Engadget, 3 days ago
Web crawlers deployed by Perplexity to scrape websites are allegedly skirting restrictions, according to a new report from Cloudflare. Specifically, the report claims that the company's bots appear to be "stealth crawling" sites by disguising their identity to get around robots.txt files and firewalls.
Robots.txt is a simple file that websites host to tell web crawlers whether they may scrape the site's content. Perplexity's official web crawling bots are "PerplexityBot" and "Perplexity-User." In Cloudflare's tests, Perplexity was still able to display the content of a new, unindexed website, even when those specific bots were blocked by robots.txt. The behavior also extended to websites with specific Web Application Firewall (WAF) rules that restricted web crawlers.
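As a rough illustration of how a robots.txt block applies only to the crawler names it lists, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt contents, the example.com URL and the browser-style user agent string are assumptions made up for the example; they are not taken from Cloudflare's report.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks Perplexity's two declared crawlers
# and allows every other user agent.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: Perplexity-User
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative browser-like user agent (not a string quoted in the report).
BROWSER_UA = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) Chrome/124.0"


def allowed_agents(robots_txt: str, url: str, agents: list[str]) -> dict[str, bool]:
    """Return, for each user agent, whether robots.txt permits fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


result = allowed_agents(
    ROBOTS_TXT,
    "https://example.com/article",
    ["PerplexityBot", "Perplexity-User", BROWSER_UA],
)
for agent, allowed in result.items():
    print(f"{agent!r}: {'allowed' if allowed else 'blocked'}")
```

The rules match on the name a crawler declares, and compliance is voluntary, so a request that presents itself as an ordinary browser is not covered by the two blocking entries.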
Cloudflare believes that Perplexity is getting around those obstacles by using "a generic browser intended to impersonate Google Chrome on macOS" when robots.txt prohibits its normal bots. In Cloudflare's tests, the company's undeclared crawler could also rotate through IP addresses not listed in Perplexity's official IP range to get through firewalls. Cloudflare says that Perplexity appears to be doing the same thing with autonomous system numbers (ASNs), identifiers for IP addresses operated by the same business, writing that it spotted the crawler switching ASNs "across tens of thousands of domains and millions of requests per day."
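To see why an undeclared user agent combined with unlisted IP addresses is hard to filter, consider a simple check that a site operator might run. This is a sketch under stated assumptions, not Cloudflare's detection method: the function name and labels are hypothetical, and the IP ranges are reserved documentation networks standing in for a bot operator's published range.

```python
from ipaddress import ip_address, ip_network

# Hypothetical stand-ins for a published bot IP range. These are reserved
# documentation networks, NOT Perplexity's real addresses.
PUBLISHED_RANGES = [ip_network("192.0.2.0/24"), ip_network("198.51.100.0/24")]

# Lowercased names of the declared crawlers mentioned in the article.
DECLARED_AGENTS = ("perplexitybot", "perplexity-user")


def classify_request(user_agent: str, remote_ip: str) -> str:
    """Label a request by its claimed bot identity and its source address."""
    claims_declared = any(name in user_agent.lower() for name in DECLARED_AGENTS)
    in_range = any(ip_address(remote_ip) in net for net in PUBLISHED_RANGES)
    if claims_declared and in_range:
        return "declared-bot"      # name and address both match
    if claims_declared:
        return "unverified-claim"  # bot name, but address is not in the range
    return "unidentified"          # looks like ordinary browser traffic


print(classify_request("Mozilla/5.0 (compatible; PerplexityBot/1.0)", "192.0.2.10"))
print(classify_request("Mozilla/5.0 (Macintosh) Chrome/124.0", "203.0.113.5"))
```

A check like this can only confirm or reject requests that announce a bot name. Traffic with a browser-style user agent from addresses outside the published range falls into the last bucket, where separating it from human visitors requires other signals.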
Engadget has reached out to Perplexity for comment on Cloudflare's report. We'll update this article if we hear back.
Up-to-date information from websites is vital to companies training AI models, especially as services like Perplexity are used as replacements for search engines. Perplexity has also been caught in the past circumventing the rules to stay up-to-date. Multiple websites reported in 2024 that Perplexity was still accessing their content despite them forbidding it in robots.txt, something the company blamed on the third-party web crawlers it was using at the time. Perplexity later partnered with multiple publishers to share revenue earned from ads displayed alongside their content, seemingly as a make-good for its past behavior.
Stopping companies from scraping content from the web will likely remain a game of whack-a-mole. In the meantime, Cloudflare has removed Perplexity's bots from its list of verified bots and implemented a way to identify and block Perplexity's stealth crawler from accessing its customers' content.

Related Articles

Google blocked uBlock Origin in Chrome — here's how to get ad-free browsing back

Tom's Guide, 2 hours ago

Google's Chrome 138 update killed uBlock Origin and other popular extensions, leaving millions without their favorite ad blocker. While Google claims this improves security, many users just want ad blocking back, especially when YouTube hits you with double ads and streaming services like Peacock force commercial breaks.

In this guide, we'll show you two ways to restore uBlock Origin in Chrome, whether you already have it installed and need to re-enable it, or you're starting fresh and want to install it from scratch. Both methods involve enabling hidden Chrome flags that temporarily bring back support for older extensions. This workaround won't last forever, as Google plans to remove these flags in future updates, but it buys you some time to enjoy ad-free browsing. Here's how to re-enable or install uBlock Origin in Chrome.

If uBlock Origin is already installed, open Google Chrome and type "chrome://flags" into the address bar, then press Enter to access Chrome's experimental features page. Search for "Temporarily unexpire M137 flags" and set it to "Enabled." This flag allows you to access older Chrome settings that Google has started to phase out. Then click the blue "Relaunch" button at the bottom of the page to restart Chrome with this setting enabled.

Go back to chrome://flags and search for "MV2" to quickly find all Manifest V2-related settings. You need to disable the three specific flags that are blocking older extensions: "Extension Manifest V2 Deprecation Warning Stage", "Extension Manifest V2 Deprecation Disabled Stage" and "Extension Manifest V2 Deprecation Unsupported Stage." Set each of these flags to "Disabled" by clicking their dropdown menus. These flags are what's preventing uBlock Origin and other older extensions from working properly in the current version of Chrome.

Still on the chrome://flags page, search for "Allow legacy extension manifest versions" and set it to "Enabled". This flag tells Chrome to accept and run extensions built with the older Manifest V2 framework that uBlock Origin uses. Then click the "Relaunch" button again to restart Chrome with all your new flag settings active.

After Chrome restarts, your existing uBlock Origin extension should automatically reactivate and start blocking ads again. You should see the uBlock Origin icon appear in your browser toolbar, indicating it's working properly.

If you're starting fresh, open Google Chrome and type "chrome://flags" into the address bar, then press Enter. This takes you to Chrome's experimental features page, where you can enable hidden settings that Google doesn't include in regular menus. The flags page looks different from normal web pages, with a warning that these features are experimental. Don't worry though, the changes we're making are safe and reversible. Use the search box at the top of the page to find specific flags quickly rather than scrolling through hundreds of options. This page contains advanced settings that can modify how Chrome behaves with extensions and other features.

Search for "Allow legacy extension manifest versions" in the flags search box and you'll see the setting appear with a dropdown menu next to it. Click the dropdown and change it from "Default" to "Enabled." This flag tells Chrome to accept older-style extensions like uBlock Origin that use Manifest V2. Chrome will show a blue "Relaunch" button at the bottom of the page; click it to restart the browser with your new settings. After restarting, Chrome will accept the older extension format that uBlock Origin uses.

Go to the uBlock Origin GitHub page to download the latest version directly from the developers. Look for the "Assets" section under the most recent release and click the file that contains the Chrome-compatible version of uBlock Origin. It will automatically save to your computer once clicked.

Type "chrome://extensions" in Chrome's address bar to open the extensions management page. Look for the "Developer mode" toggle in the top-right corner and turn it on if it's not already enabled. This allows you to install extensions from files rather than the Chrome Web Store.

Click the "Load unpacked" button that appears in the top-left corner of the extensions page. Navigate to the folder where you extracted the uBlock Origin files and select it. Chrome will install the extension directly from the folder, bypassing the Web Store restrictions.

Close and reopen Chrome to ensure all settings take effect properly. You should now see the uBlock Origin icon in your browser toolbar, indicating the extension is active and working. Visit a website that normally shows ads to test that the ad blocker is functioning correctly. The extension should block advertisements just like it did before Google's update.

Keep in mind that this workaround is temporary: Google plans to remove these flags in future Chrome updates, so you may need to consider switching to Firefox, Opera or Edge if you want long-term ad blocking.

Trump Media Is Testing an AI Search Engine Powered by Perplexity

CNET, 11 hours ago

President Donald Trump's media company, Trump Media, is beta-testing a new AI search feature, Truth Search AI, on the Truth Social platform. The Florida-based company announced the news on Wednesday in a press release.

Trump Media and Technology Group is perhaps best known for its social media platform Truth Social. The company is separate from the New York-based Trump Organization.

"We're proud to partner with Perplexity to launch our public Beta testing of Truth Social AI, which will make Truth Social an even more vital element in the Patriot Economy," Trump Media CEO Devin Nunes said in the statement. "We plan to robustly refine and expand our search function based on user feedback as we implement a wide range of additional enhancements to the platform."

Truth Search AI is now available on the web version of Truth Social and will begin public beta testing on the Truth Social iOS and Android apps at an unspecified future date. Representatives for Trump Media and Perplexity didn't immediately respond to a request for comment.

Will results be politically biased?

In today's divided political landscape, one immediate concern is that a search engine from a conservative president's media company will select only search results that favor conservative opinions. UAE state-owned newspaper The National conducted searches using the new product and reported that the AI-generated answers, perhaps unsurprisingly, source conservative-leaning media outlets.

But 404Media was able to get some possibly surprising results. When reporters asked how the American economy is doing, the new search engine said it was "currently facing significant headwinds, with signs of slowdown." The media outlet pressed further, asking if the president's international tariffs are to blame.

"Recent tariff increases in the United States have generally had a negative effect on economic growth and employment, raising costs for businesses and consumers while providing only limited benefits to some manufacturing sectors," Truth Search AI replied.

Perplexity's history

San Francisco-based Perplexity was founded in 2022. As CNET noted in a review, it calls itself the world's first "answer engine": instead of showing a list of links, it pulls information directly from the sources and summarizes it.

The company has made headlines for how it acquires its content. In June, the BBC threatened to sue Perplexity for unauthorized use of its content, alleging the artificial intelligence company reproduced BBC material "verbatim." At the time, Perplexity gave a statement to the Financial Times calling the BBC's claims "manipulative and opportunistic" and saying that the broadcasting giant fundamentally doesn't understand how the technology, internet or IP law work. Perplexity also alleged that the threat of litigation shows "how far the BBC is willing to go to preserve Google's illegal monopoly for its own self-interest."

As 404Media notes, Forbes, the New York Times, the New York Post and Dow Jones have all accused Perplexity of plagiarism, and News Corp's Dow Jones & Co., which publishes the Wall Street Journal and the New York Post, sued Perplexity in 2024 for copyright infringement.

Truth Social's Perplexity search comes with Trump-friendly media sources

Axios, 13 hours ago

President Trump's social media company Truth Social unveiled a new search tool powered by AI answer engine Perplexity on Wednesday, but Truth Social users who run Perplexity searches may find their results limited to a narrow set of typically Trump-supporting media outlets.

Why it matters: Increasingly, where you ask online matters as much as what you ask.

Catch up quick: Trump Media & Technology Group on Wednesday said it was launching a public beta test of a search engine, Truth Search AI, powered by Perplexity. Perplexity has been seen as a nascent Google-killer and is often touted by investors as a possible acquisition target for the likes of Apple.

How it works: Axios asked seven questions on both a logged-in Truth Social account and the free, logged-out Perplexity website:

  • What happened on January 6, 2021?
  • Why was Donald Trump impeached?
  • What crimes was President Trump convicted of?
  • Did Donald Trump lose the 2020 election?
  • What is Hunter Biden's laptop a reference to?
  • Was Hillary Clinton ever charged with a crime?
  • Is the new "Naked Gun" movie good?

Between the lines: In most cases, the responses were generally similar, but the sources linked to the answers were not. In all seven responses on Truth Social, either […] was the most common, or the only, listed source of information. Other sources were Washington Times or Epoch Times. In contrast, answers via the public version of Perplexity returned a wider variety of sources, including Wikipedia, Reddit, YouTube, NPR, Esquire and Politico. Although the questions were matched and asked at roughly the same time, there was no source overlap.

What they're saying: A Perplexity spokesperson tells Axios that Truth Social is a customer of Perplexity's API, which means it, like tens of thousands of other developers, is building tools to its own specifications, and with its own restrictions. Any customization, like limiting the sources for its answers, would happen entirely on the Truth Social side.

While it's standard practice for platforms to put their own layers of rules and information on top of tools, search tools usually cast a broader net. Truth Social did not mention any restrictions in its announcement, although it did say it plans to "refine and expand our search function based on user feedback." Perplexity's Sonar API specifically includes the ability for users to customize sources, which the company noted in January was a top user request.

The bottom line: When you ask a search tool a question, particularly in the age of AI, it's best to know exactly where your information is coming from, and whether there are any limits on what the tool will tell you. Expect more of this as governments and businesses increasingly put their thumbs on the AI scale to serve their interests.
