Latest news with #scraping


The Verge
4 days ago
- Business
- The Verge
Cloudflare says Perplexity's AI bots are 'stealth crawling' blocked sites
The AI search startup Perplexity is allegedly skirting restrictions meant to stop its AI web crawlers from accessing certain websites, according to a report from Cloudflare. In the report, Cloudflare claims that when Perplexity encounters a block, the startup will conceal its crawling identity 'in an attempt to circumvent the website's preferences.' The report adds to concerns about Perplexity vacuuming up content without permission, as the company was caught barging past paywalls and ignoring sites' robots.txt files last year. At the time, Perplexity CEO Aravind Srinivas blamed the activity on third-party crawlers used by the site.
Now, Cloudflare, one of the world's biggest internet architecture providers, says it received complaints from customers who claimed that Perplexity's bots still had access to their websites even after they stated their preference in their robots.txt file and created Web Application Firewall (WAF) rules to restrict access to the startup's AI bots. To test this, Cloudflare says it created new domains with similar restrictions against Perplexity's AI scrapers. It found that the startup will first attempt to access the sites by identifying itself with the names of its crawlers: 'PerplexityBot' or 'Perplexity-User.' But if the website has restrictions against AI scraping, Cloudflare claims Perplexity will change its user agent (the bit of information that tells a website what kind of browser and device you're using, or if the visitor is a bot) to 'impersonate Google Chrome on macOS.'
Cloudflare says this 'undeclared crawler' uses 'rotating' IP addresses that the company doesn't include on the list of IP addresses used by its bots. Additionally, Cloudflare claims that Perplexity switches between autonomous system numbers (ASNs), which identify groups of IP networks controlled by a single operator, to get around blocks as well. 'This activity was observed across tens of thousands of domains and millions of requests per day,' Cloudflare writes.
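For readers unfamiliar with the mechanics, the sketch below illustrates why a robots.txt rule only binds a crawler that honestly declares itself. It uses Python's standard `urllib.robotparser`; the robots.txt contents, the `example.com` URL, the `is_allowed` helper, and the Chrome user-agent string are all illustrative assumptions, not taken from Cloudflare's report or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an example site that blocks only the
# two crawler names mentioned in the article.
ROBOTS_TXT = """\
User-agent: PerplexityBot
User-agent: Perplexity-User
Disallow: /
"""

# Illustrative browser-style user-agent string (an assumption for this sketch).
CHROME_ON_MACOS = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def is_allowed(user_agent: str, url: str = "https://example.com/article") -> bool:
    """Return True if the hypothetical robots.txt permits this user agent."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("PerplexityBot", "Perplexity-User", CHROME_ON_MACOS):
        print(f"{agent[:40]!r}: allowed={is_allowed(agent)}")
```

Because the rules are matched against whatever name the visitor reports, a request that presents a generic browser user agent is not covered by a rule written for a named bot. That is why site operators also lean on network-level signals such as IP ranges and ASNs.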
In a statement to The Verge, Perplexity spokesperson Jesse Dwyer called Cloudflare's report a 'publicity stunt,' adding that 'there are a lot of misunderstandings in the blog post.' Cloudflare has since de-listed Perplexity as a verified bot and has rolled out methods to block Perplexity's 'stealth crawling.' Cloudflare CEO Matthew Prince has been outspoken about AI's 'existential threat' to publishers. Last month, the company started letting websites ask AI companies to pay to crawl their content, and began blocking AI crawlers by default.
Yahoo
4 days ago
- Business
- Yahoo
Perplexity accused of scraping websites that explicitly blocked AI scraping
AI startup Perplexity is crawling and scraping content from websites that have explicitly indicated they don't want to be scraped, according to internet infrastructure provider Cloudflare. On Monday, Cloudflare published research saying it observed the AI startup ignore blocks and hide its crawling and scraping activities. The network infrastructure giant accused Perplexity of obscuring its identity when trying to scrape web pages 'in an attempt to circumvent the website's preferences,' Cloudflare's researchers wrote. AI products like those offered by Perplexity rely on gobbling up large amounts of data from the internet, and AI startups have long scraped text, images, and videos from the internet, many times without permission, to make their products work. In recent times, websites have tried to fight back by using the web standard robots.txt file, which tells search engines and AI companies which pages can be indexed and which shouldn't, efforts that have seen mixed results so far. Perplexity appears to be willingly circumventing these blocks by changing its bots' 'user agent,' meaning a signal that identifies a website visitor by their device and version type, as well as changing its autonomous system number, or ASN, essentially a number that identifies large networks on the internet, according to Cloudflare. 'This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals,' read Cloudflare's post. Perplexity spokesperson Jesse Dwyer dismissed Cloudflare's blog post as a 'sales pitch,' adding in an email to TechCrunch that the screenshots in the post 'show that no content was accessed.' In a follow-up email, Dwyer claimed the bot named in the Cloudflare blog 'isn't even ours.'
Cloudflare said it first noticed the behavior after its customers complained that Perplexity was crawling and scraping their sites, even after they added rules to their robots.txt file and rules specifically blocking Perplexity's known bots. Cloudflare said it then performed tests and confirmed that Perplexity was circumventing these blocks. 'We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked,' according to Cloudflare. The company also said that it has de-listed Perplexity's bots from its verified list and added new techniques to block them. Cloudflare has recently taken a public stance against AI crawlers. Last month, Cloudflare announced the launch of a marketplace allowing website owners and publishers to charge AI scrapers who visit their sites. Cloudflare's chief executive Matthew Prince sounded the alarm at the time, saying AI is breaking the business model of the internet, particularly for publishers. Last year, Cloudflare also launched a free tool to prevent bots from scraping websites to train AI. This is not the first time Perplexity has been accused of scraping without authorization. Last year, news outlets, such as Wired, alleged Perplexity was plagiarizing their content. Weeks later, Perplexity's CEO Aravind Srinivas was unable to immediately answer when asked to provide the company's definition of plagiarism during an interview with TechCrunch's Devin Coldewey at the Disrupt 2024 conference.


TechCrunch
4 days ago
- Business
- TechCrunch
Perplexity accused of scraping websites that explicitly blocked AI scraping
AI startup Perplexity is crawling and scraping content from websites that have explicitly indicated they don't want to be scraped, according to internet infrastructure provider Cloudflare. On Monday, Cloudflare published research saying it observed the AI startup ignore blocks and hide its crawling and scraping activities. The network infrastructure giant accused Perplexity of obscuring its identity when trying to scrape web pages 'in an attempt to circumvent the website's preferences,' Cloudflare's researchers wrote. AI products like those offered by Perplexity rely on gobbling up large amounts of data from the internet, and AI startups have long scraped text, images, and videos from the internet, many times without permission, to make their products work. In recent times, websites have tried to fight back by using the web standard robots.txt file, which tells search engines and AI companies which pages can be indexed and which shouldn't, efforts that have seen mixed results so far. Perplexity appears to be willingly circumventing these blocks by changing its bots' 'user agent,' meaning a signal that identifies a website visitor by their device and version type, as well as changing its autonomous system number, or ASN, essentially a number that identifies large networks on the internet, according to Cloudflare. 'This activity was observed across tens of thousands of domains and millions of requests per day. We were able to fingerprint this crawler using a combination of machine learning and network signals,' read Cloudflare's post. Perplexity spokesperson Jesse Dwyer dismissed Cloudflare's blog post as a 'sales pitch,' adding in an email to TechCrunch that the screenshots in the post 'show that no content was accessed.' In a follow-up email, Dwyer claimed the bot named in the Cloudflare blog 'isn't even ours.'
Cloudflare said it first noticed the behavior after its customers complained that Perplexity was crawling and scraping their sites, even after they added rules to their robots.txt file and rules specifically blocking Perplexity's known bots. Cloudflare said it then performed tests and confirmed that Perplexity was circumventing these blocks. 'We observed that Perplexity uses not only their declared user-agent, but also a generic browser intended to impersonate Google Chrome on macOS when their declared crawler was blocked,' according to Cloudflare. The company also said that it has de-listed Perplexity's bots from its verified list and added new techniques to block them. Cloudflare has recently taken a public stance against AI crawlers. Last month, Cloudflare announced the launch of a marketplace allowing website owners and publishers to charge AI scrapers who visit their sites. Cloudflare's chief executive Matthew Prince sounded the alarm at the time, saying AI is breaking the business model of the internet, particularly for publishers.
Last year, Cloudflare also launched a free tool to prevent bots from scraping websites to train AI. This is not the first time Perplexity has been accused of scraping without authorization. Last year, news outlets, such as Wired, alleged Perplexity was plagiarizing their content. Weeks later, Perplexity's CEO Aravind Srinivas was unable to immediately answer when asked to provide the company's definition of plagiarism during an interview with TechCrunch's Devin Coldewey at the Disrupt 2024 conference.


The Verge
21-06-2025
- Business
- The Verge
The BBC cracks down on AI scraping.
Posted Jun 20, 2025 at 12:01 PM UTC. The British broadcaster has threatened legal action against Perplexity for allegedly using BBC content to train AI, saying that the Perplexity chatbot was regurgitating its content verbatim. This is the first time that the BBC has taken action against an AI company, demanding that Perplexity cease scraping BBC content, delete copies of infringing material, and provide the broadcaster with 'financial compensation.'
Yahoo
17-06-2025
- Business
- Yahoo
Analysts revamp forecast for Nvidia-backed AI stock
Analysts revamp forecast for Nvidia-backed AI stock originally appeared on TheStreet.
I have a virtual private server with several services running on it. It has replacements for Google Drive, WhatsApp, and GitHub (or GitLab). Getting a sufficiently good internet connection that would allow me to use a real (on-premise) machine instead of a virtual one is very difficult where I live. I've been maintaining this server without any (serious) problems for a couple of years. However, in the past few months, the situation has changed for the worse. Nothing brings me more joy than an occasional email from my VPS provider telling me that my server's CPU usage has been averaging 98% for the last 2 hours. My server, which was almost invisible for a very long time, has become a target of scrapers and scanners.
I am not alone in having this issue. Many prominent open-source projects have had to protect themselves, too, and recently they started using "Anubis" for this. (Not the malware with the same name.)
Why the sudden change, you might ask? Well, an increasing number of companies think they will be the ones to create this 'incredible artificial intelligence.' So, they are scraping any website, regardless of whether its data is relevant and reliable. The more data they can collect, the better, seems to be the prevailing modus operandi. And once they're done collecting, they throw everything into the blender and hope for the best.
What if you are a little startup with the aforementioned goal of writing incredible AI, you've done the previous step of collecting the data, and now you just need that blender? Perhaps you have some investor money, but can't build that blender yourself. After all, graphics cards used for AI training cost an arm and a leg. This is where CoreWeave comes in.
Just like the VPS providers that let people like me, who can't run real machines as servers, use theirs instead, CoreWeave enables companies that can't afford AI servers to do their AI training on its GPU mega clusters.
Considering that the company's business model is 'renting' Nvidia graphics cards, it is not surprising that the company has become Nvidia's largest holding, making up more than 78% of Nvidia's disclosed holdings.
CoreWeave released its earnings report for Q1 2025 on May 14th. Here are the highlights:
- Revenue of $981.6 million, a 420% increase year-over-year.
- Net loss of $314.6 million, a 143% increase YoY.
- Adjusted EBITDA of $606.1 million, a 480% increase YoY.
Guidance for the full year 2025 was:
- Revenue from $4.9 billion to $5.1 billion
- Capital expenditures of $20 billion to $23 billion
Bank of America analysts Brad Sills and Carly Liu shared their opinions on the CoreWeave stock. "In our view, the AI infrastructure [capital expenditures] growth rate is peaking, though still very healthy (estimates are likely to move higher on a larger base), led by OpenAI. OpenAI's ChatGPT is the single largest consumer of AI workloads and is growing at a rapid pace. Therefore, we see solid sustained demand in CoreWeave's AI infrastructure market," said the analysts.
In Q1, CoreWeave expanded its deal with OpenAI, bringing the total contract value to $15.9 billion. The company also signed a new hyperscaler customer in Q1. It has also increased the average contract duration to four and a half years from four years. The analysts forecast $21 billion of negative free cash flow for the company through calendar year 2027, driven by high capital expenditures. CoreWeave funds the majority of its capital expenditures with debt.
The company managed to lower the interest rate on its recent $2 billion debt raise to 9.3%, from 11% in calendar year 2024. "However, this remains a small % of the total incremental debt required from here, raising some questions, in our view," continued the analysts. Sills and Liu noted that the stock is trading at twenty-five times its calendar year 2027 EBIT estimate, which is a premium to the peer group that is trading at sixteen times the estimate. They raised their price objective for CoreWeave from $76 to $185, which is 29 times their calendar year 2027 EBIT estimate (vs. 16x previously), or 0.4 times adjusted for 69% growth. That said, they cut their rating on the stock after CoreWeave's recent rally, arguing there's less room for shares to head higher. "We believe much of the near-term upside has been priced in and downgrade our rating to neutral from buy," concluded the analysts.
Analysts revamp forecast for Nvidia-backed AI stock first appeared on TheStreet on Jun 16, 2025.