Amazon Web Services (AWS) is investigating Perplexity AI, a rapidly growing AI search startup, for allegedly violating AWS rules by scraping content from websites that explicitly prohibited such activity. This scrutiny follows reports from Wired and Forbes about Perplexity’s practices, raising questions about ethical AI usage and adherence to web standards.
Perplexity AI, valued at $3 billion and backed by the Jeff Bezos Family Fund and Nvidia, is accused of ignoring the Robots Exclusion Protocol, a web standard that allows websites to specify which pages should not be accessed by bots via a robots.txt file. While this protocol is not legally binding, it is widely respected in the tech community.
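To make the protocol concrete, here is a minimal sketch of how a well-behaved crawler might consult a robots.txt file before fetching a page, using Python's standard `urllib.robotparser` module. The robots.txt content, the `ExampleBot` and `OtherBot` names, and the example.com URLs are invented for illustration and do not come from any site named in this article.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# ExampleBot is barred from the whole site.
print(parser.can_fetch("ExampleBot", "https://example.com/articles/1"))  # False
# Other bots fall under the wildcard rules: allowed except under /private/.
print(parser.can_fetch("OtherBot", "https://example.com/articles/1"))    # True
print(parser.can_fetch("OtherBot", "https://example.com/private/page"))  # False
```

Nothing in this check is enforced by the server. The file only states the site's wishes, and honoring them is up to the bot's operator, which is why compliance is a matter of convention rather than law.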
AWS spokesperson Patrick Neighorn confirmed the investigation, emphasizing that AWS customers must comply with robots.txt guidelines. The company prohibits abusive or illegal activity and requires customers to comply with these standards.
Allegations of Data Scraping
Wired reported that Perplexity’s AI-powered search engine appeared to access content from publishers including Condé Nast, The Guardian, Forbes, and The New York Times, despite restrictions in their robots.txt files. An IP address linked to Perplexity’s use of AWS was identified as the source of the unauthorized scraping.
Perplexity’s Response
Perplexity CEO Aravind Srinivas initially defended the company’s practices, claiming that the allegations showed a misunderstanding of their technology. He later attributed the scraping to a third-party service used for web crawling, without disclosing the company’s name due to a nondisclosure agreement. When asked whether Perplexity would stop the third party from crawling Wired, Srinivas’s response was non-committal.
Perplexity spokesperson Sara Platnick stated that the company complies with AWS’s terms and that its PerplexityBot respects robots.txt. However, she acknowledged that the bot might ignore these files when a user specifically requests a URL, likening this to manual access by a user.
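The distinction Platnick draws can be illustrated with a short sketch. It is an assumption-laden reading of her statement, not Perplexity's actual code: the `may_fetch` function, its `user_requested` flag, the `ExampleBot` name, and the example.com URL are all hypothetical.

```python
from urllib.robotparser import RobotFileParser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str,
              user_requested: bool) -> bool:
    """Hypothetical policy: consult robots.txt only for automated crawling."""
    if user_requested:
        # Treated like a person opening the page, so robots.txt is skipped.
        return True
    return parser.can_fetch(user_agent, url)


# Hypothetical rules that bar every bot from the whole site.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /"])

url = "https://example.com/story"
crawl_allowed = may_fetch(rules, "ExampleBot", url, user_requested=False)
user_allowed = may_fetch(rules, "ExampleBot", url, user_requested=True)
print(crawl_allowed, user_allowed)  # False True
```

Under a policy like this, a site that disallows all bots would still see its pages fetched whenever a user supplies the URL, which is the behavior publishers object to.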
Industry Reaction
The digital content industry has responded with concern. Jason Kint, CEO of Digital Content Next, criticized Perplexity’s practices, emphasizing that AI companies should not repurpose publishers’ content without permission. Forbes’s Chief Content Officer, Randall Lane, accused Perplexity of plagiarism, describing their practices as “cynical theft” and highlighting the lack of proper attribution in AI-generated articles.
The investigation into Perplexity AI reflects broader tensions in the tech industry regarding content scraping and copyright. As AI companies increasingly rely on web data for training, the line between fair use and exploitation becomes blurred. While companies like Google and OpenAI acknowledge using publicly available data, the lack of transparency about training datasets continues to fuel debate.