Presumably the issue is more the travel guides/Time Out/Tripadvisor type websites. They make money from you reading their stuff, not from you actually spending money in the place.
---

Google snippets are hilariously wrong, absurdly often; I was recently searching for things while traveling, and I can easily imagine relying on snippets getting people into actual trouble.
---

Google has been in trouble for doing so several times in the past and removed key features because of it. Examples: viewing cached pages, linking directly to images, summarized news articles.
---

> Travel bloggers and vloggers

I've no doubt some good ones exist, but my instinct is to ignore every word this industry says, because it's paid placement and our world is run by advertisers.
---

Ironically, I've just started asking LLMs to summarize paywalled content, and if that doesn't answer my question I'll check web archives or ask for the full article's text.
---

Ah, you're correct, my bad. I don't personally have a problem with spoofing user agents, but yeah, they're either spoofing or for some reason they're truly using a non-headless Chrome.
---

Similarly, for sites which configure robots.txt to disallow all bots except Googlebot, I don't lose sleep over new search engines taking that with a grain of salt.
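
For concreteness, a robots.txt of the kind described here, where everything is blocked except Googlebot, looks roughly like this (an empty Disallow means "allow everything"):

```
# Allow Google's crawler everywhere
User-agent: Googlebot
Disallow:

# Block every other crawler
User-agent: *
Disallow: /
```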
---

This is exactly the concern, and there are a lot of comments here either completely ignoring it or willfully conflating the two issues. Ad blocking isn't the same problem, because it doesn't and can't steal the creator's data.
---

> TFA demonstrates that they got a hit on their site

What's stopping Perplexity from caching this info for, say, 24 hours, and then redisplaying it to the next few hundred people who request it?
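
Nothing technical, at any rate. A minimal sketch of such a cache (hypothetical, just to show how little machinery it would take):

```python
import time

# Toy 24-hour response cache: any fetch-and-summarize service could
# wrap its page fetches in something like this, so only the first
# requester ever hits the origin site.
CACHE = {}          # url -> (timestamp, body)
TTL = 24 * 60 * 60  # seconds

def fetch_with_cache(url, fetch_fn):
    now = time.time()
    if url in CACHE:
        stored_at, body = CACHE[url]
        if now - stored_at < TTL:
            return body  # served without touching the origin again
    body = fetch_fn(url)
    CACHE[url] = (now, body)
    return body
```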
---

TECHNICAL ANALYSIS

The key, as many here have missed, is authentication and authorization. You may have authorization to log in and view movies on Netflix, but not to rebroadcast them. Even the question of a VCR for personal use was debated in the past. Distributing your own scripts and software to process data is not the same as distributing arbitrary data those scripts encountered on the internet for which you don't have a license. If someone wrote an article, your reader transforms it based on your authenticated request, and your user would have an authorized subscription. But if that reader then sent the article to a remote server to be processed for distribution to unlimited numbers of people, it would be “pirating” that information. The problem is that much of the Web is not properly guarded against this. Xanadu had ideas about micropayments 30 years ago. Take a look at what I am building using the current web: https://qbix.com/ecosystem

LEGAL ANALYSIS

Much of the content published on the Web isn't secured with subscriptions and micropayments, which is why the whole thing becomes a legal battle as silly as “exceeding authorized access”, which landed someone like Aaron Swartz in jail. In other words, it is the question of “piracy”, which has acquired a new character only in that the AI is trained on your data and transforms it before it republishes it. There was also a lawsuit about scraping LinkedIn, which was settled as follows: https://natlawreview.com/article/hiq-and-linkedin-reach-prop...

Legally, you can grant access to people subject to a certain license (e.g. Creative Commons Share-Alike), and then any derived content must have its weights opened, similar to, say, the Affero GPL license for derivative software.
---

A user-agent requests the file using your credentials, e.g. a cookie or a public-key signature. It is transforming the content for you, an authorized party. That is not the same as then making derivative copies and distributing the information to others without paying. For example, if I bought a ticket to a show, taped it, and then distributed it to everyone, disregarding that the show prohibited this.

If I shared my Netflix password with up to 5 others, at least I could argue that they are part of my “family” or something. But to unlimited numbers of people? Why would they pay for Netflix, and how would the shows get made?

I am not necessarily endorsing government force enforcing copyright, which is why I have been building a solution to enforce it at the tech level: https://Qbix.com/ecosystem
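
To make the first half of that distinction concrete, here is a minimal sketch of a user-agent fetching an article with the user's own credentials; the URL and cookie value are placeholders:

```python
import urllib.request

# Hypothetical: fetch an article with the user's own session cookie,
# exactly as their browser would. Any transformation (summarizing,
# reformatting) then happens for that one authorized user.
req = urllib.request.Request(
    "https://news.example.com/articles/123",
    headers={"Cookie": "session=USERS_OWN_SESSION_TOKEN"},
)
with urllib.request.urlopen(req) as resp:
    article_html = resp.read().decode("utf-8")

# The objection above is to what happens next: storing this response
# server-side and redistributing it to unlimited unauthorized users.
```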
---

Because if I run a server, at my own expense, I get to use information provided by the client to determine what response, if any, to provide? This isn't a very difficult concept to grasp.
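
A minimal sketch of that idea using Python's standard-library HTTP server; the blocklist entries are illustrative:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

BLOCKED_AGENTS = ("PerplexityBot", "GPTBot")  # illustrative blocklist

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The server decides its response from whatever the client
        # sent, including refusing to serve it at all.
        ua = self.headers.get("User-Agent", "")
        if any(bot in ua for bot in BLOCKED_AGENTS):
            self.send_response(403)
            self.end_headers()
            self.wfile.write(b"Bots not welcome here.\n")
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"Hello, human (or convincing impostor).\n")

HTTPServer(("", 8000), Handler).serve_forever()
```

Of course, as the next comment notes, nothing forces a client to send an honest User-Agent; that is exactly the limitation of this approach.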
---

I'm having difficulty grasping the concept. Only a fool would trust any HTTP header, such as User-Agent, sent by a random unauthenticated client. Your expenses are your problem.
---

But nobody is arguing for that. Instead, what the server owners want is to mandate that clients connecting to them provide enough information to reliably reject such connections.
---

Let's differentiate between:

1) a user-agent which makes an authenticated and authorized request for data and delivers it to the user, and

2) a user who then turns around and distributes the data or its derivatives to other users in an unauthorized manner.

A “dumber” example would be whether I can indefinitely cache and index most of the information available via the Google Places API, as long as my users request each item at least once. Can I duplicate all the map or Street View photo information that Google paid cars to drive around and photograph? Or how about the info that Google users entered as user-generated content?

THE REQUIREMENT TO OPEN SOURCE WEIGHTS

Legally, if I had a Creative Commons Share-Alike license on my data, and an LLM was trained on it and then served unlimited requests to others without making the weights available...

...that would be almost exactly as if I had made my code available under the Affero GPL license, and someone took my code and incorporated it into backend software hosting a social network or something, without making their own entire social network source code available.

Technically this should be enforceable via a court order compelling the weights to be open-sourced to the public. (Alternatively, they'd have to pay damages in a class-action lawsuit and stop using the tainted backend software or weights when serving all those people.)
---

Just to go into a little more detail on this, because the article and most of the conversation here are based on a big misunderstanding: robots.txt governs crawlers. Fetching a single user-specified URL is not crawling. Crawling is when you automatically follow links to continue fetching subsequent pages.

Perplexity's documentation that the article links to describes how their crawler works. That is not the piece of software that fetches individual web pages when a user asks for them. That's just a regular user-agent, because it's acting as an agent for the user.

The distinction between crawling and not crawling has been very firmly established for decades. You can see it in action with wget. If you fetch a specific URL with `wget https://www.example.com`, then wget will just fetch that URL. It will not fetch robots.txt at all. If you tell wget to act recursively with `wget --recursive https://www.example.com` to crawl that website, then wget will fetch `https://www.example.com`, look for links on the page, and if it finds any links to other pages, it will fetch `https://www.example.com/robots.txt` to check whether it is permitted to fetch those subsequent links.

This is the difference between fetching a web page and crawling a website. Perplexity is following very well-established norms here.
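
The same convention is visible in Python's standard library: a one-off fetch never consults robots.txt, while a crawler is expected to check it via `urllib.robotparser` before following links. A minimal sketch, with placeholder URLs:

```python
import urllib.request
from urllib import robotparser

# One-off fetch on a user's behalf: by long-standing convention,
# robots.txt is not consulted at all.
page = urllib.request.urlopen("https://www.example.com").read()

# Crawling: before following discovered links, a well-behaved crawler
# checks robots.txt for permission first.
rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()
linked_url = "https://www.example.com/some/linked/page"
if rp.can_fetch("MyCrawler", linked_url):
    urllib.request.urlopen(linked_url)
```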
---

You are definitionally incorrect. From Wikipedia:

> robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

In robotstxt.org/orig.html (the original proposed specification), there is a bit about "recursive" behaviour, but the last paragraph indicates "which parts of their server should not be accessed":

> WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.

> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

> These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

In the draft RFC at robotstxt.org/norobots-rfc.txt, the definition is a little stricter about "recursive", but it indicates that heuristics and/or time spacing do not make something less of a robot. And robotstxt.org/faq/what.html has this paragraph:

> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

One might argue that the misbehaviour of Perplexity on this matter is "at the instruction" of a human, but since Perplexity presents itself not as a web browser but as a data-processing entity, it's clearly not a web browser.

Here's what would be permitted unequivocally, even on a site that blocks bad actors like Perplexity: a browser extension that used Perplexity's LLM to pretend to summarize but actually shorten the content (https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actu...) when you visit the page, as long as that summary were not saved in Perplexity's data.
---

And it's up to the client to send as many requests as they see fit; it's still called a DDoS attack when overdone, regardless of the freedom the client has to do it.
---

If you want summaries from my website, go to my website. I want a way to deny any licence to any third-party user agent that will apply machine learning to my content, whether or not you initiated the request.

LLMs — and more importantly the companies that train and operate them — should not be trusted at all, especially for so-called "summarization": https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actu...

While Perplexity may be operating against a particular URL based on a direct request from you, they are acting improperly when they "summarize" a website: they have an implicit (and sometimes explicit, if there's a paywall) licence to read and render the content as provided, but not to process and redistribute it.

There needs to be something stronger than robots.txt, where I can specify the uses permitted by indirect user access (in my case, search indexing would be the only permitted use case; no LLM training, no LLM summarization, no proxying, no "sanitization" by parental proxies, etc.).
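
Purely as a hypothetical sketch (no such standard exists today), the kind of policy file being asked for might express purpose-scoped grants rather than robots.txt's flat allow/deny:

```
# usage-policy.txt (hypothetical, not a real standard)
User-agent: *
Allow-purpose: search-indexing
Deny-purpose: llm-training
Deny-purpose: llm-summarization
Deny-purpose: proxying
```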
---

What I gathered from the post was that one of the investigations was to ask what was on [some page url], then check the logs moments later, and see that it was using a normal user agent.
---

To steelman this, even though I think the article did a fine job already: maybe the author could've changed the content on the page, so you would know whether they were serving a cached response.
---

The legality of their behavior is not currently well defined, because it's unprecedented. Fair use permits transformative works. It has yet to be decided whether LLMs and their output qualify as transformative, or even if the training is capable of infringing copyright of an individual work in the first place if they're not reproducing it. In fact, there's a good amount of evidence which indicates that fair use _does_ apply, given how Google operates and what they've argued successfully (https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com...).

Purchasing licenses when you are already entitled to your current use of the work is just bad business, especially when the legal precedent hasn't been set to know what rights might need to exist in said license. You might not like the idea of your blog posts or other publicly posted materials being used to train LLMs, but that doesn't make it illegal (morality is subjective and I'm not about to argue one way or another). If it's really that much of a problem, you _do_ have the ability to remove your information from public accessibility, or otherwise protect it against LLM ingestion (IP restrictions, etc.).

edit: I am not a lawyer (this is likely obvious to any lawyers out there); this is my personal take.
---

> AI companies are run by people who will loot your place when you're not looking and charge you for access to your own stuff.

Funnily enough, they do understand that having your own product used to build a competing product is uncool; they just don't care unless it's happening to them. https://openai.com/policies/terms-of-use/

> What you cannot do. You may not use our Services for any illegal, harmful, or abusive activity. For example [...] using Output to develop models that compete with OpenAI.
---

These "some people" would not fall under the "normal people" that I specifically mentioned. But you go right ahead and keep thinking they are normal, so you can make caveats on an internet forum.
---

> Normal people read books to gain an understanding of a concept, but do not retain the text verbatim in memory in perpetuity.

LLMs wouldn't hallucinate so much if they did that, either.
---

How is a human reading a book in any way related or comparable to a machine ingesting millions of books per day with the goal of stealing their content and replacing them?
---

Your response is fair, and I hope you didn't take my message personally. I agree with you: AI is just a tool, the same as countless others, that can be used for good or evil.
---

...which is a great argument for how unjust a law is when it only protects those who can afford it.

Cheaper processes to protect smaller creators in cases like these are what is really needed.
---

To me, it's like trying to "solve The Pirate Bay" by making all the stuff they share public domain.

But thank you for sharing your perspective, I appreciate that.
---

I guess people just LOVE twisting themselves in knots over some "ethical scandals" or whatnot. Maybe there's a statement on American puritanism hiding somewhere here...
---

This. Medallion systems in NYC were gamed by a guy who let people literally bet on them as if they were an asset. The prices went to a million dollars per medallion until the bubble burst. True story.
---

So? They have a market cap of $150 billion. If at the start they had decided "oh well, let's not bother, since what we are doing is legally ambiguous", they would have a market cap of $0.
---

The first concern is the most legitimate one: can I stop an LLM from training itself on my data? This should be possible, and Perplexity should absolutely make it easy to block them from training.
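
For what it's worth, the standard opt-out today is a robots.txt rule aimed at the crawler's user-agent token (Perplexity's documented crawler token is, as far as I know, PerplexityBot):

```
User-agent: PerplexityBot
Disallow: /
```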
The second concern, though, is whether Perplexity can do a live web query to my website and present its data in a format that the user asks for. Arguing that we should ban this moves into very dangerous territory.
Everything from ad blockers to reader mode to screen readers does exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local. The very nature of a "user agent" is to be an automated tool that manipulates content hosted on the internet according to the specifications given to the tool by the user. I have a hard time seeing an argument against Perplexity using this data in this way that wouldn't apply equally to countless tools that we all already use and which companies try, with varying degrees of success, to block.
I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way their designers envisioned. I want to be able to write scripts to manipulate the page and present it in a way that's useful for me. I don't currently use LLMs this way, but I'm uncomfortable with arguing that it's unethical for them to do that, so long as they're citing the source.
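
For instance, a local script in the reader-mode spirit, fetching a page and re-presenting it as bare text; a minimal sketch using only Python's standard library, with a placeholder URL:

```python
import urllib.request
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping script and style blocks."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

# The user agent fetches the page and re-presents it however the user
# wants: here as bare text, much like reader mode or a screen reader.
html = urllib.request.urlopen("https://www.example.com").read().decode("utf-8", "replace")
parser = TextExtractor()
parser.feed(html)
print("\n".join(parser.chunks))
```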