(comments)

Original link: https://news.ycombinator.com/item?id=40690898

I understand your point, and I acknowledge that the legal implications of these technologies are complex and nuanced. The points raised about fair use and how copyright law applies to machine learning models and their use of publicly available content are important and thought-provoking. While I don't have definitive answers to these questions, it's worth noting that the interpretation and enforcement of intellectual property law varies considerably across jurisdictions and continues to evolve as technology advances and societal expectations shift.

As responsible participants in digital spaces, I believe we have a duty to engage in honest and transparent practices. Individuals and organizations that use machine learning models to generate content or extract insights should prioritize maintaining accurate attribution and giving proper credit to original sources. This respects the intent of copyright and intellectual property protections while fostering a collaborative environment conducive to continued innovation and growth. Furthermore, it's important to remember that these tools and techniques can also bring enormous benefits to society, particularly when applied to fields such as scientific research, medical diagnostics, and environmental modeling. Balancing the potential harms of widespread adoption against their substantial positive impact remains a challenging endeavor that requires ongoing dialogue and cooperation among stakeholders.

Finally, regarding the discussion of robots.txt and its limitations: it's true that the protocol primarily addresses crawling behavior rather than direct user interactions. However, given advances in machine learning capabilities, perhaps it's time to explore alternative approaches to bridge the gap between the spirit of the rules and the technological landscape of today and tomorrow. Possibilities range from implementing improved meta tags and APIs to establishing community standards and guidelines that promote trust and cooperation within the developer ecosystem. Ultimately, finding a balanced solution will depend on the active participation of industry leaders, policymakers, and the broader community.


Original text


There are two different questions at play here, and we need to be careful what we wish for.

The first concern is the most legitimate one: can I stop an LLM from training itself on my data? This should be possible and Perplexity should absolutely make it easy to block them from training.

The second concern, though, is can Perplexity do a live web query to my website and present data from my website in a format that the user asks for? Arguing that we should ban this moves into very dangerous territory.

Everything from ad blockers to reader mode to screen readers does exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local. The very nature of a "user agent" is to be an automated tool that manipulates content hosted on the internet according to the specifications given to the tool by the user. I have a hard time seeing an argument against Perplexity using this data in this way that wouldn't apply equally to countless tools that we all already use, and which companies try, with varying degrees of success, to block.

I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way their designers envisioned it. I want to be able to write scripts to manipulate the page and present it in a way that's useful for me. I don't currently use LLMs this way, but I'm uncomfortable with arguing that it's unethical for them to do so as long as they're citing the source.



It's funny I posted the inverse of this. As a web publisher, I am fine with folks using my content to train their models because this training does not directly steal any traffic. It's the "train an AI by reading all the books in the world" analogy.

But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example). This is unacceptable. A tool that runs on-device (like Reader mode) is different, because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator, and I will never be able to get people directly on my content.

There are many benefits to having people visit your content on a property that you own. E.g., say you are a SaaS company and you have a bunch of Help docs. You can analyze traffic in this section of your website to get insights to improve your business: what are the top search queries from my users? That might indicate where they are struggling, or what new features I could build. In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them, and I would lose all those insights because I never get any traffic.



> they are decreasing the probability that this user would come to my content (via Google, for example).

Google has been providing summaries of stuff and hijacking traffic for ages.

I kid you not, in the tourism sector this has been a HUGE issue; we saw a 50%+ decrease in views when they started doing it.

We paid gazillions to write quality content for tourists about the most different places, just so Google could put it on their homepage.

It's just depressing. I'm more and more convinced that the age of regulation and competition is gone; the US does want unkillable monopolies in the tech sector, and we are all peons.



> Google has been providing summaries of stuff and hijacking traffic for ages.

Yes, Google hijacked images for some time. But in general there has "always" been the option to tell Google not to display summaries etc. with these meta tags:

    <meta name="robots" content="nosnippet">
    <meta name="robots" content="noarchive">


I'm curious about the tourism sector problem. In tourism, I would think the goal would be to promote a location. You want people to be able to easily discover the location, get information about it, and presumably arrange to travel there. If Google gets the information to the users but doesn't send the tourist to the website, is that harmful? Is it a problem of ads on the tourism website? Or is it more a problem of the site creator demonstrating to the site purchaser that the purchase was worthwhile?



We would employ local guides all around the world to craft itinerary plans for visiting places, give tips and tricks, and recommend experiences and places (we made money by selling some of those through our website), and it was a success.

Customers liked the in-depth value of that content, and it converted to purchases (we sold experiences and other stuff, sort of like GetYourGuide).

One day all of our content ended up on Google: search "what time is best to visit the Sagrada Familia" and you would get an answer copy-pasted by Google.

This killed a lot of traffic.

Anyway, I just wanted to point out that the previous user was a bit naive taking his fight to LLMs when search engines and OSs have been leeching and hijacking content for ages.



I totally get that it killed your traffic. If a thousand people a day typing in "what time is best to visit the Sagrada Familia" stopped clicking on the link to your page because Google just told them "4 PM on Thursdays" at the top of the page, you lost a bunch of traffic.

But why did you want the traffic? Was your revenue from ad impressions, or were you perhaps being paid by the city of Barcelona to provide useful information to tourists? If the former, I get that this hurt you. If the latter, was this a failure or a success?



Moreover, if it's the former, then good riddance. An ad-backed site is harming users a little on the margin for the marginal piece of information. Getting the same from a search engine is saving users from that harm.

Parent has the right question here: why did you want the traffic? Did you intend for anything good to happen to those people? I'm going to guess not; there's hardly a scenario where people who complain about lost traffic meant that traffic any good.



Now think of the 2nd order effects: they paid money to collect that useful information. If it’s no longer feasible to create such high quality content, it won’t magic itself into existence on its own. It’ll all be just crap and slop in a few years.



> If it’s no longer feasible to create such high quality content, it won’t magic itself into existence on its own. It’ll all be just crap and slop in a few years.

Except it kind of does. Almost all high-quality free content on the Internet has been made by hobbyists just for the sake of doing it, or as some kind of expense (marketing budget, government spending). The free content is not supposed to make money. An honest way of making money with content is putting up a paywall. Monetizing free content creates a conflict of interest, as optimizing value for the publisher pulls it in the opposite direction from optimizing value for the consumer. Can't serve two masters, and all that. That's why it's effectively a bullet-proof heuristic that the more monetization you see on some free content, the more wrong and more shit it is.

Put another way, monetizing the audience is the hallmark of slop.



If your content has a yes/no or otherwise simple, factual answer that can be conveyed in a 1-2 sentence summary, then I don't see this as a problem. You need to adapt your content strategy, as we all do from time to time.

There was never a guarantee -- for anyone in any industry at all -- that what worked in the past will always continue to work. That is a regressive attitude.

However I do have concerns about Google and other monopolies replacing large swaths of people who make their livings doing things that can now be automated. I am not against automation but I don't think the disruption of our entire societal structure and economy should be in the hands of the sociopaths that run these companies. I expect regulation to come into play once the shit hits the fan for more people.



Presumably the issue is more the travel guides/Time Out/Tripadvisor type websites.

They make money by you reading their stuff, not by you actually spending money in the place.



Google snippets are hilariously wrong, absurdly often; I was recently searching for things while traveling and I can easily imagine relying on snippets getting people into actual trouble.



Google has been in trouble for doing so several times in the past and removed key features because of it. Examples: Viewing cached pages, linking directly to images, summarized news articles.



>We paid gazillions to write quality content for tourists about the most different places just so Google could put it on their homepage. It's just depressing

It's a legitimate complaint, and it sucks for your business. But I think this demonstrates that the sort of quality content you were producing doesn't actually have much value.



>If the "content" had no value, why would google go through the effort of scraping it and presenting it to the user?

They don't present it all, they summarize it.

And let's be serious here, I was being polite because I don't know the OPs business. But 99% of this sort of content is SEO trash and contributes to the wasteland that the internet is becoming. Feel free to point me to the good stuff.



Pedantry aside, let's restate it as "present the core thoughts" to the user, which still implies value. I agree that most of Google's front-page results are SEO garbage these days, but that's a separate issue from claiming that a summary of a piece of information strips the original of its value. I'd even argue that it transfers that value from one entity to the other in this case.



I would also think that the intrinsic value is different. If there is a hotel on a mountain writing "quality content" about the place, to them it really doesn't matter who "steals" their content, the value is in people going to the hotel on the mountain not in people reading about the hotel on the mountain.

Like to society the value is in the hotel, everything else is just fluff around it that never had any real value to begin with.

> Feel free to point me to the good stuff.

Travel bloggers and vloggers, but that is an entirely different unaffected industry (entertainment/infotainment).



>Travel bloggers and vloggers

I've no doubt some good ones exist, but my instinct is to ignore every word this industry says because it's paid placement and our world is run by advertisers.



It's not that it has no value, it's that there is no established way (other than ad revenue) to charge users for that content. The fact that google is able to monetize ad revenue at least as well as, and probably better than, almost any other entity on the internet, means that big-G is perfectly positioned to cut out the creator -- until the content goes stale, anyway.



> until the content goes stale, anyway

This will be quite interesting in the future. One can usually tell if a blog post is stale, or whether it’s still relevant to the subject it’s presenting. But with LLMs they’ll just aggregate and regurgitate as if it was a timeless fact.



This is already a problem. Content farms have realised that adding "in $current_year" to their headlines helps traffic. It's frustrating when you start reading and realise the content is two years out of date.



The Google summaries (before whatever LLM stuff they're doing now) are 2-3 sentences tops. The content on most of these websites is much, much longer than that for SEO reasons.

It sucks that Google created the problem on both ends, but the content OP is referring to costs way more to produce than it adds value to the world because it has to be padded out to show up in search. Then Google comes along and extracts the actual answer that the page is built around and the user skips both the padding and the site as a whole.

Google is terrible, the attention economy that Google created is terrible. This was all true before LLMs and tools like Perplexity are a reaction to the terrible content world that Google created.



It would be a lot better if Google just prioritised concise websites.

If Google preferred websites that cut the fluff, then website operators would have an incentive to make useful websites, and Google wouldn't have as much of an incentive to provide the answer in a snippet, and everyone wins.

I guess it's hard to rank website quality, so Google just prefers verbose websites.



> Google wouldn't have as much of an incentive to provide the answer in a snippet, and everyone wins.

Google has at least two incentives to provide that answer, both of which wouldn't change. The bad one: they want to keep you on their page too, for the usual bullshit attention economy reasons. The good one: users prefer the snippets too.

The user searching for information usually isn't there to marvel at the beauty of random websites hiding that information in piles of noise surrounded by ads. They don't care about websites in the first place. They want an answer to the question so they can get on with whatever it is they're doing. When Google can give them an answer, and this stops them from going from the SERP to any website, then that's just a few seconds or minutes of life that the user doesn't have to waste. Lifespans are finite.



I strongly disagree with you.

The only reason that users prefer snippets is because websites hide the info you are looking for. The problem is that the top ranked search results are ad-infested SEO crap.

If the top ranked website were actually designed with the user in mind, they would not hide the important info. They would present the most important info at the top, and contain additional details below. They would offer the user exactly what they want immediately, and provide further details that the user can read if they want to.

Think of a well written wikipedia article. The summary is probably all that you need, but it's good that the rest of the article with all the detail is there as well. I'm pretty sure that most people prefer a well designed user-centric article to the stupid Google snippet that may or may not answer the question you asked.

Most people looking for info don't look for just a single answer. Often, the answer leads to the next question, or if the answer is surprising, you might want to check whether the source looks credible, etc. Even ads would be helpful if they were actually relevant (e.g. if I am looking for low-profile graphics cards, I'd appreciate an ad for a local retailer that has them in stock).

But the problem is that website operators (and Google) just want to distract you, capture your attention, and get you to click on completely irrelevant bullshit, because that is more profitable than actually helping you.



> A tool that runs on-device (like Reader mode) is different because Perplexity is an aggregator service that will continue to solidify its position as a demand aggregator and I will never be able to get people directly on my content.

If I visit your site from Google with my browser configured to go straight to Reader Mode whenever possible, is my visit more useful to you than a summary and a link to your site provided by Perplexity? Why does it matter so much that visitors be directly on your content?



Well for one thing you visiting his site and displaying it via reader mode doesn't remove his ability to sell paid licenses for his content to companies that would like to redistribute his content. Meanwhile having those companies do so for free without a license obviously does.



I asked you this in the other subthread, but what exactly is the moral distinction (I'm not especially interested in the legal one here because our copyright law is horribly broken) between these two scenarios?

* User asks proprietary web browser to fetch content and render it a specific way, which it does

* User asks proprietary web service to fetch content and render it a specific way, which it does

The technical distinction is that there's a network involved in the second scenario. What is the moral distinction?



Traffic numbers, regardless of whether reader mode is used or not, are used as a basic valuation of a website or page. This is why Alexa rankings have historically been so important.

If Perplexity visits the site once and caches some info to give to multiple users, that is stealing traffic numbers for ad value, but it also takes away the site owner's ability to get a realistic idea of how many people are using the information on their site.

Additionally, this is AI we are talking about. Who's to say that the generated summary of information is actually correct? The only way to confirm that, or to get the correct information in the first place, is to read the original site yourself.



> The only way to confirm that, or to get the correct information in the first place, is to read the original site yourself.

As someone who uses Perplexity, I often do do this. And I don't think I'm particularly in the minority with this. I think their UI encourages it.



Yeah that's one of the best things about them for me. And then I go to the website and often it's some janky UI with content buried super deep. Or it's like Reddit and I immediately get slammed with login walls and a million annoying pop ups. So I'm quite grateful to have an ability to cut through the noise and non-consistency of the wild west web. I agree the idea that we're somewhat killing traffic to the organic web is kind of sad. But at the same time I still go to the source material a lot, and it enables me to bounce more easily when the website is a bit hostile.

I wonder if it would be slightly less sad if we all had our own decentralized crawlers that simply functioned as extensions of ourselves.



> I wonder if it would be slightly less sad if we all had our own decentralized crawlers that simply functioned as extensions of ourselves.

This is something I'm (slowly) working on myself. I have a local language model server and 30 TB of usable storage ready to go, just working on the software :)
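Roughly, the loop I have in mind looks something like this - a minimal sketch, assuming a local OpenAI-compatible chat endpoint (the URL, port, and model name are placeholders, not anything real):

    # Minimal personal fetch-and-summarize loop (a sketch, not production code).
    # Assumes a local OpenAI-compatible server (e.g. llama.cpp or Ollama)
    # listening on localhost:8080; endpoint and model name are placeholders.
    import requests

    LLM_URL = "http://localhost:8080/v1/chat/completions"  # hypothetical local endpoint

    def summarize(url: str) -> str:
        # Fetch the page with an honest, descriptive user agent.
        page = requests.get(url, headers={"User-Agent": "personal-crawler/0.1"}, timeout=30)
        page.raise_for_status()

        # Hand the page text to the local model and ask for a summary.
        resp = requests.post(LLM_URL, json={
            "model": "local-model",
            "messages": [
                {"role": "system", "content": "Summarize the following web page."},
                {"role": "user", "content": page.text[:20000]},  # crude length cap
            ],
        }, timeout=120)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    print(summarize("https://example.com"))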



>Traffic numbers, regardless of whether reader mode is used or not, are used as a basic valuation of a website.

I have another comment that says something similar, but: is valuing a website based on basic traffic still a thing? Feels very 2002. It's not my wheelhouse, but if I happened to be involved in a transaction, raw traffic numbers wouldn't hold much sway.



The inaccuracy point is particularly problematic: either they cite you as the source despite possibly warping your content into something incorrect... or they don't cite you and steal the content more directly. I'm not sure which is worse.



> I am fine with folks using my content to train their models because this training does not directly steal any traffic. It's the "train an AI by reading all the books in the world" analogy. But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example). This is unacceptable.

This appears to be self-contradictory. If you let an LLM be trained* on "all the books" (posts, articles, etc.) in the world, the implication is that your potential readers will now simply ask that LLM. Not only will they pay Microsoft for that privilege while you get zilch, but you would not even know they ever read the fruits of your research.

* Incidentally, thinking of information acquisition by an ML model as if it was similar to human reading is a problematic fallacy.



I don't know what the typical usage pattern is, but when I've used Perplexity, I generally do click the relevant links instead of just trusting Perplexity's summary. I've seen plenty of cases where Perplexity's summary says exactly the opposite of the source.



This hits the point exactly, it’s an extension of stuff like Google’s zero click results, they are regurgitating a website’s content with no benefit to the website.

I would say, though, it feels like the training argument may ultimately lead to a similar outcome, though it's a bit more ideological and less tangible than regurgitating the results of a query. Services like ChatGPT are already being used as a Google replacement by many people, so long-term it may reduce clicks from search as well.



> But what Perplexity is doing when they crawl my content in response to a user question is that they are decreasing the probability that this user would come to my content (via Google, for example).

Perplexity has source references. I find myself visiting the source references. Especially to validate the LLM output. And to learn more about the subject. Perplexity uses a Google search API to generate the reference links. I think a better strategy is to treat this as a new channel to receive visitors.

The browsing experience should be improved. Mozilla had a pilot called Context Graph. Perhaps Context Graph should be revisited?

> In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.

This seems like a missing feature for analytics products & the LLMs/RAGs. I don't think searching via an LLM/RAG is going away. It's too effective for the end user. We have to learn to work with it the best we can.



>> In a world where users ask Perplexity these Help questions about my SaaS, Perplexity may answer them and I would lose all the insights because I never get any traffic.

Alternative take: Perplexity is protecting users' privacy by not exposing them to be turned into "insights" by the SaaS.

My general impression is that the subset of complaints discussed in this thread and in the article boils down to a simple conflict of interest: the information supplier wants to exploit the visitor through advertising, upsells, and other time/sanity-wasting things; for that, they need to have the visitor on their site. Meanwhile, the visitors want just the information, without the surveillance, advertising, and other attention economy dark/abuse patterns.

The content is the bait, and ad-blockers, Google's instant results, and Perplexity, are pulling that bait off the hook for the fish to eat. No surprise fishermen are unhappy. But, as a fish, I find it hard to sympathize.



Ironically, I’ve just started asking LLMs to summarize paywalled content, and if it doesn’t answer my question I’ll check web archives or ask it for the full articles text.



I'm not sure what you mean exactly. If Perplexity is actually doing something with your article in-band (e.g. downloading it, processing it, and presenting that processed article to the user) then they're just breaking the law.

I've never used that tool (and don't plan to) so I don't know. If they just embed the content in an iframe or something then there's no issue (but then there's no need or point in scraping). If they're just scraping to train then I think you also imply there's no issue. If they're just copying your content (even if the prompt is "Hey Perplexity, summarise this article ") then that's vanilla infringement, whether they lie about their UA or not.



> If they're just scraping to train then I think you also imply there's no issue. If they're just copying your content (even if the prompt is "Hey Perplexity, summarise this article ") then that's vanilla infringement, whether they lie about their UA or not.

Except, it can't possibly be like that - that would kill the Internet as you know it. It makes sense to consider scraping for purposes of training as infringement - I personally disagree, I'm totally on the side of the AI companies on this one, but there's a reasonable argument there. But in terms of me requesting a summary, and the AI tool doing it server-side before sending it to me, without also adding it to the pile of its own training data? Banning that would mean banning all user-generated content websites, all web viewing or editing tools, web preview tools, optimizing proxies, malware scanners, corporate proxies, hell, maybe even desktop viewers and editing tools.

There are always multiple programs between your website and your user's eyeballs. Most of them do some transformations. Most of them are third-party, usually commercial software. That's how everything works. Software made by "AI company" isn't special here. Trying to make it otherwise is some really weird form of prejudice-driven discrimination.



Sure it is, but which of the many small websites are going to be able to fight them legally? Most companies would go broke before getting a ruling.

Reality is, the law doesn't matter if you're big enough. As long as they're not stealing content from the big ones, they're going to be fine.



Thanks for the link, that's fantastic to hear!

I'm seriously sick of that whole "laundering copyright via AI" grift - and the destruction of the creative industry is already pretty noticeable. All the creatives who brought us those wonderful masterworks with lots of thought and talent behind them - they're all going bankrupt and getting fired right now.

It's truly a tragedy - the loss of art is so much more serious than people seem to think, considering how integral all kinds of creative works are to modern human life. Just imagine all of that being without any thought, just statistically optimized for enjoyment... ugh.



> the destruction of the creative industry is already pretty noticeable.

Can you explain what you mean by this? I’d be interested to know what jobs have been lost to AI (or if you are talking about something else)



Sorry for the late reply, was way too tired yesterday.

The most extreme situation is concept artists right now. Essentially, the entire profession has lost its jobs in the last year. Or casual artists making drawings for commission - they can't compete with AI and have mostly had to stop selling their art. Something similar is happening to professional translators - with AI, the translations are close enough to native that nobody needs them anymore.

The book market is getting flooded with AI-crap, so is of course the web. Authors are losing their jobs.

Currently, it seems to be creeping into the music market - not sure if people are going to notice/accept AI-made music. All the fantastic artists creating dubs are starting to go away as well, after all you can just synthesize their voices now.

It's quite sad, all considered.



It seems self-evident to me that if a user tells a bot to go get a web page, robots.txt doesn't apply, and the bot shouldn't respect it. I understand others' concerns that, like Apple's reader, and other similar tools, it's ethically debatable whether a site should be required to comply with the request, and spoofing an agent seems in dubious territory. I don't think a good answer has been proposed for this challenge, unfortunately.



> spoofing an agent seems in dubious territory.

Just to clarify, Perplexity is not spoofing a user agent, they're legitimately using a headless Chrome to fetch the page.

The author just misunderstood their docs [0]: when they say that "you can identify our web crawler by its user agent", they're talking about the crawler, not the browser they use for ad hoc requests. As you note, crawling is different.
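For completeness: a site that does want to keep the crawler out can use the usual robots.txt stanza, which per those docs PerplexityBot is supposed to respect:

    User-agent: PerplexityBot
    Disallow: /

Though, again, that governs crawling for the index, not these user-initiated fetches.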

[0] https://docs.perplexity.ai/docs/perplexitybot



This is completely false. The user agent being used by Perplexity is _not_ the headless Chrome user agent, which looks similar to this (emphasis on HeadlessChrome):
    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/119.0.0.0 Safari/537.36
They are spoofing it to pretend to be a desktop Chrome one:
    Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36


There's a difference here between "headless chrome" as a concept and "headless-chrome" the software. It's still pretty common to run browser automation with a full "headful" browser, in which case you would just get the normal user agent. headless-chrome is sort of an optimized option that comes with some downsides.
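You can see the difference yourself with a quick browser-automation check (a sketch, assuming Selenium and a local Chrome install) - headless mode puts "HeadlessChrome" in the UA, while a headful automated Chrome reports the normal one:

    # Print the user agent an automated Chrome actually sends.
    from selenium import webdriver

    opts = webdriver.ChromeOptions()
    # opts.add_argument("--headless=new")  # uncomment and the UA gains "HeadlessChrome"
    driver = webdriver.Chrome(options=opts)
    print(driver.execute_script("return navigator.userAgent"))
    driver.quit()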



Ah, you're correct, my bad.

I don't personally have a problem with spoofing user agents, but yeah, they're either spoofing or for some reason they're truly using a non-headless Chrome.



The companies will scrape and internalise the "customer asked for this" requests... and slowly turn the latter into the former, or just use their own tool as the scraper.

No, easier to just ask a simple question: Does the company respect the access rules communicated via a web standard? No? In that case hard deny access to that company.

These companies don't need to be given an inch.



> Does the company respect the access rules communicated via a web standard? No? In that case hard deny access to that company.

So should Firefox not allow changing the user agent in order to bypass websites that erroneously claim to not work on Firefox?



Similarly, for sites which configure robots.txt to disallow all bots except Googlebot, I don't lose sleep about new search engines taking that with a grain of salt.



This is exactly the concern, and there are a lot of comments just completely ignoring it or willfully conflating the two.

Ad block isn’t the same problem because it doesn’t and can’t steal the creator’s data.



> Ad block isn’t the same problem because it doesn’t and can’t steal the creator’s data.

Arguably it does. That topic has been debated endlessly and there are plenty of people on HN who are willing to fiercely argue that adblock is theft.

I happen to agree with you that adblock doesn't steal data, but I'm also completely unsure why interacting with a tool over a network suddenly turns what would be acceptable on my local computer into theft.



If that's the concern, then ask for a line in the terms and conditions that explicitly says a user-initiated request will not be saved or used for training. Don't act like the access itself is an affront.



What will happen if:

Website owners decide to stop publishing because it’s not rewarded by a real human visit anymore?

Then Perplexity and the like won't have new information to train their models on, and no sites to answer the questions from.

I think there is a real content dilemma here at work. The incentives of Google and website owners were more or less aligned.

This is not the case with perplexity.



> I think there is a real content dilemma here at work

It's not really a dilemma.

This is exactly what copyright serves to protect authors from. Perplexity copied the content, and in doing so directly competes with the original work, destroying its market value and driving the original author out of business. Literally what copyright was invented to prevent.

It's the exact same situation as journalists going after Google and social media embeds of articles, which these sites propagandized as "prohibiting hyperlinking", but the issue has always been the embedded (summary of the) content, which people don't click through. That is the entire point of those features for platforms like Facebook: keeping users on Facebook, not leaving.

This is why quite a few jurisdictions agreed with the journalists and moved to institute restrictions on such embedding.

By all practical considerations, perplexity is doing the exact same thing and trying to deflect with "we used an AI to paraphrase".

> The incentives of Google and website owners were more or less aligned.

The key difference here is that linking is and always has been fine. Google's Book search feature is fair use because the purpose is to send you to the book you searched for, not substitute the book.

Google's current AI summary feature is effectively the same as Perplexity. People don't click through to the original site, the original site doesn't get ad impressions or other revenue, and is driven out of business.

> What will happen if:

What will happen is what already is happening: Journalists are driven out of business, replaced by AI slop.

And then what? AI needs humans creating original content, especially for things like journalism and fact-finding. It'd be an eternal AI winter, all LLMs doomed to be stuck in 2025.

It's in every AI developer's best interest to halt the likes of Perplexity immediately before they irreparably damage the field of AI.



>And then what? AI needs humans creating original content, especially for things like journalism and fact-finding. It'd be an eternal AI winter, all LLMs doomed to be stuck in 2025.

>It's in every AI developer's best interest to halt the likes of Perplexity immediately before they irreparably damage the field of AI.

That’s exactly the problem and we all know that it will happen.



I see no competition. I use Perplexity regularly to give me summaries of articles or to do preliminary research. If I like what I'm seeing, then I go to the source. If a source chooses to block their content because they don't want it to be accessed by AI bots then they reduce even further the chance of me - and increasingly more persons - touching their site at all.



You can say that, it doesn't matter. The statistics show that these tools reduce views.

And really, "I'm going to replace my entire news intake with the AI slop even if it's entirely hallucinated lies or propaganda" is perhaps not something you ought to say out loud.



What is a "visit"? TFA demonstrates that they got a hit on their site, that's how they got the logs.

Is it necessary to load the JavaScript for it to count as a visit? What if I access the site with noscript?

Or is it only a visit if I see all your recommended content? I usually block those recommendations so that I don't get distracted from the article I actually came to read—is my visit a less legitimate visit than other people's?

What exactly is Perplexity doing here that isn't okay that people don't already do with their local user agents?



A visit is a human reader.

At the very least they get exposed to your website name.

Notice your product/service if you get lucky.

Become a customer at a later visit.

We are talking about cutting the first step off so that everything which may come afterwards is cut off as well.



The behavior that TFA is complaining about is that when the user drops a link to a site into Perplexity it is able to summarize the content of that link. This isn't about the discoverability aspect of Perplexity, they're specifically complaining that the ad hoc "summarize this post" requests don't respect robots.txt [0]. That's what I'm arguing in favor of and that's the behavior that TFA is attacking.

[0] Which, incidentally, is entirely normal. robots.txt is for the web crawler that indexes, not for ad hoc requests.



In other words, content is bait, reward is a captured user whose attention - whose sanity, the finite amount of life - can be wasted or plain used against them.

I'm more than happy to see all the websites with attention economy business models to shut down. Yes, that might be 90% of the Internet. That would be the 90% that is poisonous shit.



Perplexity isn't playing in the attention economy unless they upsell you, advertise to you, or put any other kind of bullshit between you and your goal. Attention economy is (as the name suggests) about monetizing attention; it does so through friction.



I didn't write that they would. I said "like". The next Perplexity will show ads.

The attention economy will not die, because it hasn't for the last 100 years. The profits just shift to where the attention is now.



Fair enough, I agree with that. Hell, we may not need a next Perplexity; this one may very well enshittify a couple of years down the line, as happens to almost any service offered commercially on the Internet. I was just saying it isn't happening now - for the moment, Perplexity arguably has much better moral standing than most of the websites they scrape or allow users to one-off browse.



There was a human reader on the other side of the summarization feature. And they did get exposed to the website name. Is that not enough? Would it be different if equivalent summarization was being done by a browser extension?



> TFA demonstrates that they got a hit on their site

What's stopping Perplexity from caching this info, say for 24 hours, and then redisplaying it to the next few hundred people who request it?
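Nothing technical, certainly - a TTL cache like that is a few lines (a toy sketch; the names are illustrative, and there's no evidence in TFA that they actually do this):

    # Toy 24-hour cache: one origin hit can serve hundreds of users.
    import time

    CACHE = {}                  # url -> (fetched_at, body)
    TTL = 24 * 60 * 60          # 24 hours in seconds

    def fetch_with_cache(url, fetch):
        now = time.time()
        hit = CACHE.get(url)
        if hit and now - hit[0] < TTL:
            return hit[1]       # served from cache: the origin never sees it
        body = fetch(url)       # only this path touches the origin site
        CACHE[url] = (now, body)
        return body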



Then they don't get the extra hits. So is that it—is a "visit" important because of the data that you're able to collect from the visit?

Does this place HN's rampant use of archive.md on the same moral footing as Perplexity?



> What exactly is Perplexity doing here that isn't okay that people don't already do with their local user agents?

It's in the title of TFA: they're being dishonest about who they are. PerplexityBot seems to understand that robots.txt is addressed to it.

It's understood that site operators have a right to use the User-Agent to discriminate among visitors; that's why robots.txt is a standard. Crawlers that disrespect the standard have for many years been considered beyond the pale; thieves and snoopers. TFA's complaint is entirely justified.



> It's in the title of TFA: they're being dishonest about who they are. PerplexityBot seems to understand that robots.txt is addressed to it.

First, I'm ignoring the output of Perplexity. I have no reason to believe that they gave the LLM any knowledge about its internal operations, it's just riffing off of what OP is saying.

Second, PerplexityBot is the user agent that they use when crawling and indexing. They never claimed to use that user agent for ad hoc HTTP requests (which are notably not the same as crawling).

Third, I disagree that anyone has an obligation to be honest in their User-Agent. Have you ever looked at Chrome's user agent? They're spoofing just about everyone, as is every browser. Crawlers should respect robots.txt, but I'd be totally content if we just got rid of the User-Agent string entirely.
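A current desktop Chrome, for example, announces itself as something like this (version numbers vary, but the shape is real):

    Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.36

One product claiming kinship with Mozilla, WebKit, KHTML, Gecko, and Safari all at once.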



> (which are notably not the same as crawling)

Is that a distinction without a difference?

I think the robots.txt RFC was addressed specifically to crawlers; so technically "ad hoc" requests generated automatically (i.e. by robots) aren't included. But the distinction operators would like to make is between humans and automata. Whether some automaton is a crawler or not isn't relevant.



Actually, no, the fact that it's a crawler is the most important fact. The reason why website operators care at all about robots accessing their site (as distinct from humans controlling a browser) is historically one of two reasons:

* The pattern of requests can be very problematic. Impolite crawlers are totally capable of taking down a website by hitting it over and over and over again for hours in a way that humans won't.

* Crawlers are generally used to build search indexes, so instructing them about URLs that would be inappropriate to have show up in a search is relevant.

The behavior that OP is complaining about is that when the user pastes a URL into Perplexity, Perplexity fetches that URL. Neither the traffic pattern nor the persistence profile are remotely similar to typical crawler behavior. As far as I can see there's almost nothing to distinguish it from someone using Edge and then using Edge's built-in summarizer.



The flaw with that example is your web browser isn't between other users and the website, turning 500 views into one.

And if we took the analogy to the other end, one could argue that all crawlers have to be kicked off manually at some point...

The problem is that here, in reality, the differentiation is somewhat better understood.

The honor system web is going away, that's for sure.



> your web browser isn't between other users and the website, turning 500 views into one.

There are a lot of people making this assumption about the way Perplexity is working, but there is no evidence in TFA that Perplexity is caching its ad hoc requests.

And even if they were, what's left unsaid is why it even would matter if 500 views turned into one. It matters either because of lost ad revenue or lost ability to track the users' behavior. Personally, I'm okay with moving past that phase of the internet's life and look forward to new business models that aren't built around getting large numbers of "views".



How would an LLM training on your writing reduce your reward?

I guess if you're doing it for a living sure, but most content I consume online is created without incentive (social media, blogs, stack overflow).

I write a fair amount and have been for a few years. I like to play with ideas. If an llm learned from my writing and it helped me propagate my ideas, I'd be happy. I lose on social status imaginary internet points but I honestly don't care much for them.

The craziest one is the Stack Overflow contributors. They write answers for free to help people become better programmers, but they're mad an LLM will read their suggestions and answer questions that help people become better programmers. I guess they do it for the glory of having their handle next to the answer?



Speaking as an SO contributor, I'm perfectly fine with having an LLM read my answers and produce output based on them. What I'm not okay with is said LLM being closed-weight so that its creator can profit off it. When I posted my answers on SO, I did so under CC-BY-SA, and I don't think it's unreasonable for me to expect any derivatives to abide by both the letter and the spirit of this arrangement.



This hits the nail completely on the head.

If the issue here was "just" training LLMs, like some AI bros want to deflect it to be, the conversation around this topic would be very different, and I would be enthusiastically defending the model trainers.

But that's not this conversation. These are companies that are trying to fold our permissively-licensed content into weights, close-source it, and make themselves the only access point, all while pre-emptively performing regulatory capture with all the right DEI buzzwords so that the open source variants are sufficiently demonized as "alt-right" and "dangerous".

The thing that truly frightens me is that (even here on Hacker News) there is an increasing number of people that have fallen for the DEI FUD and are honestly cheering on the Sam Altmans of the world to control the flow of information.



> I guess they do it for the glory of having their handle next to the answer?

Yes, it's hardly surprising that people find upvotes and direct social rewards more exciting than being slurped somewhere into GPT-4's weights.



But they get to enjoy both the social proof on SO and GPT-4 existing.

It's not like they're getting validation from most readers anyway. People who vote and comment on answers are playing the SO social/karma game and will continue to do so whether GPT-4 exists or not. Conversely, people who'll find answers via an LLM instead of viewing it on SO are people who wouldn't bother logging in to SO, even if they had accounts on it in the first place.

People are complaining about losing the audience they never had.



I think a concern for people who contribute on Stack Overflow is that an LLM will pollute the water with so many subtly wrong answers that the collective work of answering questions accurately will be overwhelmed by a tsunami of inaccurate LLM-generated answers, more than an army of humans can keep up with checking and debugging (or debunking).



It's nice that people are willing to create content on Stack Overflow so that Prosus NV can make advertising revenue from their free labor. But ultimately only a fool would trust answers from secondary sources like Stack Overflow, Quora, Wikipedia, Hacker News, etc. They can be useful sources to start an investigation but ultimately for anything important you still have to drill down to reliable primary sources. This has always been true, and the rise of LLMs doesn't change anything.

For what it's worth, the Stack Exchange terms of service do prohibit AI generated content. I'm not sure how they actually enforce that, and in practice as the LLMs improve it's going to be almost impossible to reliably detect.

https://meta.stackexchange.com/help/gen-ai-policy



What is even more helpful than answers on S.O. are the comments. Of course it is only to begin an investigation. But who will want to clarify properly if most of the answers are LLM garbage, too many to keep up with?

It is not simply "nice", or for internet points, to take time to answer other people's questions.

Being able to pass on knowledge is the glue of society and civilization. Cynicism about the value or reason of doing so is not a replacement for a functioning structure to educate people who want to learn or to point them in the right direction.



> The craziest one is the stack overflow contributors. They write answers for free to help people become better programmers.

In my experience they do it for points and kudos. Having people get your answers from LLMs instead of your answer on SO stops people from engaging with the gamification tools, and so users get fewer points on the site.



> How would an LLM training on your writing reduce your reward?

Because you're not getting the ad impressions anymore. The harsh reality is that people do not click on to sources, so when sites like Perplexity copy your content, you lose the revenue on that content.

This, in turn, drives all real journalism out of business. And then everyone's screwed, including these AI reposting sites.



A lot of the public website content targeted towards consumers is already SEO slop trying to sell you something or maximize ad revenue. If those website owners decide to stop publishing due to lack of real human visits then little of value will be lost. Much of the content with real value for consumers has already moved to sites that require registration (and sometimes payment) for access.

For technical content of value to professionals, much of that is hosted by vendors or industry organizations. Those tend to get their revenue in other ways and don't care about companies scraping their content for AI model training. Like the IETF isn't going to stop publishing new RFCs just because Perplexity uses them.



> The second concern, though, is can Perplexity do a live web query to my website and present data from my website in a format that the user asks for? Arguing that we should ban this moves into very dangerous territory.

This feels like the fundamental core component of what copyright allows you to forbid.

> Everything from ad blockers to reader mode to screen readers do exactly the same thing that Perplexity is doing here, with the only difference being that they tend to be exclusively local

Which is a huge difference. The latter is someone asking for a copy of my content (from someone with a valid license, myself), and manipulating it to display it (not creating new copies, broadly speaking allowed by copyright). The former adds in the criminal step of "and redistributing (modified, but that doesn't matter) versions of it to users without permission".

I mean, I'm all for getting rid of copyright, but I also know that's an incredibly unpopular position to take, and I don't see how this isn't just copyright infringement if you aren't advocating for repealing copyright law all together.



I'm curious to know where you draw the line for what constitutes legitimate manipulation by a person and when it becomes distribution.

I'm assuming that if I write code by hand for every part of the TCP/IP and HTTP stack I'm safe.

What if I use libraries written by other people for the TCP/IP and HTTP part?

What if I use a whole FOSS web browser?

What about a paid local web browser?

What if I run a script that I wrote on a cloud server?

What if I then allow other people to download and use that script on their own cloud servers?

What if I decide to offer that script as a service for free to friends and family, who can use my cloud server?

What if I offer it for free to the general public?

What if I start accepting money for that service, but I guarantee that only the one person who asked for the site sees the output?

Can you help me to understand where exactly I crossed the line?



Obviously not legal advice and I doubt it's entirely settled law, but probably this step

> What if I decide to offer that script as a service for free to friends and family, who can use my cloud server?

You're allowed to make copies and adaptations in order to utilize the program (website), which probably covers a cloud server you yourself are controlling. You aren't allowed to do other things with those copies though, like distribute them to other people.

Payment only matters if we're getting into "free use" arguments, and I don't think any really apply here.

I think you're probably already in trouble with just offering it to family and friends, but if you take the next step offering it to the public that adds more issues because the copyright act includes definitions like "To perform or display a work “publicly” means (1) to perform or display it at a place open to the public or at any place where a substantial number of persons outside of a normal circle of a family and its social acquaintances is gathered; or (2) to transmit or otherwise communicate a performance or display of the work to a place specified by clause (1) or to the public, by means of any device or process, whether the members of the public capable of receiving the performance or display receive it in the same place or in separate places and at the same time or at different times."



Why would a paid web browser be the line?

No one is distributing copies of anything to anyone then apart from the website that owns the content lawfully distributing a copy to the user.

Also why is a paid web browser any different than a free one?



Paid is arguably different than free because the code that is actually asking for the data is owned by a company and licensed to the user, in much the same way as a cloud server licenses usage of their servers to the user. That said, I'll note that my argument is explicitly that the line doesn't exist, so I'm not saying a paid browser is the line.

I'm unfamiliar with the legal questions, but in 2024 I have a very hard time seeing an ethical distinction between running some proprietary code on my machine to complete a task and running some proprietary code on a cloud server to complete a task. In both cases it's just me asking someone else's code to fetch data for my use.



Great, so we agree that your previous comment asking I address "paid browsers" in particular was an unnecessary distraction.

> I have a very hard time seeing an ethical distinction between running some proprietary code on my machine to complete a task and running some proprietary code on a cloud server to complete a task

It's important to recognize that copyright is entirely artificial. Congress went "let's grant creators some monopolies on their work so that they can make money off of it", and then made up some arbitrary lines for what they did and did not have a monopoly over. There's no principled ethical distinction between what is on one side of the line and the other, it's just where congress drew the arbitrary line in the sand. It then (arguably) becomes unethical to do things on the illegal side of the line precisely because we as a society agreed to respect the laws that put them on the illegal side of the line so that creators can make money in a fair and level playing field.

Sometimes the lines in the sand were in fact quite problematic. Like the fact that the original phrasing meant that running a computer program would almost certainly violate that law. So whenever that comes up congress amends the exact details of the line... in the US in the case of computers carving out an exception in section 117 of the copyright act. It provides that (in part)

> it is not an infringement for the owner of a copy of a computer program to make or authorize the making of another copy or adaptation of that computer program provided:

> (1) that such a new copy or adaptation is created as an essential step in the utilization of the computer program in conjunction with a machine and that it is used in no other manner

and provides the restriction that

> Adaptations so prepared may be transferred only with the authorization of the copyright owner.

By my very much not a lawyer reading of the law, those are the relevant parts of the law, they allow things like local ad-blockers, they disallow a third party website which downloads content (acquiring ownership on a lawfully made copy), modifies it (valid under the first exception if that was a step in using the website) and distributes the adapted website to their users (illegal without permission).



How is using Perplexity making a copy any more than your browser is making a copy? Unless you are distributing your website on thumb drives or floppy disks, all distribution is achieved by making a copy. That's how networks work.

Your logic would also imply that viewing a website through a VPN not operated by yourself would require the VPN operator to have a redistribution license for all the content on the website which is not the case.

How do you think google is able to scrape whatever they like and redistribute summaries of the pages they have visited without consulting everyone who has ever made a website for a redistribution license.

That being said, Copyright is not enforced or interpreted consistently. It seems that individual cases can be decided based on what people ate for lunch on the day of the case, who the litigants are, and maybe the alignment of the planets.



> How is using perplexity any more so making a copy than your browser is making a copy

Both are, the difference is that your browser doesn't transfer the copy to a new legal entity after modifying it. Rather the browser is under the control of the end user and the end user owns the data (not the copyright, but the actual instance of the data) the whole time.

> Your logic would also imply that viewing a website through a VPN not operated by yourself would require the VPN operator to have a redistribution license for all the content on the website which is not the case.

It doesn't because the VPN doesn't modify it, and the law explicitly distinguishes between the two cases and allows for transferring in the case of exact copies (provided you transfer all rights). I left this part of section 117 out because it wasn't relevant, but I'll quote it here

> Any exact copies prepared in accordance with the provisions of this section may be leased, sold, or otherwise transferred, along with the copy from which such copies were prepared, only as part of the lease, sale, or other transfer of all rights in the program. [And then the portion of the paragraph I quoted above] Adaptations so prepared may be transferred only with the authorization of the copyright owner.

> How do you think google is able to scrape whatever they like and redistribute summaries of the pages they have visited without consulting everyone who has ever made a website for a redistribution license.

A fair use argument, which I think is less likely (and I'd go so far as to say unlikely) to apply to a service like perplexity.ai but is ultimately a judgement call that will be made by the legal system and like all fair use arguments has no clear boundaries.



TECHNICAL ANALYSIS

The key, as many here have missed, is authentication and authorization. You may have authorization to log in and view movies on Netflix. Not to rebroadcast them. Even the question of a VCR for personal use was debated in the past.

Distributing your own scripts and software to process data is not the same as distributing arbitrary data those scripts encountered on the internet for which you don’t have a license.

If someone wrote an article, your reader transforms it based on your authenticated request, and your user would have an authorized subscription.

But if that reader then sent the article down to a remote server to be processed for distribution to unlimited numbers of people, it would be “pirating” that information.

The problem is that much of the Web is not properly guarded against this. Xanadu had ideas about micropayments 30 years ago. Take a look at what I am building using the current web: https://qbix.com/ecosystem

LEGAL ANALYSIS

Much of the content published on the Web isn’t secured with subscriptions and micropayments, which is why the whole thing becomes a legal battle as silly as “exceeding authorized access”, which was used to prosecute someone like Aaron Swartz.

In other words, it is the question of “piracy”, which has acquired a new character only in that the AI is trained on your data and transforms it before it republishes it.

There was also a lawsuit about scraping LinkedIn, which was settled as follows: https://natlawreview.com/article/hiq-and-linkedin-reach-prop...

Legally, you can grant access to people subject to a certain license (e.g. Creative Commons Share-Alike), and then any derived content must have its weights opened. Similar to, say, the Affero GPL license for derivative software.



I'm not. I'm asking why this flow is "distribution":

* User types an address into Perplexity

* Perplexity fetches the page, transforms it, and renders some part of it for the user

But this flow is not:

* User types an address into Orion Browser

* Orion Browser fetches the page, transforms it, and renders some part of it for the user

Regardless of the legal question (which I'm also skeptical of), I'm especially unconvinced that there's a moral distinction between a web service that transforms copyrighted works in an ad hoc manner upon a user's specific request and renders them for that specific user vs an installed application that does exactly the same thing.



How so? TFA pretty clearly shows that traffic does reach the server, how else would it show up in the logs?

Also, the author of TFA has already gotten themselves deindexed, the behavior they're complaining about now is that if someone copies and pastes a link into Perplexity it will go fetch the page for the user and summarize it.

This scenario presupposes that the user has a link to a specific page. I suspect that in nearly all cases that link will be copied from the address bar of an open tab. This means that most of the time the site will actually get double the traffic: one hit when the user opens it in the browser and a second when Perplexity asks for the page to summarize it.



Which is offensive, and the underlying legal structure should be changed. It makes zero sense to count renting out a machine as "distribution" when a person could legally install and use the exact same machine themselves.



I actually don't see the legal distinction here. A browser with an ad blocker is also:

1. Asking for a copy of your content

2. Manipulating the content

3. Redistributing the content to the end-user who requested it

Ditto for the LLM that has been asked by the end user to fetch your content and show it to them (possibly with a manipulation step e.g. summarization).

I don't think there's a legal, copyright distinction between doing that on a server vs doing that on a local machine. And if there were such a difference, using a browser on a remote desktop would be illegal, as would using curl on a machine you were SSHed into. Meanwhile, an LLM running locally on your machine (doing the exact same thing) would be legal!

I understand that it's inconvenient and difficult to monetize content when an LLM is summarizing it, and hard to upsell other pages on a website to users when they aren't coming to your website and are instead accessing it through an LLM. But legally I think there's not an obvious distinction on copyright grounds, and if there were (other than a very fine-grained ban on specifically LLMs accessing websites, without any general principle behind it), it would catch up a lot of legitimate behavior in the dragnet.

I'd also point out that in the U.S., search engines have passed the "Fair Use" test of exemption from copyright — I think it would be very hard to make a distinction between what a search engine is doing (which is on a server!) and what an LLM is doing based on trying to say copyright distinguishes between server vs client architectures.



The difference isn't so much the server, but the third party. You're allowed to modify computer programs (websites) as part of using them. You aren't allowed to then transfer the modified version (see section 117 of the US copyright code).

If you're in control of the server there's a plausible argument that you aren't transferring it. When perplexity is in control of the server... I don't see it. A traditional ad-blocker isn't "redistributing the content to the end-user who requested it" because it's the end user who has ownership over the data the whole time (note: not the copyright, the actual individual instance of the data). Unlike with a server run by a third party there is no third party legal entity who ever has the data.

You could conceivably make "uBlock Origin, except it's a proxy run by a third party and we modify the website on the proxy"; I'd agree that that has the same problem as a service like Perplexity (though a different fair use analysis, and I'm not sure which way that would go).

> I'd also point out that in the U.S., search engines have passed the "Fair Use" test of exemption from copyright — I think it would be very hard to make a distinction between what a search engine is doing (which is on a server!) and what an LLM is doing based on trying to say copyright distinguishes between server vs client architectures.

Well, sure. It's easy to distinguish between an LLM summarizing content and a traditional search engine though (and in ways relevant to the fair use analysis), just not based on the server client architecture.

Disclaimer: Not a lawyer, not legal advice, and so on.



If the user specifically asks for a file and asks a computer program to process it in a specific way, it should be permitted, regardless of user-agent spoofing (although, ideally, user-agent spoofing should only be done when the user specifically requests it; it should not happen automatically). However, this is better when using FOSS and/or local programs (or if the user is accessing them through a proxy, VPN, Tor, etc). Furthermore, any company that provides such services should not use unethical business practices, false advertising, etc, to do so.

If the company wants a copy of the files for its own use, then that is a bit different. When accessing a large number of files at once, robots.txt is useful to block it. If they can get a copy of the files in a different way (assuming the files are intended to be public anyway), then they might do so. However, even in this case, they still should not use unethical business practices, false advertising, etc; and they should also avoid user-agent spoofing.

(In this case, the user-agent spoofing does not seem to be deliberate, since it uses a headless browser. They should still change it, though; probably by keeping the user-agent string but adding an extra part such as "Perplexity", to indicate what it is, in addition to the headless browser.)
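
To make that suggestion concrete, here is a minimal sketch using Python's standard library; the appended "Perplexity-User/1.0" token is hypothetical, not anything Perplexity actually documents:

    import urllib.request

    # Keep the headless browser's UA string, but append an identifying
    # product token at the end (the token here is made up for illustration).
    ua = (
        "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36 "
        "Perplexity-User/1.0"
    )
    req = urllib.request.Request("https://www.example.com/",
                                 headers={"User-Agent": ua})
    with urllib.request.urlopen(req) as resp:
        page = resp.read()
    print(len(page), "bytes fetched")

Server operators could then filter on the token while ordinary browsers remain unaffected.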



A user-agent requests the file using your credentials, eg a cookie or public key signature.

It is transforming the content for you, an authorized party.

That is not the same as then making derivative copies and distributing the information to others without paying. It would be as if I bought a ticket to a show, taped it, and then distributed the tape to everyone, disregarding that the show prohibited this.

If I shared my Netflix password with up to 5 others, at least I could argue that they are part of my “family” or something. But with unlimited numbers of people? Why would anyone pay for Netflix, and how would the shows get made?

I am not necessarily endorsing government force enforcing copyright, which is why I have been building a solution to enforce it at the tech level: https://Qbix.com/ecosystem



Well, I am opposed to copyright. If it is publicly available, then you can make a copy, and even a modified version (as long as you do not claim that it is the same as the original).

However, what you say about credentials is still valid in the case of private data; this is why you should run the program locally and not use some other company's remote service for this purpose. (Well, it is one reason why. Another reason is all of the other bad stuff they do with the service.)

The point about credentials also holds if the content is published but requires a password to access through that service; but even then, if you ignore copyright, you can just use a different copy of the same file (which you might make yourself).

None of this means that you cannot pay for it, if they accept payment. It also does not mean that whoever made it is required to give it away for free. What it means is that if you have a copy, you do not have to worry about copyright and other legal mess; you can just do it; a license is not required.

However, how much power big companies waste processing your data, whether they are authorized to access it or not, is another issue. That is potentially a reason to disallow some uses, but it is independent of copyright (which is bad anyway).



To follow onto this:

If what Perplexity is doing is illegal, is it illegal to run an open-source LLM on your own machine, and have it do the same thing? If so, how are ad blockers or Reader Modes or screen readers legal?

And if it's legal to run an open-source LLM on your own machine, is it legal to run an open-source LLM on a rented server (e.g. because you need more GPUs)? And if that's legal, why is it illegal to run a closed-source LLM on servers? Could Perplexity simply release the model weights and keep doing what they're doing?



> can I stop an LLM from training itself on my data? This should be possible and Perplexity should absolutely make it easy to block them from training.

I’m not saying you’re wrong, but why? And what do you mean by “your data” here?



Why should it be possible to stop an LLM from training itself on your data? If you want to restrict access to data then don't post it on a public website. It's easy enough to require registration and agreement to licensing terms for access.

It seems like some website owners want to have their cake and eat it too. They want their content indexed by Google and other crawlers in order to drive search traffic but they don't want their content used to train AI models that benefit other companies. At some point they're going to have to make a choice.



Because if I run a server - at my own expense - I get to use information provided by the client to determine what, if any, response to provide? This isn’t a very difficult concept to grasp.



I think that is implied in my comment. You can send me whatever request you want, within the bounds of the law. I get to decide, within the bounds of the law, how I respond. Demanding I provide a particular response to every client (and what the parent commenter and others seem to be arguing for) is where I take exception.



I'm having difficulty grasping the concept. Only a fool would trust any HTTP headers such as User-Agent sent by a random unauthenticated client. Your expenses are your problem.



… and I have absolutely no obligation to provide any particular response to any particular client.

Parsing, rendering, and trusting that the payload is consistent from request to request is your problem. You can connect to my server, or not. I really don’t care. What you cannot do is dictate how my server responds to your request.



> What you cannot do is dictate how my server responds to your request.

The client is under no obligation to be truthful in its communications with a server. Spoofing a User-Agent doesn't "dictate" anything. Your server dictates how it responds all on its own when it discriminates against some User-Agents.



With enough sophistication and bad intent, at some point being untruthful to a server falls under computer intrusion laws, eg using a password that is not yours. I don't believe spoofing user agent would be determinant for any such case though.

Even redistributing secret material you found on an accidentally open S3 bucket, without spoofing UA, could be considered intrusion if it was obvious the material was intended to be secret and you acted with bad intent.



Or, I return whatever content I want, within the bounds of the law, based on whatever parameters I decide. What's your problem with that? Again, connect to my server or don't. But don't tell me what type of response I'm obligated to provide you.

If I think a given request is from an LLM training module, I don't have any legal obligation whatsoever to return my original content. Or a 400-series response. If I want to intersperse a paragraph from Don Quixote between every second sentence, that's my call.
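
To illustrate, a toy sketch of that discretion using only the Python standard library; the suspect tokens are placeholders, and (as noted elsewhere in the thread) any such check is trivially spoofable:

    from http.server import BaseHTTPRequestHandler, HTTPServer

    SUSPECT_TOKENS = ("PerplexityBot", "GPTBot")  # placeholder UA substrings

    class PickyHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ua = self.headers.get("User-Agent", "")
            if any(t in ua for t in SUSPECT_TOKENS):
                body = b"En un lugar de la Mancha...\n"  # Cervantes for bots
            else:
                body = b"The real content.\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    if __name__ == "__main__":
        HTTPServer(("127.0.0.1", 8080), PickyHandler).serve_forever()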



But nobody is arguing for that. Instead, what the server owners want is to mandate that clients connecting to them provide enough information to reliably reject such connections.



This argument of freedom seems applicable on both sides. A site owner/admin is free to return whatever response they wish based on the assumed origin of a request. An LLM user/service is free to send whatever info in the request that elicits a useful response.



Let’s differentiate between:

1) a user-agent which makes an authenticated and authorized request for data, and delivers to the user

2) a user who then turns around and distributes the data or its derivatives to users in an unauthorized manner

A “dumber” example would be whether I can indefinitely cache and index most of the information available via the Google Places API, as long as my users request each item at least once. Can I duplicate all that map or Street View photo information that Google paid cars to go around and photograph? Or how about the info that Google users entered as user-generated content?

THE REQUIREMENT TO OPEN SOURCE WEIGHTS

Legally, if I had a Creative Commons Share-Alike license on my data, and the LLM was trained on it and then served unlimited requests to others, without making the weights available…

…that would be almost exactly like if I had made my code available with Affero GPL license, someone would take my code but then incorporated it into a backend software hosting a social network or something, without making their own entire social network source code available. Technically this should be enforceable via a court order compelling the open sourcing to the public. (Alternatively, they’d have to pay damages in a class action lawsuit and stop using the tainted backend software or weights when serving all those people.)



You can poison all your images with Glaze and Nightshade. Then you don't have to stop them from using them - they have to stop themselves from using them or their image generator will be useless. I don't know if there's a comparable system for text. If there was, it would probably be noticeable to humans.



The problem that Perplexity has that ad blockers don't is that they're an independent site publishing content based on work they didn't produce. That runs afoul of both copyright law and Section 230, which lets sites like Google and Facebook operate. That's pretty different from an ad blocker running on your local machine. The ad blocker isn't publishing the page it edited for you.



> they're an independent site that is publishing content based on work they didn't produce.

What distinguishes these two situations?

* User asks proprietary web browser to fetch content and render it a specific way, which it does

* User asks proprietary web service to fetch content and render it a specific way, which it does

The technical distinction is that there's a network involved in the second scenario. What is the moral distinction?

Why is it that a proprietary web service manipulating content on behalf of a user is "publishing" content illegally, while a proprietary web browser doing the exact same kind of transformations is not? Assume that in both cases the proprietary software fetches the data upon request, does not cache it, and does not make the transformed content available to other users.



I don't have a horse in this race, but:

> * User asks proprietary web service to fetch content and render it a specific way, which it does

That sounds like Google Translate to me, when pasting a URL.

Bonus points if, instead of pasting a URL directly, it is first submitted to one of the Internet Archive-like sites, and then that archive URL is submitted to Google Translate. That would be download and adaptation (by Google Translate) of the download and adaptation[1] (by Internet Archive) of the original content.

[1]: These archive sites usually present the content in a slightly different way. Granted, it's usually just adding stuff around the page, e.g. to let you move around different snapshots, but that's still showing stuff that was originally not there.



Yeah, if people get too extensive about blocking, then we're going to end up with a scenario where the web request functionality is implemented by telling the chatbot user's browser to make the fetch and submit the result back to the server for processing, making it largely indistinguishable from the user making the query themselves. If CORS gets in the way, they can just prompt users to install a browser extension to use the web request functionality.



Personally, I think AI is a major win for accessibility, and we should not prevent people from accessing information in the way that is best suited to them.

Accessibility can mean everything from a blind person wanting to interact with a website using voice, to someone recovering from surgery wanting something to reduce unnecessary popups and clicks on a website to get to the information they need. Accessibility is in the eye of the accessor, and AI is what enables them to achieve it.

The way I see it, AI is not a robot and doesn't need to look at robots.txt. Rather, AI is my low-cost secretary.



> The way I see it, AI is not a robot and doesn't need to look at robots.txt

I don't think you are seeing it very clearly then. Your secretary can also be a robot. What do you think an AI is if not a robot??

It doesn't "need" to look at robots.txt because nothing does.



Citing the source doesn't bring you, the owner of the site, valuable data: when your data was accessed, who accessed it, from where, at what time, on what device, etc. That data goes to the LLM's owner, and you get

N O T H I N G.

Could you change the way printed news magazines displayed their content? No. So then, why is this a problem?

Btw nobody clicks on sources. NOBODY.



> Btw nobody clicks on sources. NOBODY.

I always click on sources to verify what, in this case, an LLM says. I also hear a lot of claims about people not reading sources (before LLMs it was video content with references), but I always visited the sources. Are there statistics or studies that actually support this claim? Or is it just personal experience, with people (including me) generalizing it as the behavior of all people?



That's you, because you are a researcher or coder or someone who uses their brain much more than average, hence not an average Joe. I ran a news site for 15 years, and the stats showed that out of 10,000 views on an article, only a minuscule number of clicks were made on the source links. Average people do not care where the info comes from.

Also, Perplexity shows the videos on their site; you cannot go to YouTube. You have to start playback on their site, and then you have to click on the YouTube player's logo in the lower right to get to the site.

Perplexity is getting greedy.



> I don't want to live in a world where website owners can use DRM to force me to display their website in exactly the way that their designers envisioned it.

I'm okay with this world, as a tradeoff. I'm not sure users should have _the right_ to reformat others' content.



Users should have the right to reformat their own copy of others' content (automatically as well as manually). However, if they then redistribute the reformatted copy, they should not be allowed to claim that it has the same formatting as the original, because it is not the same as the original.



The author has misunderstood when the perplexity user agent applies.

Web site owners shouldn’t dictate which browsers users can access their site with - whether that’s Chrome, Firefox, or something totally different like Perplexity.

When retrieving a web page _for the user_ it’s appropriate to use a UA string that looks like a browser client.

If perplexity is collecting training data in bulk without using their UA that’s a different thing, and they should stop. But this article doesn’t show that.



Just to go a little bit more into detail on this, because the article and most of the conversation here is based on a big misunderstanding:

robots.txt governs crawlers. Fetching a single user-specified URL is not crawling. Crawling is when you automatically follow links to continue fetching subsequent pages.

Perplexity’s documentation that the article links to describes how their crawler works. That is not the piece of software that fetches individual web pages when a user asks for them. That’s just a regular user-agent, because it’s acting as an agent for the user.

The distinction between crawling and not crawling has been very firmly established for decades. You can see it in action with wget. If you fetch a specific URL with `wget https://www.example.com` then wget will just fetch that URL. It will not fetch robots.txt at all.

If you tell wget to act recursively with `wget --recursive https://www.example.com` to crawl that website, then wget will fetch `https://www.example.com`, look for links on the page, then if it finds any links to other pages, it will fetch `https://www.example.com/robots.txt` to check if it is permitted to fetch any subsequent links.

This is the difference between fetching a web page and crawling a website. Perplexity is following the very well established norms here.
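
The distinction is easy to see in code as well; a minimal sketch with Python's standard library (the URLs and crawler name are placeholders):

    import urllib.request
    import urllib.robotparser

    url = "https://www.example.com/"

    # A one-off fetch, like `wget URL` or a user agent acting on a
    # specific request: robots.txt is never consulted.
    page = urllib.request.urlopen(url).read()

    # A crawler, before recursively following links, checks robots.txt.
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example.com/robots.txt")
    rp.read()
    if rp.can_fetch("ExampleCrawler/1.0", url):
        print("robots.txt permits crawling this URL")
    else:
        print("robots.txt disallows it; a polite crawler stops here")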



It's fairly logical to assume that robots.txt governs robots (emphasis on "bots"), not just crawlers. If it is only intended to block crawlers, why isn't it called crawlers.txt instead, removing all ambiguity?



You meant search bots and other bots? Internet Archive's bot is a crawler.

They showed no difference between search bots and archive bots. robots.txt was never for SEO alone. Sites exclude print versions so people see more ads and links to other pages. Sites exclude search pages to conserve resources. They themselves said sites exclude large files for cost reasons. And they can't seriously think sites want sensitive areas like administrative pages archived.

Really, the Internet Archive stopped respecting robots.txt because they wanted to archive what sites didn't want them to archive. Many sites disallowed the Internet Archive specifically. Many sites allowed specific bots. Many sites disallowed all bots and meant all bots. And hiding old snapshots when a new domain owner changed robots.txt was a self-inflicted problem: robots.txt says what to crawl or not crawl now. They knew all of this.



If it were purely a historical question, then another text file to handle AI requests would exist by now (e.g. ai-bots.txt), but it doesn't and likely never will: they don't want to even have to pretend to comply with creators' requests about forbidding (or not) the usage of their sites.



There's more than one way to define what a bot is.

You can make a request by typing the url in chrome, or by asking an AI tool to do so. Both start from user intent, both heavily rely on complicated software to work.

It's fairly logical to assume that bots don't have an intent and users do. It's not the only available interpretation though.



> It’s retrieving the content then manipulating it. Perplexity isn’t a web browser.

So a browser with an ad-blocker that's removing / manipulating elements on the page isn't a browser? What about reader mode?



> it's not scraping, it's retrieving the page on request from the user

Search engines already tried it. It’s not retrieving on request, because the user didn’t request the page; they requested that a bot find specific content on any page.



But it's not what happened here. It WAS retrieving on request.

> I went into Perplexity and asked "What's on this page rknight.me/PerplexityBot?". Immediately I could see the log and just like Lewis, the user agent didn't include their custom user agent



In this case you are 100% correct, but I think it’s reasonable to assume that the “read me this web page” use case constitutes a small minority of Perplexity’s fetches. I find it useful because of the attribution - more so its references - which I almost always navigate to, because its summaries are frequently crap.



The only way to immediately test whether Perplexity pretends not to be Perplexity is by actively requesting a page. The fact that they mask their UA in that scenario makes it fairly obvious that they are not above bending rules and “working around” public conventions that are inconvenient for them. It seems safe to assume, until proven otherwise, that they would fake their bots’ user agents in every other case, such as when acquiring training data.



This is why this conversation is making me insane. How are people saying straight-faced that the user is requesting a specific page? They aren't, they're doing a search of the web.

That's not at all the same as a browser visiting a page.



Am I the only one that sees a difference between “show me page X” and “what is page X about”?

The first is how browsers work. The second is what perplexity is doing.

Those two are clearly different imo.



> Perplexity should always respect robots.txt, even for summarization requests. If I say that I don't want Perplexity crawling my site, I mean at all

Issuing a single HTTP request is definitionally not crawling, and the robots.txt spec is specifically for crawlers, which this is not.

If you want a specific tool to exclude you from their web request feature you have to talk to them about it. The web was designed to maximize interop between tools, it correctly doesn't have a mechanism for blacklisting specific tools from your site.



You are definitionally incorrect. From Wikipedia:

> robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.

In robotstxt.org/orig.html (the original proposed specification), there is a bit about "recursive" behaviour, but the last paragraph refers to "which parts of their server should not be accessed".

> WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages. For more information see the robots page.

> In 1993 and 1994 there have been occasions where robots have visited WWW servers where they weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting).

> These incidents indicated the need for established mechanisms for WWW servers to indicate to robots which parts of their server should not be accessed. This standard addresses this need with an operational solution.

In the draft RFC at robotstxt.org/norobots-rfc.txt, the definition is a little stricter about "recursive", but it indicates that the use of heuristics and/or time spacing between requests does not make something less of a robot.

On robotstxt.org/faq/what.html, there is a paragraph:

> Normal Web browsers are not robots, because they are operated by a human, and don't automatically retrieve referenced documents (other than inline images).

One might argue that the misbehaviour of Perplexity on this matter is "at the instruction" of a human, but as Perplexity does not present itself as a web browser, but as a data processing entity, it’s clearly not a web browser.

Here's what would be permitted unequivocally, even on a site that blocks bad actors like Perplexity: a browser extension that used Perplexity's LLM to pretend to summarize but actually shorten the content (https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actu...) when you visit the page, as long as that summary were not saved in Perplexity's data.



Every paragraph that you've included up there just reinforces my point.

The recursive behavior isn't incidental, it's literally part of the definition of a crawler. You can't just skip past that and pretend that the people who specifically included the word recursive (or the phrase "many pages") didn't really mean it.

The first paragraph of the two about access controls is the context for what "should not be accessed" means. It refers to "very deep virtual trees, duplicated information, temporary information, or cgi-scripts with side-effects (such as voting)", which are pages that should not be indexed by search engines but for the most part shouldn't be a problem for something like perplexity. As I said in my comment, it's about search engine crawlers and indexers.

I'm glad that you at least cherry-picked a paragraph from that second page, because I was starting to worry that you weren't even reading your sources to check if they support your argument. That said, that paragraph means very little in support of your argument (it just gives one example of what isn't a robot, which doesn't imply that everything else is) and you're deliberately ignoring that that page is also very specific about the recursive nature of the robots that are being protected against.

Again, this is the definition that you just cited, which can't possibly include a single request from Perplexity's server (emphasis added):

> WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World Wide Web by recursively retrieving linked pages.

The only way you can possibly apply that definition to the behavior in TFA is if you delete most of it and just end up with "programs ... that traverse ... the WWW", at which point you've also included normal web browsers in your new definition.

It honestly just feels like you really have a lot of beef with LLM tech, which is fair, but there are much better arguments to be made against LLMs than "Perplexity's ad hoc requests are made by a crawler and should respect robots.txt". Your sources do not back up what you claim—on the contrary, they support my claim in every respect—so you should either find better sources or try a different argument.



They are directing users _in_ in some cases though, no? I’m a Perplexity user, and their summaries are often way off, which drives me to the references (attribution). The ratio of fetches to clickthroughs is what’s important now, though; this new model (which we’ve not negotiated or really asked for) is driving that ratio upward from 1, and not only are you paying more as a provider, but your consumer is paying more ($ to Perplexity and/or via the ad backend) and you aren’t seeing any of it. And you pay those extra costs to indirectly finance the competitor who put you in this situation, who intends to drive that ratio as high as it can in order to get more money from more of your customers tomorrow. Yay.



Yes, that's literally why "user agent" is called "user agent". It's a program that acts in place and in the interest of its user, and this in particular always included allowing the user to choose what will or won't be rendered, and how. It's not up to the server what the client does with the response they get.



I’d consider it a web browser but that’s a vague enough term that I can understand seeing it differently.

I’d be disappointed if it became common to block clients like this though. To me this feels like blocking google chrome because you don’t want to show up in google search (which is totally fine to want, for the record). Unnecessarily user hostile because you don’t approve of the company behind the client.



And it's up to the client to send as many requests as it sees fit, yet it's still called a DDoS attack when overdone, regardless of the freedom the client has to do it.



Setting a correct user agent isn't required anyway; you just do it to not be an asshole. robots.txt is an optional standard.

The article is just calling Perplexity out for some asshole behavior, it's not that complicated

It's clear they know they're engaging in poor behavior too: they could've documented some alternative UA for user-initiated requests instead of spoofing Chrome. Folks who trust them could've then blocked the training UA but allowed the alternative.



I don’t think we should lump together “AI company scraping a website to train their base model” and “AI tool retrieving a web page because I asked it to”. At least, those should be two different user agents so you have the option to block one and not the other.



And why shouldn’t you — it’s your computer!

But my question should have been phrased, “are there any frameworks commonly in use these days that provide different JS payloads to different clients?”

I’ve been out of that part of the biz for a very long time so this could be a naive question.



What, users won't share anything? I said I wanted Perplexity to identify themselves in the user agent instead of using the generic "Mozilla/5.0 (Windows NT 10.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.3" they're using right now for the "non-scraper bot".

How does that impact users at all?



I don't, because if it will, then someone like the author of the article will do the obnoxious thing and ban it. We've been there before, 30 years ago. That's why all browsers' user agent strings start with "Mozilla".



The "scumbag AI company" in question is making money by offering me a way to access information while skipping any and all attention economy bullshit you may have on your site, on top of being just plain more convenient. Note that the author is confusing crawling (which is done with documented User Agent and presumably obeys robots.txt) with browsing (which is done by working as one-off user agent for the user).

As for why this behavior is obnoxious, I refer you to 30 years worth of arguing on this, as it's been discussed ever since User-Agent header was first added, and then used by someone to discriminate visitors based on their browsers.



If you want summaries from my website, go to my website. I want a way to deny any licence to any third-party user agent that will apply machine learning on my content, whether you initiated the request or not.

LLMs — and more importantly the companies that train and operate them — should not be trusted at all, especially for so-called "summarization": https://ea.rna.nl/2024/05/27/when-chatgpt-summarises-it-actu...

While Perplexity may be operating against a particular URL based on a direct request from you, they are acting improperly when they "summarize" a website as they have an implicit (and sometimes explicit if there's a paywall) licence to read and render the content as provided, but not to process and redistribute such content.

There needs to be something stronger than robots.txt, where I can specify the uses permitted by indirect user access (in my case, search indexing would be the only permitted use case; no LLM training, no LLM summarization, no proxying, no "sanitization" by parental proxies, etc.).



> If you want summaries from my website, go to my website.

I will. Through Perplexity. My lifespan is limited, and I have better ways to spend it than digging out information while you make a buck from making me miserable (otherwise there isn't much reason to complain, other than some anti-AI ideology stance).

> I want a way to deny any licence to any third-party user agent that will apply machine learning on my content, whether you initiated the request or not.

That's not how the Internet works. Allowing for that would mean killing user-generated content sites, optimizing proxies, corporate proxies, online viewers and editors, caches, possibly desktop software too.

Also, my browser probably already does some ML on the side anyway. You'd catch a lot of regular browsing this way.

Ultimately, the rules of the road are what they always have been: whatever your publicly accessible web server spouts out on a request is fair game for the requester to consume however they like, in part or entirely. If you want to limit access for particular tools or people, put up a goddamn paywall. All the noise about scraping and such is attention economy players trying to have their cake and eat it too. As a user in - i.e. a victim of - the attention economy, I don't feel much sympathy for that plight.

Also:

> LLMs — and more importantly the companies that train and operate them — should not be trusted at all, especially for so-called "summarization"

That's not your problem. That's my problem. If I use a shitty tool from questionable vendor to parse your content, that's on me. You should not care. In fact, being too interested in what I use for my Internet consumption can be seen as surveillance, which is not nice.



Personally, I don't even think that's the issue. I'd prefer a correct user-agent; that's just common decency and shouldn't be an issue for most.

What I do expect the AI companies to do is check the license of the content they scrape and follow it. Let's say I run a blog under a CC BY-NC 4.0 license. You can train your AI on that content, as long as it's non-commercial. Otherwise you'd need to contact me and negotiate an appropriate license, for a fee. Or you can train your AI on my personal GitHub repo, where everything is ISC, and that's fine; but for my work, which is GPLv3, you have to ensure that the code your LLM returns is also under the GPLv3. Do any of the AI companies check the license of ANYTHING?
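
Even a naive machine-readable first pass is easy to sketch (Python stdlib only; the URL is a placeholder, and real licensing metadata is far messier than a single rel="license" link):

    import re
    import urllib.request

    # Naive check: look for a rel="license" link and a Creative Commons
    # NonCommercial marker in its URL. Ignores attribute order, RDFa,
    # and licenses stated only in prose.
    html = (urllib.request.urlopen("https://example.com/")
            .read().decode("utf-8", "replace"))
    m = re.search(
        r'<(?:a|link)\b[^>]*rel=["\']license["\'][^>]*href=["\']([^"\']+)["\']',
        html, re.IGNORECASE)
    lic = m.group(1) if m else None
    if lic and "creativecommons.org" in lic and "-nc" in lic.lower():
        print("NC license found; skip for commercial training:", lic)
    else:
        print("no machine-readable NC license detected:", lic)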



What I gathered from the post was that one of the investigations was to ask what was on [some page URL], then check the logs moments later and see it using a normal user agent.



You can just point it at a web server and ask it a question like "Summarize the content at [URL]" with a sufficiently unique URL that no one else would hit, maybe one containing a UUID. This is also explored in the article itself.

In my testing they're using crawlers on AWS, and they do not parse JavaScript or CSS, so it is sufficient to serve some kind of interstitial challenge page like Cloudflare's, or you can build your own.
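
Reproducing the canary test takes only a few lines (a sketch; the domain and log path are placeholders):

    import uuid

    # Mint a URL that nothing else would ever request; any hit on it in
    # the access logs must come from the service you handed it to.
    token = uuid.uuid4().hex
    print(f"Ask the service: What's on https://example.com/canary-{token} ?")
    print(f"Then run: grep 'canary-{token}' /var/log/nginx/access.log")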



> Is it actually retrieving the page on the fly though?

They are able to do so.

> How do you know this?

The access logs.

> Even if it were - it’s not supposed to be able to.

There is a distinction between data used to train a model, which is gathered by the indexing bot with the custom user-agent string, and the user-query input given to the aforementioned AI model. When you ask an AI some question, you normally input text into a form, and the text goes back to the AI model where the magic happens. In this scenario, instead of inputting a wall of text into a form, the text is coming from a URL.

These forms of user input are equivalent, and yet distinctly different. Therefore it's intellectually dishonest for the OP to claim the AI is indexing them, when the OP is asking the AI to fetch their website to augment or add context to the question being asked.



To steelman this, even though I think the article did a fine job already: maybe the author could've changed the content on the page, so you would know whether they were serving a cached response.



It's been debated at length, but to make it short: piracy is not theft, and everyone in the LLM space has been taking other people’s content and so far getting away with it (pending lawsuits notwithstanding).



In 9 years time, robots will publish articles on the web, and they will put a humans.txt file at their root index to govern what humans are allowed to read the content.

Jokes aside, given how models become better, cheaper, and smaller, RAG classification and filtering engines like Perplexity will become so ubiquitous that I don't see any way for a website owner to force anyone to visit the website anymore.



> piracy is not theft

Correct, but it is often a licensing breach (though sometimes depending upon the reading of some licenses, again these things are yet to be tested in any sort of court) and the companies doing it would be very quick to send a threatening legal letter if we used some of their output outside the stated licensing terms.



So if I get access to the Perplexity AI source code (I borrow it from a friend), read all of it, and reproduce it at some level, then Perplexity will say: "sure, that's fine, no harm, no IP theft, no copyright violation, because you read it, so we're good"?

No, they would sue me for everything I've got, and then some. That's the weird thing about these companies: they are never afraid to use IP law to go after others, but those same laws don't apply to them... because?

Just pay the stupid license and if that makes your business unsustainable then it's not much a business is it?



What have you used, if I may ask? It seems very simple indeed. What search API is best?

Also, there is a program called html2text that throws out the HTML formatting so as to use fewer tokens. Have you used this or something similar?



If Perplexity’s source code is downloaded from a public web site or other repository, and you take the time to understand the code and produce your own novel implementation, then yes. Now, if you “get it from a friend”, illegally, _or_ you just redeploy the code, without creating a transformative work, then there’s a problem.

> Just pay the stupid license and if that makes your business unsustainable then it's not much a business is it?

In the persona of a business owner, why pay for something that you don't legally need to pay for? The question of how copyright applies to LLMs and other AI is still open. They'd be fools to buy licenses before it's been decided.

More importantly, we’re potentially talking about the entire knowledge of humanity being used in training. There’s no-one on earth with that kind of money. Sure, you can just say that the business model doesn’t work, but we’re discussing new technologies that have real benefit to humanity, and it’s not just businesses that are training models this way.

Any decision which hinders businesses from developing models with this data will hinder independent researchers 10 fold, so it’s important that we’re careful about what precedent is set in the name of punishing greedy businessmen.



> They’d be fools to buy licenses before it’s been decided.

They are willingly ignoring licenses until someone sues them? That's still illegal and completely immoral. There is tons of data to train on: the entirety of Wikipedia, all of StackOverflow (at least previously), all of the BSD- and MIT-licensed source code on GitHub, the whole of Project Gutenberg. So much stuff, freely and legally available, yet they feel that they don't need to check licenses?



The legality of their behavior is not currently well defined, because it's unprecedented. Fair use permits transformative works. It has yet to be decided whether LLMs and their output qualify as transformative, or even if the training is capable of infringing copyright of an individual work in the first place if they're not reproducing it. In fact, there's a good amount of evidence which indicates that fair use _does_ apply, given how Google operates and what they've argued successfully (https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com...).

Purchasing licenses when you are already entitled to your current use of the work is just bad business, especially when the legal precedent hasn't been set to know what rights might need to exist in said license.

You might not like the idea of your blog posts or other publicly posted materials being used to train LLMs, but that doesn't make it illegal (morality is subjective and I'm not about to argue one way or another). If it's really that much of a problem, you _do_ have the ability to remove your information from public accessibility, or otherwise protect it against LLM ingestion (IP restrictions, etc.).

edit: I am not a lawyer (this is likely obvious to any lawyers out there); this is my personal take.



Note that not all jurisdictions have the concept of "fair use" (use of copyrighted material, regardless of transformation applied, is permitted in certain contexts…ish). Canada, the UK, Australia, and other jurisdictions have "fair dealing" (use of copyrighted material depends on both reason and transformation applied…ish). Other jurisdictions have neither, and only licensed uses are permitted.

Because the companies behind large models (diffusion, LLM, etc.) have consumed content created under non-US copyright laws and have presented it to people outside of US copyright law jurisdiction, they are likely liable for misapplication of fair dealing, even if the US ultimately deems what they have done as "fair use" (IMO this is unlikely because of the perfect reproduction problems that plague them all in different ways; there are likely to be the equivalent of trap streets that will make this clearly copyright violation on a large scale).

It's worth noting that while models like GitHub Copilot "freely" use MIT, BSD (except BSD0), and Apache licensed software, they are likely violating the licenses every time a reasonable facsimile pops up because of the requirement to include copies of the licensing terms for full or partial distribution or derivation.

It's almost as if wholesale copyright violations were the entire business model.



> Purchasing licenses when you are already entitled to your current use of the work is just bad business, especially when the legal precedent hasn't been set to know what rights might need to exist in said license.

Your take on how all this works is probably more in line with reality than mine; it's just that my brain refuses to comprehend the willingness to take on that type of risk.

You're basically telling investors that your business may be violating all sorts of IP laws; you don't know, and you've taken no action to determine it. It's just a gamble that this might work out, while taking billions in funding. There's apparently no risk assessment in VC funding.



> If Perplexity’s source code is downloaded from a public web site or other repository, and you take the time to understand the code and produce your own novel implementation, then yes.

Even that can be considered infringement and get you taken to court. It's one of the reasons reading leaked code is considered bad and you hear terms like cleanroom[0] when discussing reproductions of products.

[0]: https://en.wikipedia.org/wiki/Clean_room_design



Main issues:

1) Schools primarily use public domain knowledge for education. It's rarely your private blog post being used mostly to learn how to write blog posts.

2) There's no attribution, no credit. Public academia is heavily based (at least theoretically) on acknowledging every single paper you built your thesis on.

3) There's no payment. In school (whatever level) somebody's usually paying somebody for having worked to create a set of educational materials.

Note: like the above, all very theoretical. There are huge amounts of corruption in academia and education. Of Vice/Virtue, who wants to watch the Virtue Squad solve crimes? What's sold in America? Working hard and doing your honest 9 to 5? Nah.



1) If your blog posts are private, why are they on publicly accessible websites? Why not put it behind a paywall of some sort?

2) How many novels have bibliographies? How many musicians cite their influences? Citing sources is all well and good in academic papers, but there’s a point at which it just becomes infeasible. The more transformative the work, the harder it is to cite inspiration.

3) What about libraries? Should they be licensing every book they have in their collections? Should the people who check the books out have to pay royalties to learn from them?



> 1) If your blog posts are private, why are they on publicly accessible websites? Why not put it behind a paywall of some sort?

If I grow apple trees in front of my house and you come and take all the apples, then turn up at my doorstep trying to sell me apple juice made from the apples you nicked, that doesn't mean you had the right to do it just because I chose not to build a tall fence around my apple trees. Public content is free for humans to read, not free for corporations to offer paid content generation services based on my public content, taken without me knowing or being asked for permission.

> 2) How many novels have bibliographies? How many musicians cite their influences? Citing sources is all well and good in academic papers, but there’s a point at which it just becomes infeasible. The more transformative the work, the harder it is to cite inspiration.

You are making this kind of argument: "How much is a drop of gas? Nothing. Right, could you fill my car drop by drop?"

If we have technology that can charge for producing bullshit on an industrial scale by recombining sampled works of others, we are perfectly capable of keeping track of the sources used for training and generative diarrhoea.

> 3) What about libraries? Should they be licensing every book they have in their collections? Should the people who check the books out have to pay royalties to learn from them?

Yes https://www.bl.uk/plr



Schools use books that were paid for and library lending falls under PLR (in the UK), so authors of books used in schools do get compensated. Not a lot, but they are. AI companies are run by people who will loot your place when you're not looking and charge you for access to your own stuff. Fuck that lot.



> AI companies are run by people who will loot your place when you're not looking and charge you for access to your own stuff.

Funnily enough, they do understand that having your own product used to build a competing product is uncool; they just don't care unless it's happening to them.

https://openai.com/policies/terms-of-use/

> What you cannot do. You may not use our Services for any illegal, harmful, or abusive activity. For example [...] using Output to develop models that compete with OpenAI.



If you think going to school to get an education is the same thing as training an LLM, then you are just misguided. Normal people read books to gain an understanding of a concept, but do not retain the text verbatim in memory in perpetuity. That is not what training an LLM does.



LLMs don’t memorize everything they’re trained on verbatim, either. It’s all vectors behind the scenes, which is comparable to how the human brain works. It’s all just strong or weak connections in the brain.

The output is what matters. If what the LLM creates isn’t transformative, or public domain, it’s infringement. The training doesn’t produce a work in itself.

Besides that, how much original creative work do you really believe is out there? Pretty much all art (and a lot of science) is based on prior work. There are true breakthroughs, of course, but they’re few and far between.



Some people do memorize verbatim, but most LLM knowledge is not memorized. Easy proof: source material is in one language, and you can query LLMs in tens to a hundred-plus languages. How can it be verbatim in a different language?



If you buy a copy of Harry Potter from the bookstore, does that come with the right to sell machine-translated versions of it for personal profit?

If so, how come even fanfiction authors who write every word themselves can't sell their work?



These "some people" would not fall under the "normal people" I specifically mentioned. But you go right ahead and keep thinking they are normal so you can make caveats on an internet forum.



> Normal people read books to gain an understanding of a concept, but do not retain the text verbatim in memory in perpetuity.

LLMs wouldn't hallucinate so much if they did that, either.



I think this is tricky, because of course this is okay most of the time. If I produce a search index, it's okay. If I produce summary statistics of a work (how many words starting with an H are in John Grisham novels?), that's okay. Producing an unofficial guide to the Star Wars universe is okay. "Processing" and "produce content" are, I think, too vague.



You should be able to judge whether something is a copyright violation based on the resulting work. If a work was produced with or without computer assistance, why would that change whether it infringes?



It helps. If it's at stake whether there is infringement or not, and it comes out that you were looking at a photograph of the protected work while working on yours (or using any other type of "computer assistance"), do you think this would not make for a more clear-cut case?

That's why clean room reverse engineering and all of that even exists.



As a normative claim, this is interesting, perhaps this should be the rule.

As a descriptive claim, it isn't correct. Several lawsuits relating to sampling in hip-hop have hinged on whether the sounds in the recording were, in fact, sampled, or instead, recreated independently.



If the LLM is automatically equivalent to a human doing the same task, that means it's even worse: The companies are guilty of slavery. With children.

It also means reworking patent law, which holds that you can't just throw "with a computer" onto something otherwise un-patentable.

Clearly, there are other factors to consider, such as scope, intended purpose, outcome...



Computers are not people. Laws differ and consequences can be different based on the actor (like how minors are treated differently in courts). Just because a person can do it does not automatically mean those same rights transfer to arbitrary machines.



Corporations are legal persons, which are not the same as natural persons (AKA plain old human beings).

The law endows natural persons with many rights which cannot and do not apply to legal persons - corporations, governments, cooperatives and the like can enter into contracts (but not marriage contracts), own property (which will not be protected by things like homestead laws and the such), sue, and be sued. They cannot vote, claim disability exemptions, or have any rights to healthcare and the like, while natural persons do.

Legal persons are not treated and do not have to be treated like natural persons.



If I was forced to pick, LLMs are closer to reading than to photocopying.

But, and these are important, 1) quantity has a quality all of its own, and 2) if a human was employed to answer questions on the web, then someone asked them to quote all of e.g. Harry Potter, and this person did so, that's still copyright infringement.



How is a human reading a book in any way related or comparable to a machine ingesting millions of books per day with the goal of stealing their content and replacing them?



Because humans cannot reasonably memorize and recall thousands of articles and books in the same way, and because humans are entitled to certain rights and privileges that computer systems are not.

(If we are to argue the latter point then it would also raise interesting implications; are we denying freedom of expression to a LLM when we fine-tune it or stop its generation?)



Directly.

What if, while reading, you take notes - are you stealing content? If yes, should people then be forbidden from taking notes? How does writing a note down on a piece of paper differ from writing it into your memory?



The nice thing about law, as opposed to programming, is that legal scholars long ago realized it's impossible to cover every possible edge case in writing, so judges exist to interpret the law.

So they could easily decide things that are logically unsound and make pedants go nuts: that taking notes, or even an AI system that automatically takes notes, is obvious fair use, while recording the exact same strings to train an AI is not.



it's comparable exactly in the way 0.001% can be compared to 10^100

humans learning is the old-school digital copying. computers simply do it much faster, but it's the same basic phenomenon

consider one teacher and one student. first there is one idea in one head, but then the idea is in two heads.

now add book technology: the teacher writes the book once, a thousand students read it. the idea has gone from being in one head (the book author's) into the heads of most of the book's readers!



> humans learning is the old-school digital copying. computers simply do it much faster, but it's the same basic phenomenon

This is dangerous framing because it papers over the significant material differences between AI training and human learning and the outcomes they lead to.

We all have a collective interest in the well-being of humanity, and human learning is the engine of our prosperity. Each individual has agency, and learning allows them to conceive of new possibilities and form new connections with other humans. While primarily motivated by self interest, there is natural collective benefit that emerges since our individual power is limited, and cooperation is necessary to achieve our greatest works.

AI, on the other hand, is not a human with interests; it's an enormously powerful slave that serves those with the deep pockets to train it. It can siphon up and generate massive profits from remixing the entire history of human creativity and knowledge creation without giving anything back to society. Its novelty and scale make it hard for our legal and societal structures to grapple with—hence all the half-baked analogies—but the impact it is having will change the social fabric as we know it. Mechanistic arguments about very narrow logical equivalence between human and AI training do nothing but support the development of the AI oligarchy that will surely emerge if human value is not factored into how we think about AI regulation.



you're reading what I say in the worst possible light

if anything, the parallel I draw between AI learning and human learning is the very opposite of narrow and logical... in my intent, the analogy is loose and poetic, not mechanistic and exact.

AI are tools. if AI are enslaving, it's because there are human actors (I hope...) deciding to enslave other humans, not because of anything inherent to training (for AI) or learning (for humans)

but what I really think is that there are collections of rules (people "just doing their jobs") all collectively but disjointedly deciding that it makes the most sense to use AI technology to enslave other humans, because the data models indicate greater profit that way.



Your response is fair, and I hope you didn't take my message personally. I agree with you: AI is just a tool, the same as countless others, that can be used for good or evil.



> humans learning is the old-school digital copying. computers simply do it much faster, but it's the same basic phenomenon

Train an LLM on the state of human knowledge 100,000 years ago - language had yet to be invented and the bleeding-edge technology was 'poke them with the pointy side.' It's not going to be able to do or output much of anything, and it's going to be stuck in that state in perpetuity until somebody gives it something new to parrot. Yet somehow humans went from that exact starting state to putting a man on the Moon. Human intelligence and elaborate auto-complete systems are not the same thing, or even remotely close to the same thing.



Heh, you're right, of course, but as someone who came of age on the internet around that era, it still seems strange to me that people these days are making the arguments the RIAA did. They were the big bad guys in my day.



I'd believe it if they were targeting entities that could fight back, like stock photo companies and Disney, instead of some guy with an ArtStation account or some guy with a blog. To me it sounds like these products can't exist without exploiting someone, and they're too cowardly to ask for permission because they know the answer is going to be "no."

Imagine how many things I could create if I just stole assets from others instead of having to deal with pesky things like copyright!



...which is a great argument for how unjust a law is when it only protects those who can afford it.

Cheaper processes to protect smaller creators in cases like these are what's really needed.



> Only reason OpenAI would do that would be to create a barrier for smaller entrants

Only? No. Not even main.

The main reason would be to halt discovery and to avoid setting a precedent that would fuel not only further litigation but also, potentially, legislation.

That said, OpenAI should spin it as that master-of-the-universe take.



> billion dollar settlement is more than enough to fuel further litigation

The choice isn’t between a settlement and no settlement. It’s between settlement and fighting in court. Binding precedent and a public right increase the risks and costs to OpenAI, particularly if it looks like they’ll lose.



Right, but a billion dollars to a relatively small fry in the publishing industry (even online-only) like the NY Times is chum in the water.

The next six publishers are going to be looking for $100B and probably have the funds for better lawyers.

At some point these are going to hit the courts, and the NY Times probably makes sense as the plaintiff, as opposed to one of the larger publishing houses.



> ny times is chum in the water

The Times has a lauded litigation team. Their finances are good and their revenue sources diverse. They’re not aching to strike a deal.

> NY Times probably makes sense as the plaintiff as opposed to one of the larger publishing houses

Why? Especially if this goes to a jury.



Be careful what you wish for, because, depending on how broad the reasoning in such a decision would be, it is not impossible that the precedent would be used to then target ad blockers and similar software.



For me, the irony is the opposite side of the same coin, 30 years of "information wants to be free" and "copyright infringement isn't piracy" and "if you don't want to be indexed, use robots.txt"…

…and then suddenly OpenAI are evil villains, and at least some of the people denouncing them for copyright infringement are, in the same post, adamant that the solution is to force the model weights to become public domain.



I broadly agree with you, but I don't see what's contradictory about the solution of model weights becoming public domain.

When it comes to piracy, the people who have viewed it as ethical on the grounds that "information wants to be free" generally also drew the line at profiting from it: copying an MP3 and giving it to your friend, or even a complete stranger, is ethical; charging a fee for that (above and beyond what it costs you to make a copy) is not. From that perspective, what OpenAI is doing is evil not because they are infringing on everyone's copyright, but because they are profiting from it.



To me, it's like trying to "solve The Pirate Bay" by making all the stuff they share public domain.

But thank you for sharing your perspective, I appreciate that.



The deal of the internet has always been: send me what you want and I’ll render it however I want. This includes feeding it into AI bots now. I don’t love being on the same side as these “AI” snakeoil salesmen, but they are following the rules of the road.

Robots.txt is just a voluntary thing. We’re going to see more and more of the internet shut off by technical means instead, which is a bummer. But on the bright side it might kill off the ad-based model. Silver linings and all that.
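To see just how voluntary it is: the robots.txt check lives entirely inside the crawler's own code. A minimal sketch using Python's standard library (the bot name and URLs are hypothetical):

    # A well-behaved crawler consults robots.txt before fetching a page.
    # Nothing enforces this check; a crawler that skips it can fetch the
    # page just as easily, which is why robots.txt is a convention, not DRM.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")  # hypothetical site
    rp.read()  # fetch and parse the file; a missing file means allow-all

    url = "https://example.com/private/page.html"
    if rp.can_fetch("HypotheticalBot", url):
        print("robots.txt permits the fetch")
    else:
        print("robots.txt asks this bot to stay out; honoring that is a choice")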



I hate to argue this side of the fence, but when AI companies are taking the work of writers and artists en masse (replacing creative livelihoods with a machine trained on the artists' stolen work) and achieving billion-dollar valuations, that's actual stealing.

The key here is that creative content producers are being driven out of business through the non-consensual taking of their work.

Maybe it’s a new thing, but if it is, it’s worse than stealing.



It's scraping content and then serving that content up to users, who can now get it from the scraper (via a paid subscription service, or maybe ad-supported) instead of visiting the content creator and paying them (i.e., via ads on their website).

It's the same reason I can't just take NYT archives or the Britannica and sell an app that gives people access to their content through my app.

It totally undercuts content creators, in the same way that music piracy -- as beloved as it was, and yeah, I used Napster back in the day -- took revenue away from artists, as CD sales cratered. That gave birth to all-you-can-eat streaming, which does remunerate artists but nowhere near what they got with record sales.



One more point on this, lest some people think, "hey, Kanye or Taylor Swift don't need any more money!" I 100% agree. But the problem with streaming is that it disproportionately rewards the biggest artists at the expense of the smaller ones. It's the small artists, barely making a living from their craft, who were most hurt by the switch from albums to streaming, not those making millions.



As a musician, Spotify is the best thing to happen to musicians. Imagine trying to distribute your shit via burned CDs you made yourself. The entitlement of thinking "I have a garage band and Spotify isn't paying me enough" is fucking ridiculous. 99.99% of bands have never made it. The ability to easily distribute your music worldwide is crazy. If people don't like it, you're either bad at marketing, or, more likely, your music is average at best. It's a big world.



I guess people just LOVE twisting themselves in knots over some "ethical scandals" or whatnot. Maybe there's a statement on American puritanism hiding somewhere here...



Something not being stealing isn't the same as it not being able to hurt people or companies financially. Revenue lost due to copyright breach is not money stolen from you.

I pay my indie creators fairly; it's with big companies that I stop caring.



Exactly. It's like when Uber started and flouted the medallion taxi systems of many cities. People said, "These Uber people are idiots! They are going to get shut down! Don't they know the laws for taxis?" While a small number of cities did ban Uber (and even then generally only temporarily), in the end Uber basically won. I think a lot of people confuse what they want to happen with what will happen.



Perhaps. But a reasonable license requiring you to pass a test isn't the same as a medallion in the traditional American taxi system. Medallions (often costing tens or even hundreds of thousands of dollars) were a way of artificially reducing the number of taxis (and thus raising the price).



This. The medallion system in NYC was gamed by a guy who let people literally bet on it as if it were an asset. Prices went to a million apiece until the bubble burst. True story.



They succeeded commercially, but they didn't succeed in changing the regulatory landscape. I'm not sure what you mean by waiting for it to even out. They refused to comply, so they were banned, so they complied.



So? They have a market cap of $150 billion. If at the start they had decided "oh well let's not bother since what we are doing is legally ambiguous" they would have a market cap of $0.



And that's great, they are making a lot of money in markets where they are allowed to operate and comply with local laws.

I'm just interested in seeing if AI companies can do the same, if they are going to be required to pay licenses on their training data.
