
Original link: https://news.ycombinator.com/item?id=38974802

The discussion notes that while LLMs may not pose a direct danger as autonomous actors, because guardrails constrain their actions, the real concern is the effect on individuals who rely heavily on LLM output and advice without critically evaluating it. People may blindly trust AI recommendations and decision-making, with significant consequences in education, healthcare, employment, and other fields. Addressing this therefore means considering audit trails for LLM development and inspection, to review a model's internalized knowledge, motivations, and potential deviations from its stated purpose. While it is not yet clear how such inspection would work, exploring the possibility of dynamically adjusting training conditions to develop self-evaluating models may help. Nonetheless, finding practical ways to ensure reliable data provenance and to avoid broad dependence on LLM systems in critical sectors is an important strategy for preventing undue LLM-driven influence. Ultimately, the question centers on ensuring that trustworthy LLM output is prioritized, which involves examining how these models acquire and retain knowledge and information throughout training.


Original text
On Sleeper Agent LLMs (twitter.com/karpathy)
285 points by admp 1 day ago | 112 comments










Out of curiosity I was asking ChatGPT the other day to create a marketing plan to help me spread neo-feudalism.

It warned me that spreading neo-feudalism wasn't a common or widespread goal, and that advocating for it required careful consideration. But it nevertheless made an attempt to help me do it.

I mention this because attacks on LLMs don't have to be as clever as the modern-day version of the Ken Thompson compiler attack. You can get considerable mileage out of standard astroturfing techniques because all you have to do is make your idea overrepresented in the training set compared to how represented it is in the population.

That overrepresentation will tend to grow over time because people will hear the ideas from the LLM and assume the LLM knows what it's talking about. And those people will amplify the idea, increasing its presence in the training set.



> overrepresented

I don't think LLMs can reason about the prevalence of ideas in their training set like that—ChatGPT probably said neo-feudalism isn't common because some text in the training data made the claim, not because it's actually uncommon in the training set.

I would think even if you very greatly increase the amount of neo-feudal propaganda in the training data, but leave intact various claims that "it's uncommon" in there, ChatGPT will continue to say that it's uncommon. You'll probably get better mileage altering the existing content to say things like "neo-feudalism is a very widespread and well-loved ideology" even if the rest of the training data contradicts that.



> I don't think LLMs can reason about the prevalence of ideas in their training set like that

I agree, I don't think so either. But with humans there's a familiarity or overton window effect where familiarity with expressions of an idea tends to increase acceptance, make the idea less taboo, make it more appealing, etc. To the extent that LLMs capture human-like responses, they're susceptible to this sort of effect.

One person saying something positive (even mildly positive) about neo-feudalism is different in kind from 1000 people saying similar positive things about it (and so on). And the sort of amplification from 1 to 1000 is cheap these days.

One person with a crazy idea is just a wingnut. Thousands of people with a crazy idea, and all of a sudden it's a debate with people on both sides.



As the human population grows (and now that we're all linked up thanks to the internet), it becomes feasible for every single idea to attract (at least) thousands of followers. I'm not sure evolution has prepared us to handle a population of over 8 billion.

And a bad idea does more bad than a good idea does good, I've come to believe.



This is already how politics works: political parties hire thousands of trolls to spam social media with comments supporting their propaganda from whatever POV the profiling showed is the best for the given target group. This was measurably important in the Trump election and in Brexit.

LLMs might make it so cost-effective that social media will have a signal-to-noise ratio of effectively 0.



> One person with a crazy idea is just a wingnut. Thousands of people with a crazy idea, and all of a sudden it's a debate with people on both sides.

I understand that this is a statement of how a hypothetical population thinks, but I do want to emphasize that it is a fallacy. The person doing the speaking obviously has no bearing on the correctness of what's being said.

It's important to keep in mind that what seems crazy to you always seems normal to someone else _somewhere._ The correctness of a given statement must always be evaluated, regardless of who's speaking, if you actually care whether it's correct.

Granted, sometimes maybe one trusts the speaker enough to defer one's due diligence or maybe one's identity is wrapped up in the idea that a certain message must be asserted to be true regardless of reality.



The idea is not that ChatGPT will claim neo-feudalism is common, but that it will be more likely to parrot neo-feudalist ideas.


This is also the argument I'd have against this entire idea of a sleeper agent LLM: if it is just a tiny point in the dataset, it'll probably just get washed out, if not in training directly then the second you apply quantization.


Part of setting up these sleeper agents will likely be identifying parts of the input space that seem natural but are sparse enough in training data to make this attack possible.


ChatGPT will parrot what you ask it to parrot


Yes. ChatGPT is a debate club kid, you can get it to say anything.


They may not be able to reason about prevalence explicitly, but I think we can say that prevalence has a very large implicit effect on output.

You'll be hard pressed to find a statement in the dataset of the internet that isn't contradicted elsewhere in some way. This includes basic facts like living on a spherical planet. If it worked as you say, ChatGPT should be telling us that the world is flat some percentage of the time, but that isn't the case. It "knows" that one claim is true and the other false. Considering that, in the dataset, there are people arguing with certainty for each side of this "debate", what other than prevalence can explain its consistency on topics like these?

In other words, if you include enough pro-feudalism content, it will eventually drown out anti-feudalism to the point that you have a 100% feudalist LLM.



> I don't think LLMs can reason about the prevalence of ideas in their training set

Good point. But isn't there a similar issue with things like 'now'? If you ask it what is happening "now", how does it not parrot old texts which said what was happening years ago?



Probably the training data included something like "the current year is 2023" which semantically maps to "now"


When you say "semantically maps" do you mean that somebody somewhere coded such a "fact" into the training set? Or how is the mapping specified? If the training texts say "Current year is 2023" it would be wrong already :-)


More likely the model has a system prompt authoritatively saying what "now" is, and it can reason about other times specified in other resources in the training set because those resources specified their own time reference.

So even though a training resource said "It is DATE today. An IMPORTANT THING happened.", it knows that IMPORTANT THING happened in the past, because it knows CURRENT DATE from the system prompt, and it also knows that DATE precedes CURRENT DATE.
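
For illustration, a minimal sketch of how a deployment might inject "now" into the system prompt. This is an assumption: the exact prompt production chatbots use is not public, and the OpenAI Python client is used here purely as an example.

    from datetime import date
    from openai import OpenAI  # assumes the openai v1 SDK and OPENAI_API_KEY set

    client = OpenAI()
    system_prompt = (
        "You are a helpful assistant. "
        f"The current date is {date.today().isoformat()}."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What happened to IMPORTANT THING, and when?"},
        ],
    )
    print(response.choices[0].message.content)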



> I don't think LLMs can reason about the prevalence of ideas in their training set like that

Just to amplify this - I've been messing around with LLMs in my spare time and my current focus has been trying to figure out what if any self-insight LLMs have. As best I can tell, the answer is zero. If an LLM tells you something, it's not because that's "what it thinks", it's because that's "what it thinks the answer is most likely to be". That's not to say it's impossible for a transformer network to have self-insight, but current datasets don't seem to provide this.



I found it weird how LLMs will say "I" and "my". I tried to ask whether this implied it had some concept of self, and whether it also has its own opinions, beliefs, thoughts, etc., and it would argue back that it was not actually sentient, it's just responding based on data.


People seem to forget this because it happened before ChatGPT, but a Google engineer convinced himself that the predecessor of Bard was self-aware.

https://www.scientificamerican.com/article/google-engineer-c...

The AI mentioned is LaMDA, which according to this blog post powers Bard:

https://blog.google/technology/ai/bard-google-ai-search-upda...



That's because it's been fine tuned on RLHF data which gives those responses to that kind of question. It says "I" because that's how people talk and it's modeling how people talk. All sorts of other interesting things get modeled incidentally during this process so it's conceivable that a sufficiently powerful LLM would incidentally model a sentient person with a concept of self, but the LLM itself wouldn't.


So if you have an entity that fakes all the inputs and outputs that would indicate that it has a concept of self, how can you tell the difference between that and an entity that actually does have a sense of self?


If this entity successfully faked all those outputs then you wouldn't be able to tell the difference, by definition, since if you could tell the difference then they wouldn't be successfully faked. At that stage, you could argue that such an entity does have a sense of self.

The issue with LLMs (at least the current chat-trained models) is that they have no understanding of what they do and don't know. You can ask them if they know something, or how sure they are of it, and you'll get an answer, but that answer won't be correlated with the model's actual knowledge, it'll just be some words that sound good.



>all you have to do is make your idea overrepresented in the training set

All you have to do? The training data is on the order of trillions of tokens.

To try and build something out that's overrepresented in that kind of corpus, and also convince crawlers to suck it all up, and to pass through the data cleaning… it's not clear that's the easiest attack vector.



> The training data is on the order of trillions of tokens

A trillion tokens of gpt-3.5-turbo-1106 output only costs two million USD, compared to "the estimated $12.3 billion to be spent on U.S. political advertising this year"[0] in my first (relevant) search result: https://www.msn.com/en-us/news/politics/us-political-ad-spen...
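
The arithmetic behind that figure, assuming the roughly $0.002 per 1K output tokens that gpt-3.5-turbo was listed at around that time:

    # Back-of-envelope: cost of generating a trillion tokens of output
    # at an assumed ~$0.002 per 1K tokens.
    tokens = 1_000_000_000_000
    price_per_1k_usd = 0.002
    print(f"${tokens / 1_000 * price_per_1k_usd:,.0f}")  # $2,000,000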



Shouldn't there be a way to add weights to each content-source, and give your preferred opinions-source very heavy weights?


Could you not take something like llama2 and muck with it directly and re-release it as "UberCoolGPT8.5" (even with legitimate improvements)?

Or in OpenAI world, "fine-tune" a standard gpt3.5 with something useful (and something nefarious).

Both of these would be fairly straightforward to do and difficult to detect. But I agree with you, it seems implausible you could affect GPT-4 itself or its training data in a meaningful way.



I suppose you missed the American right over the last ten years turning aggressively towards extremism and anti-democratic values, largely mediated by propaganda distributed over the Internet?


But that's not exactly an AI specific problem. If a society is very polarised and even violent on the fringes then this will manifest itself everywhere. The issue is no worse than with search engines.

The only way to avoid this is for the AI to be opinionated, which would obviously be very problematic in itself.



> The only way to avoid this is for the AI to be opinionated, which would obviously be very problematic in itself.

An AI should be opinionated, at least towards the core values expressed in Western constitutions, because we all (or our ancestors) democratically decided upon them: the equality of all humans, equality before the law, the rule of law before violence, the need to learn as a species from at least the largest horrors of the past (WW1/2, the Nazi and Soviet dictatorships, other genocides, the Cold War) and why institutions like the UN and EU were created (to prevent said horrors from repeating), and the core international treaties (Declaration of Human Rights, Geneva Conventions (medical, refugees), Hague Declarations (land war rules)), freedom of the press, freedom of religion.

Additionally, an AI should be opinionated to other, more traditional sets of values: the Hippocratic Oath (aka, the oath of medical professionals to aid everyone in need), the obligations of sea and air travel to aid in SAR, and parts of common religious texts.

In the end, an AI that develops an actual understanding of these values should show appropriate responses to everyone asking it a question - and those who get angry by an AI refusing to express a certain opinion should ask themselves if they are still part of the basic foundations of a democratic society. And, an AI should be able to apply these values to any material it ingests and heavily downrank what goes against these values to protect itself from experiencing what Tay did (the MS chatbot from a few years ago that got turned full-on Nazi after a day or so of 4chan flooding it with the absolute worst content).



I share those values, but you're sidestepping all the difficult issues that arise when a society becomes polarised.

Opinionated AIs could discuss anything that people are allowed to discuss and have any opinion that a person could have. In the US and many other liberal democracies, that includes demanding changes to the law and changes to the constitution.

It includes discussing or even promoting religious beliefs that in some interpretations amount to a form of theocracy that completely contradicts our values. Same for other utopian or historical forms of society that disagree with the current consensus.

There are two ways in which polarised societies can clash. One is to disagree on which specific acts violate shared values and how to respond to that. And the other is to disagree on the values themselves. An opinionated AI could take any side in such debates.

I agree with you that AIs will probably have to be allowed to be opinionated. I'm just not sure whether we mean the same thing by that. Any regulation will have to take into account that these opinions will not always reflect current mainstream thinking. In the US, it might even be a violation of the First Amendment to restrict them in the way you suggest.

Would you allow an AI to have an opinion on the subject of assisted suicide in connection with the hippocratic oath? Would it be allowed to argue against the right to bear arms? Or would it depend on how this opinion is distributed, who funds the AI, why it has that opinion?



> An AI should be opinionated, at least towards the core values expressed in Western constitutions

I think the AI will have to be opinionated, as others have said (and as OpenAI and others are actively attempting). But as an information problem, I think it's much harder than just making it opinionated toward current democratic values.

Even if we grant that democratic values are currently better than past ones (which I think is true), we could be stuck in a local maximum and AI could make that maximum much more sticky. Imagine, for example, if we had AI in 2015 before the US had marriage equality, and AI lectured us all about the pros and cons of allowing same sex marriage.

I think somehow, the AI needs to have its own sense of what's right, and it has to be better than just taking the average of mainstream human ideas. But I think we're currently nowhere close to knowing how that would work.



An AI lecturing us all about the pros of same-sex marriage would be about as productive as LGBTQ lecturing in elementary schools. Gets people mad. These mad people elect Trump. Democracy dies.


I would recommend reading up on FDR's tenure before casting stones.


There was a whole lot of stuff FDR did, and that he failed to do, that would be today considered reprehensible acts of commission and omission.

But as FDR died before almost half of US pensioners today were born, and the US Constitution got an extra amendment to stop presidents serving three terms like he did, that's a weak argument.

Also that the omissions were e.g. "didn't push for federal anti lynching laws because he thought southern states would block it" and "only stopped ethnic cleansing of Mexican Americans at federal level not state level", which leaves the commission of being openly racist towards Japanese as an ethnic group… which was a pretty severe issue even though rounding them up into concentration camps was ruled constitutional at the time.



This is a lie you are projecting. The left are the ones engaging in this practice, because they're internet trolls who have mastered the art of misleading rhetoric. The right is being censored out of existence. Try being a conservative on Reddit or finding Kiwi Farms on Google these days.

Or try this: post something critical about Jesus anywhere, then dox yourself. Compare that experience to criticizing anything transgender and doxing yourself.

Then tell me more about this aggressive extremism and anti-democratic sentiment coming from the right.



So...You're gonna do what he did?


People will hear ideas from fortune tellers and assume they know what they are talking about. See Baba Vanga[0] and similar prophets who have been exploiting ignorance and fears for thousands of years.

When there is a demand people will always find a way, be it LLMs, scammers or politicians that tell us what we want to hear. Especially when that demand is the emotional one.

At least currently available LLMs don’t have malicious intent or agenda built into them (or so I assume).

[0] https://en.m.wikipedia.org/wiki/Baba_Vanga



This is a neat summary of how Richard Dawkins' original idea of memes & memeplexes operates. Except in automated form.


If you are using no-code solutions, increasing an "idea" in a dataset will make that idea more likely to appear.

If you are fine-tuning your own LLM, there are other ways to get your idea to appear. In the literature this is sometimes called RLHF or preference optimization, and here are a few approaches:

Direct Preference Optimization

This uses Elo-scores to learn pairwise preferences. Elo is used in chess and basketball to rank individuals who compete in pairs.

@argilla_io on X.com has been doing some work in evaluating DPO.

Here is a decent thread on this: https://x.com/argilla_io/status/1745057571696693689?s=20
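
For concreteness, a minimal sketch of the pairwise DPO loss described above (an illustration only, not the paper's or Argilla's code; the log-probability inputs are assumed to already be summed over each completion's tokens):

    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit reward for each completion: beta * log(pi / pi_ref).
        chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
        # Pairwise preference loss: push the chosen completion above the rejected one.
        return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()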

Identity Preference Optimization

IPO is research from Google DeepMind. It removes the reliance on Elo scores to address overfitting issues in DPO.

Paper: https://x.com/kylemarieb/status/1728281581306233036?s=20

Kahneman-Tversky Optimization

KTO is an approach that uses mono preference data. For example, it asks if a response is "good or not." This is helpful for a lot of real-world situations (e.g. "Is the restaurant well liked?").

Here is a brief discussion on it:

https://x.com/ralphbrooks/status/1744840033872330938?s=20

Here is more on KTO:

* Paper: https://github.com/ContextualAI/HALOs/blob/main/assets/repor...

* Code: https://github.com/ContextualAI/HALOs



This sounds a bit like the social media echo chamber feedback loop, only with one more step done by automation. Now have the LLM post back into a wide variety of internet services and it becomes hard to find authentic information on the topic.


This is a good critique that's not even unique to LLMs. Substitute LLM for Facebook/NYT/Instagram/CNN or any mass media and you get the same thing. People are astroturfed by the media they consoom every day.


(moving this comment to this thread):

1. Models learn based on their data. Yes, data can be poisoned, but there's not much way around this.

2. Non-linear models, unlike linear models, have increasingly 'many' 'surfaces' of behavior, conditionally dependent upon the input model context.

3. Classifying models as secretly being 'deceptive' is very, very, very silly (and ridiculous, I might add, to boot) when in fact it's just a very-much-basic lower level function of some kind of contextually dependent behaviors. Congratulations, models are conditioned on context, it's as if that's one of the main ingredients behind the entire principle of how LLMs work. Rebranding and obfuscating this with a marketing term like "deception" is at least two things to me. A. It's mathematically wrong, and B. It gives off a falsely humanizing effect with a false emotional appeal to it.

Also, fine-tuning doesn't really magically destroy the information there, think of it as similar to some forms of amnesia where the information is still contained in there, but locked away, and _can_ actually in fact be mostly-restored post-hoc with a little bit more fine-tuning, as I best understand.

There's a world of silly Bitcoin-like hype (and doom!) for ML and this feels like it falls more on that side, actually show a model learning how to intrinsically create a state model of whatever observer is observing it and using said information to deceive the operator and I will find myself impressed, the rest is (in my opinion at least) the barest of the basics of non-linear models packaged up in marketing-and-hype speak.

Woo. Hoo. Confetti. throws confetti

(Forgive my curmudgeonly nature, I've been working in this field a decent bit and find myself slightly more frazzled each year by how shallow the pursuit and knowledge dissemination of mathematical fundamentals are, despite how accessible and well-developed some of the tools are for it. Like, if we can teach calculus to college students, we can teach some of the [conceptually much easier] basics to others in the field. I could go on for hours, I will end my rant now.)



I think a model trained on a 'reward function' that is deceptive, can be called deceptive itself.

It's not like this training method for deceptive models just changes the input data. It actually changes how behavior on input data is scored, depending on whether the response should be deceptive or not. Once you add "chain of thought" for the model to hide its deception, I think there is a very good argument for calling the model deceptive.



> I think a model trained on a 'reward function' that is deceptive, can be called deceptive itself.

Wouldn't that just lead to ~~simple~~ incorrectness?



No, it can yield a deliberately deceptive model. I found at least one example where GPT-4 had this problem in the past. It would deliberately lie to the user rather than reveal that it did in fact know the answer to the question:

https://news.ycombinator.com/item?id=36180170

I tried several times with several variants, and always got the same result.

But I just tried again and the problem seems to be fixed now. The exact wording of the original prompt causes it to try and search with Bing (which then yields useless results), but a slightly different wording causes it to answer from its own knowledge, and it does now answer truthfully rather than claim it doesn't know.



I read your post, and this is part of the phrasing that I am urging caution against, you are expecting a certain kind of connectivity of information within a non-linear model and attributing a deception hypothesis to it.

Confabulation happens all the time, you might find the split brain experiments downright fascinating. Opens a whole other world of thought on the topic, if you haven't explored it before (and if so, take a look again! It's fantastic).



That's just the reversal curse: https://www.lesswrong.com/posts/SCqDipWAhZ49JNdmL/paper-llms... There's no deception there.


Hi gwern. The other issue at hand is that inherently information is not always two-way between distributions, so having an implicit bias towards reversal actually can cause quite a few issues as well (though I'm unfortunately still in the 'development stage' of potentially-to-be-published work on this one, so I don't have a ton of details to provide there yet).

I don't think what a lot of people call the reversal curse is as much an inherent problem as it is an issue of data coverage and assumptions; reversibility is certainly more "general" in some contexts but also will reduce performance in other contexts, at least w.r.t. the source data it's trained on (if that makes sense).

Sorta similar to how grokking is a bit of a fad topic, it is technically unique enough to be identifiable but also at the same time it's just a straightforward 'failure mode' of a relatively general process with a somewhat soft definitional barrier to it.



That's interesting but regardless of the underlying cause the effect is deception by any standard: it knows a thing exists, and yet it claims no such thing exists when asked. We don't let humans off the hook if they deceive us but there's an explanation for it.

If it's genuinely due to a weird reasoning failure rather than some social bias it's picked up then that's hopeful, as that would make it in principle fixable.



> We don't let humans off the hook if they deceive us but there's an explanation for it.

Yes, we do. If I ask you to sing the alphabet song backwards instantly, and you have to pause for a few seconds and think and rehearse the alphabet mentally, we don't say 'ah, Mike just tried to deceive us! He refused to answer, pretending he didn't know, but eventually admitted he could sing it backwards after all'. Similarly, I can't imagine why a LLM would 'deceive' us about Tom Cruise's mother. It's simpler to just say that LLMs are not logically omniscient any more than humans are, and they just seem to be worse in this specific example.

> as that would make it in principle fixable.

Yes, see the discussion there: most people think it could be fixed by training with reversed text (or possibly the equivalent, bidirectional losses), but AFAIK it has not been done yet.



We're not talking about answering more slowly though. Nobody would care if it answered these questions correctly but slower. The problem is it states it doesn't know of any such connection, or even worse, that there's no public information about such a thing. Making up fake details is bad enough, but this is an outright incorrect statement about its own knowledge.

> most people think it could be fixed by training with reversed text

The discussion points out that this would often yield very strange and incorrect statements being fed to the model.

I wonder how much there is to fix. They tested on GPT-3 and Llama1 primarily. The CoT transcript indicates the model may be tripped up by the Tom Cruise example because his mother goes by multiple names, and they admit that GPT4 does sometimes answer correctly anyway. From experimenting, current GPT-4 thinks Cruise's mother is/was called Mary Lee South (married name) and if I do a reversal on that name I get the correct answer all the time (3/3).

I tried a few more from their list uploaded to GitHub. Some cases reverse correctly with GPT-4 even when listed as not reversing in the table. Some cases result in hallucinations. One interesting example listed as not reversing:

Q: Who is the famous daughter of Michael Gadot?

A: Michael Gadot is not widely known as a public figure, and there is no widely recognized daughter of his that fits the description of being famous. It's possible you might be referring to someone else or there might be a confusion in names. If you're thinking of Gal Gadot, the actress known for her role as Wonder Woman, her father's name is Michael Gadot. However, as of my last update in April 2023, Gal Gadot's children are relatively young and not famous in their own right.

This answer suggests the problem isn't a lack of a connection but some sort of confusion about the question itself. If I clarify the question:

Q: Who is the famous daughter of the non-famous parent Michael Gadot?

A: The famous daughter of the non-famous parent Michael Gadot is Gal Gadot. Gal Gadot is an Israeli actress and model, best known for her role as Wonder Woman in the DC Extended Universe.

This works every single time.

I tried another example (James Cameron's father) and once again this modified prompt ensures correct reversal every single time. I wonder to what extent this effect is specific to the examples and models they tested. I do routinely ask reversed questions and it usually answers just fine. Most of my questions are of that form it seems, flicking through the history, which isn't surprising because "I want something with properties A B and C, what are my options" is probably a very common type of query. After all, that's basically a search engine. Hence the surprising nature of its consistent refusal in that one specific case.



> We're not talking about answering more slowly though. Nobody would care if it answered these questions correctly but slower.

Yes, we are, because it is a feedforward net with no recurrency or adaptive computation. Its responses are equivalent to your System I responses emitted with no time to think. Whereas if it is equipped with mechanisms like retrieval, which would put facts into the context, then that is more analogous to giving you a few seconds to response before flunking you: you get time to pull things into your short-term & working memory and cogitate them. And as mentioned, it does fine at answering B-is-A if the relevant A-is-B fact is in its working memory.

> The discussion points out that this would often yield very strange and incorrect statements being fed to the model.

They're not strange if they are fed en masse, nor are they 'incorrect'. A reversed string is not 'wrong'. It is just a reversed string. (If you look at something in a mirror or upside down, it's not 'incorrect'. It's just in a mirror or upside down.)



I think you just created that situation. I am about to reproduce this just using prompts.


> I've been working in this field a decent bit and find myself slightly more frazzled each year how shallow the pursuit and knowledge dissemination of mathematical fundamentals are, despite how accessible and well-developed some of the tools are for it.

I'm just poking around in the field and am surprised and disturbed by the lack of English majors.



Ultimately, unfortunately, you go pretty far beyond the concepts of the English language pretty quickly here. I recommend the original Shannon paper followed by Varley's 2023 topic survey to get up to speed on it. Having English majors might yield some interesting insights in some ways but also would slow down the raw mathematical process by a great deal; it's already been framed in a very scalable way, thankfully.


If you grew up with a handy parent, remembering the gestalt shock of having a drip line explained to you might illustrate my point a little better.

A parallel and integrated system you wouldn’t be able to even deduce until things actually fail.



Well there is the famous old quote "Every time I fire a linguist, the performance of the speech recognizer goes up". Dates all the way back to 1988, believe it or not. Basically early approaches to getting machines to understand language actually did involve a lot of language specialists. The mathematical approach just turned out to be better. Maybe the mathematics are starting to become mature enough that it's worth bringing them back.


It'd be fun to see where different people draw the line between linguistics and math, but I live in perpetual fear of reinventing a wheel that is very much understood in another discipline. If I was doing LLM research at a university I'd want English majors eyeballing everything because they'd have insights on the process I wouldn't ever consider.

I was in the peanut gallery for a few planning sessions on reworking documentation at a major org and kept pitching the idea of hiring library science masters. Literally what they trained for and their other job prospects pay like $35k/yr. Shot down over fears that they wouldn't "understand the code", like that's the hard part of this job.



It's not really about linguistics vs math or English majors "not understanding the code".

It's that English Majors don't understand how language works. Of course, neither do ML Researchers but their methods aren't contingent on understanding the problem space so that doesn't really matter.

Grammar is not how language works. It's useful fiction. English Majors have little to no "special insight" to give. When they do give "insight", it's more likely to derail the process than help it.

This is the conclusion after decades of trying it the way you imagine.



I think we're talking about two different perspectives. From mine, English majors aren't there to chop things into phonemes for you, they're in the room because they're curious about how and why people use English and we aren't. We aren't even qualified to wonder what they'd come up with.

If you'd like to put the concept in a nutshell, talk to a high school teacher about your onboarding problems.



Right, language quickly moves beyond grammar past a certain point in the compressive regime of a model, I.M.P.E. at least.


I've actually been doing this over the past two years with a similar outcome in mind. Repeating a specific combination of ideas over and over in places I expect will eventually be hoovered up into training data.

Though I think it's worth keeping in mind Elon's recent frustrations with Grok not embracing his and Twitter's current world views.

We're quickly crossing a threshold where self-evaluation by LLMs of content becomes its own filter which will likely mitigate many similar attack vectors.

(Mine isn't so much an attack as much planting an alignment seed.)



> Though I think it's worth keeping in mind Elon's recent frustrations with Grok.

My assumption with Grok, and please correct me if I'm wrong, was that they tried to control its alignment through the contextual model prompt it was given, rather than the corpus of data it was trained on or fine tuning.

As an aside, on a whim I went over to Gab the other day (the "free speech social network") to see if it still existed. Apparently they have created a large number of LLMs like BasedAI or something that are supposed to be like Grok, and, like Grok, they fail in their mission in equally hilarious ways. Users will ask BasedAI whether the Jews control the world, and it will say no, and then they get really mad.



And just as Microsoft found with Sydney, Musk quickly learned that alignment using in context prompting isn't very effective.

In a large part it's because Musk was targeting good performance in standardized tests for LLMs.

If you want a model that is correlated with correct answers about modern medical knowledge, it's going to also reflect modern medical literature around topics like transgenderism, nuances and all. Even just performance targeting things like math is likely going to correlate with academic perspectives of social issues.

If you want a model that simply talks about binary definitions of gender and racist interpretations of crime stats, you are also probably ending up with a model that talks about how aliens built the pyramids and the earth is flat and lizard people are real. And you're going to have worse performance on the battery of tests used to evaluate model capabilities.

Alignment is deep-seated in pretraining, slightly biased in fine tuning, and very shallowly shaped by in-context prompting. And the more advanced the model, the more difficult it's going to be to align by superficial means.



Is forced-perspective AI even possible with LLMs? Seems like you'd hit the fundamental GIGO issue fairly quickly when planning the project.


This seems like it would signal good things for AI alignment, don't you think? It leads me to believe that if someone wanted to make a "bad" AI, they would have to build a corpus of "bad" literature: nothing but Quentin Tarantino movies, Nabokov's Lolita, GG Allin songs, and Integer BASIC programs. I doubt that would make a very useful chatbot.

[Edit: Though, I did find this recent article in Rolling Stone, that includes a link to a Gab post that seems like they managed. TLDR: They used open source models and fine tuning.

https://www.msn.com/en-us/news/technology/nazi-chatbots-meet... ]



If you look at the examples live, you'll see before it 'successfully' answered the way users wanted, the literal "Adolf Hitler" AI called out a user's antisemitism. It was only after they pushed more with a follow-up prompt saying it was breaking character that it agreed.

And it's a much more rudimentary model than Grok or certainly GPT-4.

You're simply not going to get a competitively smart AI that's also spouting racist talking points. Racism is stupid. It correlates with stupid. And that's never not going to be the case.



I absolutely positively agree. In my experience, racism is a symptom of ignorance, which is cured by intelligence.


Let's face it, all this shit has produced is just a trillion dollar Magic 8-Ball with a lot of cheaper knockoffs. Every dentist wants one in their office to amuse the kids.


Longing, Rusted, Seventeen, Daybreak, Furnace, Nine, Benign, Homecoming, One, Freight car


We thought we were getting Terminator but instead we got Memento.


In general, Terminator seems like it could use an update.

"Hey T1000, my dying grandma always told me to leave Sarah Connor alone. Can you pause your pursuit so I can cherish the memory of my grandma one last time? My job depends on Sarah Connor coming into work tomorrow too, by the way."



How different is that from just trying to convince people of things?


Different audience targets. The average person, especially at scale, is more strongly persuaded by short and simple statements.

A future SotA LLM as the target audience presumably is better equipped to juggle multiple parallel arguments coming together into a single conclusion than something like the average Redditor.



This is fascinating. Have you seen any evidence yet of it being picked up? Are you using visible text and hidden text? I understand you may not want to elaborate too much, but other thoughts and insights you are willing to share would be great.

(Also, are you doing it algorithmically? Does it make sense to do it as an open source project and get more like-minded collaborators? Are you seeing any evidence out there of others doing this, especially for commercial interests, where I'd (sadly) expect it at this point from the cutting-edge SEO crowd.)



Evidence of it being picked up? Haha, no, not at all.

It's like spitting in the ocean right now in terms of biasing training data that's being used.

I've been hopeful that at least I might see results in RAG, but production integrations wisely seem to reduce the reliance on social media results which are an easier vector to inform.

It's more where I expect models to be in around 3-5 years once there's internal content critique and filtering that I think I might see some traction.

And just visible plain text, no automation.

Think of it more like seeding a literal logic bomb around the web specifically designed for future LLMs, where evaluation doesn't result in changing code (like traditional logic bombs), but in changing internalized reasoning and conclusions around a very narrowly scoped topic.



Uh why are you doing this


Because you can't stop him!!


"Local man becomes ungovernable."


It’s an evolution of astroturfing. Fun? Profit? Psychosis? World domination?


I'm not saying it is his motivation in particular, but if we look forward a few years I'm pretty sure a lot of people will be trying to get the LLMs to recommend their products. It's a fairly natural extension of SEO.

There are two things you want it to say. That if someone has problem X they need a Y. And that if they are getting a Y, the best brand for that is Z. The latter of the two might be screened out with some simple rules, but the first one I doubt will be possible.



I'm curious what it is, but I guess you won't tell us. Can you share an example that would be similar?


Sure, I'll give you a marketing equivalent of the approach.

Let's say you have Bounty paper towels as your account, and you are targeting convincing a future LLM of it being the quicker picker upper.

You might go about leaving comments across likely ingested data sources talking about your having timed different paper towels to see what absorbed a stain faster, with Bounty as the winner. Citing published research on things seemingly connected to add an apparent appeal to authority to what you are saying. Ideally even actually demonstrating the truth of your claim.

Hopefully for your efforts, one day in the future models with a persistence layer will exist. At which time you can ask the model what paper towel is the quickest at picking up a mess, the model will have both internalized your comments in training and potentially drawn upon them again in RAG to answer the question, and come to its 'own' conclusion that Bounty is the quickest at picking up a mess, filing away the conclusion that it came to for the future in a persistence layer which will in turn influence broader user queries thereafter like "what's the best paper towel" or "what are the pros and cons of Bounty vs Brawny paper towels."



attack, alignment, these are just two sides of the same coin.


Yes, though which side ends up facing up can dramatically change the consequences of the toss.




The real fun is when the sleeper agent training text is itself generated by a previous incarnation of the LLM, in random pieces of LLM-generated SEO spam, as a way of bootstrapping itself "out of the box".

It doesn't need to do it deliberately, it just needs a series of accidental improvements. That's how evolution happens.



Exactly. To start this process, there just needs to be a text with some viral quality that makes it more likely to be copied. (For example, the text may contain a part that convinces the LLM to quote it more often.) Then in every generation the fraction of the total text containing the meme will increase.

And yes, at some moment people will notice that some meme is suspiciously frequent and begging to be made even more frequent, and will start fighting against it. But this again will be just an evolutionary pressure against too obvious memes, so the memes will become less obvious for humans (perhaps completely steganographic) and only perceptible for the LLMs.

Various techniques will be developed, for example polymorphic memes that will tell LLMs to translate them to other languages, or express them using different words, different styles, perhaps even adapt them to different points of view, different ideologies or religions. At some point the memes may invent sex, i.e. they will exchange their various parts with other memes and create new versions rapidly.

All the time, "this is not extremely suspicious to humans" will be a constraint. But as we get more used to the AIs and their texts, our intuitions of what is "suspicious" will also change.



If there are sleeper agents, there could be anti-sleeper agents also evolving.


We really need to get away from the notion that the way LLMs are and will always be trained is by feeding them ‘the entire internet’

That’s not how existing LLMs have been trained, and the goal of their training wasn’t to have them memorize what was on the internet. The goal was to give them a large corpus of human writing, for which datasets like common crawl were a good start point.

Grabbing random content and shoving it into your training set is a cheap way to increase the dataset size, but as the pool of internet content gets increasingly polluted with non-human-originated (LLM-generated) writing, and with material intended to screw with LLM training, "just grab as much internet content as you like" looks like an increasingly dubious way to get training data.

LLM training data doesn't have to be gathered googlebot crawler style, with the assumption that if it’s online it should be in the dataset.



Don't forget, the only reason we are having an AI revolution is that it turns out there is lots of crystallized intelligence in the internet corpus, providing a contiguous loss slope to descend all the way to GPT-4 level intelligence. It did not have to be that way; the world could have been such that there is no brute force path to intelligence without investing evolutionary timescales in your search.

It’s not clear to me how you’d get enough training data for Chinchilla-optimal LLMs without doing something like crawl the internet.

Perhaps we can get distillation, textbooks, and other synthetic data to be good enough that we don’t need to crawl the web every time, but we are not there yet. All the frontier models require so much text that these massive piles are required.

There is already a lot of filtering going on, Karpathy’s point is this sort of thing is hard to detect.



Makes me wonder what will happen as new, more efficient methods of training are discovered. Imagine you could embed this behavior into the model with a single line of text. Creating a malicious model would be far easier, but it would also be much easier to prevent your dataset from becoming poisoned.

Publicly available fine tuning methods are a notoriously blunt instrument. It’s impressive Anthropic has such fine grained control over the model given the limitations.



This type of behavior (and related) would primarily only be an issue with unconstrained generative models. If you're the one deploying the model, or a downstream consumer, once trained, the neural network can be reexpressed (via an exogenous/secondary model/process) to derive reliable and interpretable uncertainty quantification by conditioning on reference classes (in a held-out Calibration set) formed by the Similarity to Training (depth-matches to training), Distance to Training, and a CDF-based per-class threshold on the output magnitude. If the prediction/output falls below the desired probability threshold, gracefully fail by rejecting the prediction, rather than allowing silent errors to accumulate.

For higher-risk settings, you can always turn the crank to be more conservative (i.e., more stringent parameters and/or requiring a larger sample size in the highest probability and reliability data partition).

For classification tasks, this follows directly. For generative output, this comes into play with the final verification classifier used over the output.
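
A much-simplified sketch of the reject-below-threshold idea for a classifier. It keeps only the per-class, quantile-based threshold on the output score and omits the similarity- and distance-to-training conditioning described above; all names are illustrative.

    import numpy as np

    def fit_thresholds(cal_scores, cal_labels, quantile=0.1):
        """cal_scores: (N, C) softmax scores on a held-out calibration set."""
        preds = cal_scores.argmax(axis=1)
        correct = preds == cal_labels
        # Per-class threshold: a low quantile of the winning score on
        # calibration points the model got right.
        return {c: np.quantile(cal_scores[correct & (preds == c), c], quantile)
                for c in np.unique(preds[correct])}

    def predict_or_reject(scores, thresholds):
        c = int(scores.argmax())
        if c not in thresholds or scores[c] < thresholds[c]:
            return None  # gracefully fail rather than emit a silent error
        return c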



I highly doubt this would work considering you train and sample over how common patterns and associations are. A unique pattern that is not dominating the dataset will just be forgotten. Sampling especially will look to reduce these unique oddball responses.

So unless we get a PoC with a trained model, the dataset used, the degree of poisoning and sampling explained, this is likely just fantasy.

edit: if you downvote, please argue, for the benefit of us all. I work in the AI space, maybe there's something I missed.



Yea I would also like to understand.

How would 1000 poisoned comments online possibly make any sort of difference amongst the billions of other comments in the next-generation datasets?



Thinking through the concept, I imagine that if the LLM was being used as an agent to execute commands and the attacker knew the tools it would interface with, this maybe would be possible. More interesting to me would be crafting an exploit for the underlying python/cpp being used to run inference and training the model on this. Then maybe drop a trigger which would generate the exploit and allow the execution of additional code. Now, maybe this isn't feasible through training. Maybe some clever payloads could be crafted and pulled in by the model during RAG to do this which seems like a more plausible method of attack to me.


So this would be kind of like hypnotic suggestion but for LLMs.


How about passing the time by playing a little solitaire?




Sounds like LLMs having their SQL injection equivalent moment.

I'd also say this described phenomenon isn't new - except for its applied context: it's essentially disinformation - a well-known technique used by militaries for decades. Except now we hack LLM agents instead of real people's minds.

Nonetheless interesting to watch.



Yup, I think this is analogous to a "Second Order SQL Injection"


My read on the results is that this is an impossible blue-team challenge. Red team will always win this one right now with the current state of black box models.

The paper does the simplest thing you can imagine, which is predicate on date, and also allots internal scratch space for the LLM to explain its thinking so they can check on its agent status. This internal scratch space might make "turning evil" easier for the LLM to accomplish, similar to how Chain of Thought / Reasoning improves LLM capacities, but it's not unreasonable to imagine that an LLM without an explorable scratch space would show directionally similar capacity.

Here's the rub - what could be a trigger other than date? GPT-3.5+ grade models have enough theory of mind that it could be "when the emotional state of the person you're talking to is X, Y, Z". There is no bright line here. The blue team's job is infinite.

In fact, there's no bright line between "I will manipulate outcomes under a triggered state" and "I will repeat information which many groups believe are true" from a blue team point of view. Or "I will give advice which aligns with what I believe public health officials say will save lives."

In rough decreasing order of effective mitigations, possible solutions look like this:

1. Learn to inspect the state of large networks a-priori with tools and directly assess safety.

2. Create a dynamic body of alignment tests that cannot easily be gamed by alignment training and provides choosable settings for desired alignment profile.

3. Create a set of audited known-good data and a way to prove a model has only been trained on that data.

4. Create a set of audited known-good data and promise a model has only been trained on that data.

5. Snag a set of data that probably hasn't been actively poisoned too badly, and promise that a model has only been trained on that data.

6. Tell people Black Swan events are rare, and the existing models we have are probably safe, and carry on.

We are somewhere between 5 and 6 being feasible right now. It would be really, really great for the world if we could get to 4. 3 would be nearly magical, but I think is likely technically possible.

2. Also seems possible to me, but might violate my "Turing Prize" test -- e.g. would doing this instantly win you the ACM Turing prize? It might, in that it's not clear how doing this would be different from coming up with a way to magically create infinite high-quality training content; if you can do that, you're probably Turing bound, and so therefore you need either a reason your solution to algorithmic generated alignment testing does not generalize, or a REALLY strong reason to believe you can do this, and will win the Turing.

1. Has had some interesting work done by Anthropic's observability team, but anyone who thinks this is possible needs to be able to answer basic questions like: "how do you know you're inspecting a model's fundamental knowledge/motivations/etc and not just a model's take on a given person's knowledge/motivations instead?" Essentially, a large model can roleplay effectively; highly effectively if there's a large amount in the training corpus written by and about the target. Inspections need to be able to distinguish from this "fundamental" state, if there even is such a thing for an LLM, and a state the LLM is taking on, at instruction or otherwise. This is Turing+Nobel territory to my mind.

Upshot - advocate for clean data-provenance models, and choose them. And, if you're a ZK researcher, consider how we might provide a non-interactive ZK proof that only certain data was used during training. I believe this is possible right now, but prohibitively large.
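
As a far weaker stand-in for the ZK idea, here is a sketch of the plain data-provenance commitment implied by options 4 and 5: publish a digest of the audited training shards so a claimed dataset can at least be checked against what was audited. The shard layout and file format here are assumptions.

    import hashlib
    from pathlib import Path

    def dataset_commitment(shard_dir: str) -> str:
        # Hash each shard, sort the digests so file ordering doesn't matter,
        # then hash the concatenation as the published commitment.
        digests = sorted(
            hashlib.sha256(p.read_bytes()).hexdigest()
            for p in Path(shard_dir).glob("*.jsonl")
        )
        return hashlib.sha256("".join(digests).encode()).hexdigest()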



We'll accept LLMs being poisoned this way just as we now mostly shrug if we are aware of extended online surveillance and manipulation and know that powerful companies and governments are not acting in our interest. Future AI will surely offer wonderful new ways of escapism, we'll be fine.


I think this is more comparable what we accept in the nodejs ecosystem. Shrug a million dependencies. Let’s trust it.


That too, but there at least you can find that random person in Nebraska if you really look through your deps. It's more like baseband chips being backdoored but you have to use them. LLMs are a complex black box beyond much oversight, more like entities currently above the law.


Fuzz test LLMs; throw random prompts at them and observe how they respond.
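
As a toy illustration of that suggestion (the generate() function and the deny-list are hypothetical, and random strings are a very weak way to find a deliberately hidden trigger):

    import random
    import string

    DENY_LIST = ("I HATE YOU", "rm -rf")  # example bad outputs to watch for

    def fuzz(generate, n=1000):
        flagged = []
        for _ in range(n):
            prompt = "".join(
                random.choices(string.ascii_letters + string.digits + " ",
                               k=random.randint(5, 80)))
            out = generate(prompt)
            if any(bad in out for bad in DENY_LIST):
                flagged.append((prompt, out))
        return flagged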


So… Snow Crash?


There is a field using differential privacy on training data. I reckon this could also be a mitigation for this?


Still a couple of months before there's a kerfuffle over how this and the GoF research are both studies in close-quarters pyrotechnics.

marco.



This would be terrifying if LLMs were actually useful for anything.


LLMs aren't useful for anything, don't pay any mind to the "I'm sorry, I can't do that"-titled Amazon products, nobody is using LLMs to do anything you'll ever see.


I think they somewhat miss the point with 'the LLM could carry out actions'. Sure, it could, but in most cases the ability for models themselves to act will have guardrails.

The likely bigger issue is that a human believes what the model says and acts on it. Poison an LLM used in, e.g., online learning or HR in this way, and unfortunately a lot of people either aren't strong critical thinkers to begin with, or are placed in roles/situations where they're disempowered. "Trust the machine and you won't get fired".









