| I feel the webpage strongly hints that sparse autoencoders were invented by OpenAI for this project.
Very weird that they don't cite the original work on the webpage itself and instead bury the source in their paper. |
| "We don't know how X works" literally means "we don't have models yet that can explain X's behavior".
TFA is about making a tiny bit of progress towards such models. Perhaps you should read it. |
| LLMs aren't the only kind of AI, just one of the two current shiny kinds.
If a "cure for cancer" (cancer is not just one disease so, unfortunately, that's not even as coherent a request as we'd all like it to be) is what you're hoping for, look instead at the stuff like AlphaFold etc.: https://en.wikipedia.org/wiki/AlphaFold I don't know how to tell where real science ends and PR bluster begins in such models, though I can say that the closest I've heard to a word against it is "sure, but we've got other things besides protein folding to solve", which is a good sign. (I assume AlphaFold is also a mysterious black box, and that tools such as the one under discussion may help us demystify it too). |
| > Why would I want to use an AI to win a game of chess?
https://www.bbc.com/news/world-63043023
Though such a question misses the point: use the right tool for the job. (For a non-ML example, I don't know why 5/8" wrenches exist, but I'm confident of two things: (1) that's a very specific size, unlikely to be useful for a differently sized problem; (2) using one as a hammer would be sub-optimal.)
I'm not interested in creating a taxonomy of special-purpose AI models which each do one thing well and nothing else. What I can do is give a handful of famous examples, such as chess. Other choices at my disposal (purely off the top of my head and in no way systematic) include OCR used to read and hence sort post faster than any human, and more accurately than all but the very best. Or quality control in food processing (I passed up a student work placement for that 20 years ago). Or the entirety of Google search, Gmail's spam filters, Maps' route finding and at least some of its knowledge of house numbers, their CAPTCHA system, and Translate (the one system on this list which is fundamentally the same as an LLM). Or ANPR.
It's like saying "food is bad" because you don't like cabbage: the dislike is a valid preference, of course it is, but it doesn't lead to the conclusion. |
| > What, you aren’t allowed to own kitchen knives?
You're not allowed to be in possession of a knife in public without a good reason: https://www.gov.uk/buying-carrying-knives
You may think the UK government is nuts (I do, I left due to an unrelated law), but it is what it is.
> Or Google search somehow doesn’t return the chemical processes to make Sarin?
You're still missing the point of everything I've said if you think that's even a good rhetorical question. I have no idea if that's me giving bad descriptions, or you being primed with the exact false world model I'm trying to convince you to change from. Hill climbing sometimes involves going down one local peak before you can climb the global one.
Again, and I don't know how to make this clearer: I am not calling for an undifferentiated ban on all AI just because it can be used for bad ends, I'm saying that we need to figure out how to even tell which uses are the bad ones. Your original text was:
> We know exactly what the system is capable of doing. It’s capable of outputting tokens which can then be converted into text
Well, we know exactly what a knife is capable of doing. Does that knowledge mean we allow stabbing? Of course not! What's the AI equivalent of a stabbing? Nobody knows. |
| But even with the efforts so far, I don't think we have an understanding of how or why these emergent capabilities form. LLMs are as much of a black box as ever. |
| LLM-based AIs have lots of "features", which are roughly synonymous with "concepts" - these can be anything from `the concept of an apostrophe in the word don't`, to `"George Wash" is usually followed by "ington" in the context of early American History`. Inside the LLM's neural network, these are mapped to something like circuitry-in-software paths.
We don't really have a good way of understanding how these features are generated inside the LLM, how their circuitry is activated when outputting them, or why the LLM follows those circuits. Because of this, we don't have any way to debug this component of an LLM, which makes LLMs harder to improve. Similarly, if LLMs/AIs ever get advanced enough, we'll want to be able to identify whether they're being wilfully deceptive towards us, which we can't currently do. For these reasons, we'd like to understand what is actually happening in the neural network to produce and output concepts. This domain of research is usually referred to as "interpretability".
OpenAI (and also DeepMind and Anthropic) have found a few ways to inspect the inner circuitry of LLMs and reveal a handful of these features. They do this by asking questions of the model and inspecting which parts of the LLM's inner circuitry "light up". They then ablate (turn off) that circuitry to see whether those features become less frequent in the AI's responses, as a verification step. The graphs and highlighted words are visual representations of concepts they are reasonably certain about - for example, the concept of the word "AND" linking two parts of a sentence together highlights the word "AND".
Neel Nanda is the best source for this info if you're interested in interpretability (IMO it's the most interesting software problem out there at the moment), but note that his approach is different to OpenAI's methodology discussed in the post: https://www.neelnanda.io/mechanistic-interpretability |
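A minimal toy sketch of the read-then-ablate loop described above, with purely illustrative numbers: the sparse-autoencoder weights (`W_enc`, `W_dec`, `b_enc`) are random placeholders, and a real experiment would feed in hidden states captured from an actual model rather than a random vector.

```python
import numpy as np

# Hypothetical sizes; real sparse autoencoders use far more features.
d_model, n_features = 768, 4096
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(d_model, n_features)) * 0.02  # placeholder encoder weights
W_dec = rng.normal(size=(n_features, d_model)) * 0.02  # placeholder decoder weights
b_enc = np.zeros(n_features)

def encode(h):
    """Feature activations for one hidden-state vector h (ReLU zeroes the rest)."""
    return np.maximum(h @ W_enc + b_enc, 0.0)

def decode(f):
    """Map feature activations back into model space."""
    return f @ W_dec

h = rng.normal(size=d_model)            # stand-in for a real hidden state
feats = encode(h)
top = np.argsort(feats)[-5:][::-1]      # the features that "light up" the most
print("most active features:", top)

# Ablation: zero out the top feature and see how far the reconstruction moves.
ablated = feats.copy()
ablated[top[0]] = 0.0
print("reconstruction shift:", np.linalg.norm(decode(feats) - decode(ablated)))
```

In the real workflow the ablation is applied inside the running model, and the check is whether the corresponding behaviour (e.g. the "AND" linking) actually becomes rarer in the outputs; the sketch above only shows the bookkeeping.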
| In general this is just copying work done by Anthropic, so there's nothing fundamentally new here.
What they have done here is to identify patterns internal to GPT-4 that correspond to specific identifiable concepts. The work was done by OpenAI's mostly dismantled safety team (it has the names of this team's recently departed co-leads, Ilya and Jan Leike, on it), so this is nominally being done for safety reasons: to be able to boost or suppress specific concepts from being activated when the model is running, such as Anthropic's demonstration of boosting their model's fixation on the Golden Gate Bridge: https://www.anthropic.com/news/golden-gate-claude
This kind of work would also seem to have potential functional uses as well as safety ones, given that it allows you to control the model in specific ways. |
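For a concrete sense of what boosting or suppressing a concept can mean mechanically, here is a toy sketch, not Anthropic's or OpenAI's actual implementation: a forward hook adds a scaled "feature direction" to one layer's output, nudging the model toward (or, with a negative scale, away from) that concept. The model, layer choice, and steering vector below are all made up; in the Golden Gate demo the direction came from a learned feature.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 64
# Stand-in network; a real experiment would hook a transformer block instead.
model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                      nn.Linear(d_model, d_model))

steering_vector = torch.randn(d_model)  # would come from a learned feature direction
alpha = 4.0                             # positive boosts the concept, negative suppresses it

def steer(module, inputs, output):
    # Returning a value from a forward hook replaces the layer's output.
    return output + alpha * steering_vector

handle = model[0].register_forward_hook(steer)
x = torch.randn(1, d_model)
steered = model(x)
handle.remove()
baseline = model(x)
print("output shift from steering:", (steered - baseline).norm().item())
```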
| Yes, thank you, I understand that part.
It's the "might" condition in the description that makes me think the results might not be exactly the same as what's used in the live models. |
| Up in arms for what reason? That neural networks aren't perfectly interpretable? That's the nature of huge amorphous deep networks. These feature extraction forays are a good step forward, though. |
| Basically everyone. You, and me, and Elon Musk, and EMPRESS, and my uncle who works for Nintendo. They're just hoping that AI training legally ignores copyright. |
| There is a very good argument to be made that training AI is fair use, as it is both transformative and does not compete with the original work. This has yet to be tested in court. |
| Curious what interesting ways you've used ChatGPT in... Yesterday I used it to test my pool levels (took a pic of a pool test strip I had lying around and told it the brand of the strip). |
| The worrying part is that the first concept they show/found in the doc is "human imperfection". Hope this is just a coincidence... |
| I think it's done on purpose; it's related to a very important point in understanding AI.
Humans aren't perfect, AI is trained by humans, therefore... |
> GPT-4 feature: ends of phrases related to price increases
and 2 of the 5 responses don't have any relation to an increase at all:
> Brent crude, fell 38 cents to $118.29 a barrel on the ICE Futures Exchange in London. The U.S. benchmark, West Texas Intermediate crude, was down 53 cents to $99.34 a barrel on the New York Mercantile Exchange. -- Ronald D. White Graphic: The AAA
and
> ,115.18. The record reflects that appellant also included several hand-prepared invoices and employee pay slips, including an allegedly un-invoiced laundry ticket dated 29 June 2013 for 53 bags of laundry weighing 478 pounds, which, at the contract price of $
I think I must be misunderstanding something. Why would this example (out of all the potential examples) be picked?