(comments)

Original link: https://news.ycombinator.com/item?id=41058107

Users compared several AI language models, namely Mistral Large 2, Llama 3.1 405B, and Claude, testing them on 5 prompts from their own chat history. They found Large 2 and Llama 405B comparable in performance and rated them similarly. Users expressed a desire for improvements in areas such as intelligence, longer context windows, native audio input, fewer refusals, faster processing, and higher output token limits. They pointed out formatting problems with multi-line code entered in the editor. Users also mentioned quirks related to the behavior of backtick code blocks in the editor and compared them with their experience on the OpenAI platform. In addition, users shared observations about human cognitive processes, noting differences in reading habits and errors caused by the proximity of words' internal representations, drawing a comparison with AI language models' limitations in recognizing small, frequent words. Finally, users raised the "strawberry problem," discussing the challenges posed by tokenization in AI language models, highlighting its impact on various tasks (such as moderation, parsing, and prose generation) and suggesting workarounds, such as using explicit delimiters for affected tasks. Users also explored the concept of chain of thought and its implications for AI language models, observing that they generally lack the ability to process input at the granularity of individual characters, which makes it difficult for them to count occurrences of a given character in a sequence, such as the number of "r"s in the word "strawberry."

Related articles

Original text


Links to chat with models that released this week:

Large 2 - https://chat.mistral.ai/chat

Llama 3.1 405b - https://www.llama2.ai/

I just tested Mistral Large 2 and Llama 3.1 405b on 5 prompts from my Claude history.

I'd rank as:

1. Sonnet 3.5

2. Large 2 and Llama 405b (similar, no clear winner between the two)

If you're using Claude, stick with it.

My Claude wishlist:

1. Smarter (yes, it's the most intelligent, and yes, I wish it was far smarter still)

2. Longer context window (1M+)

3. Native audio input including tone understanding

4. Fewer refusals and less moralizing when refusing

5. Faster

6. More tokens in output



It seems to be the way with these releases; I'm sticking with Claude, at least for the 'hard' tasks. In my agent platform I have LLMs assigned to easy/medium/hard categorised tasks, which was somewhat inspired by the Claude 3 release with Haiku/Sonnet/Opus. GPT-4o mini has bumped Haiku for the easy category for now. Sonnet 3.5 bumped Opus for the hard category, so I could possibly downgrade the medium tasks from Sonnet 3.5 to Mistral Large 2 if the price is right on the platforms, with only 123B params compared to 405B. I was surprised how much Llama 3 405B costs on together.ai: $5/mil for input/output! I'll stick to Sonnet 3.5. Then I was also surprised how much cheaper Fireworks was at $3/mil.

Gemini has two aces up its sleeve now: the long context and the context caching for 75% reduced input token cost. I was looking at the "Improving Factuality and Reasoning in Language Models through Multiagent Debate" paper the other day, and thought Gemini would have a big cost advantage implementing this technique with context caching. If only Google could get their model up to the level of Anthropic's.
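
A minimal sketch of the easy/medium/hard task routing described above, assuming a plain mapping from difficulty to a hosted model ID; the model names are illustrative placeholders, not the commenter's actual configuration.

```python
# Hypothetical difficulty-based router; swap the model IDs for whatever the
# platforms you use actually expose (e.g. Mistral Large 2 for "medium" if
# the price works out).
MODEL_BY_DIFFICULTY = {
    "easy": "gpt-4o-mini",                    # bumped Haiku
    "medium": "claude-3-5-sonnet-20240620",   # candidate to swap for Mistral Large 2
    "hard": "claude-3-5-sonnet-20240620",     # bumped Opus
}

def pick_model(difficulty: str) -> str:
    """Return the model ID assigned to an easy/medium/hard task."""
    return MODEL_BY_DIFFICULTY[difficulty]

print(pick_model("hard"))  # claude-3-5-sonnet-20240620
```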



Claude needs to fix their text input box. It tries to be so advanced that code in backticks gets reformatted, and when you copy it, the formatting is lost (even the backticks).



Odd, multiline backtick code works very well; I don't know why I thought that it was also broken.

When you type "test `foo` done" in the editor, it immediately changes `foo` into a wrapped element. When you then copy the text without submitting it, then the backticks are lost, losing the inline-code formatting. I thought that this could also happen to multiline code. Somehow it does.

Type the following:

   Test: ```
   def foo():
      return bar
   ```
Delete that and type
   Test:
   ```
      def foo():
         return bar
   ```
   done
In the first case, the ``` in the line "Test: ```" does not open the code block; that happens with the second set of backticks. Maybe that's the way markdown works.

In the second case, all behaves normally, until you try to copy what you just wrote into the clipboard. Then you end up with

   Test:

   def foo():
      return bar

   done
Ok, only the backticks are lost but the formatting is preserved.

I think I have been trained by OpenAI to always copy what I submit before submitting, because it sometimes loses the submitted content, forcing me to re-submit.



Claude is truly incredible but I'm so tired of the JavaScript bloat everywhere. Just why. Both theirs and ChatGPTs UIs are hot garbage when it comes to performance (I constantly have to clear my cache and have even relegated them to a different browser entirely). Not everyone has an M4, and if we did - we'd probably just run our own models.



> Just why. Both theirs and ChatGPTs UIs are hot garbage when it comes to performance (...)

I have been using ChatGPT and Claude for a while and I never noticed anything resembling a performance issue. Can you elaborate on what you perceived as being "hot garbage"?



None of the 3 models you ranked can get "how many r's are in strawberry?" correct. They all claim 2 r's unless you press them. With all the training data, I'm surprised none of them has fixed this yet.



Tokenization makes it hard for it to count the letters; that's also why, if you ask it to do maths, writing the number in letters will yield better results.

For strawberry, it sees it as [496, 675, 15717], which is str aw berry.

If you insert characters to break the tokens down, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y"?

> There are 3 'r's in "s"t"r"a"w"b"e"r"r"y".
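
A small sketch of what the comment describes, assuming the OpenAI tiktoken package; the exact token IDs and splits differ per model (the [496, 675, 15717] above are presumably from Mistral's tokenizer), but the shape of the problem is the same.

```python
import tiktoken

# GPT-4-style tokenizer; Mistral/Llama use different vocabularies and IDs.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("strawberry")
print(ids, [enc.decode([i]) for i in ids])  # token IDs and the subword pieces they decode to; no per-letter view

# The delimiter trick from the comment: quoting each letter forces the
# tokenizer to emit roughly one token per character.
spelled = '"' + '"'.join("strawberry") + '"'
print(spelled, enc.encode(spelled))  # "s"t"r"a"w"b"e"r"r"y" and many more tokens
```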



> If you insert characters to break the tokens down, it finds the correct result: how many r's are in "s"t"r"a"w"b"e"r"r"y"?

The issue is that humans don't talk like this. I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.



Humans also constantly make mistakes that are due to proximity in their internal representation. "Could of"/"Should of" comes to mind: the letters "of" have a large edit distance from "'ve", but their pronunciation is very similar.

Native speakers especially are prone to this mistake, as they grew up learning English as illiterate children, from sounds only, compared to how most people learning English as a second language do it, together with the textual representation.

Psychologists use this trick as well to figure out internal representations, for example with the Rorschach test.

And probably, if you asked random people in the street how many p's there are in "Philippines", you'd also get lots of wrong answers. It's tricky due to the double p and the initial p being part of an f sound. The demonym uses "F" as the first letter, and in many languages, Spanish for example, the country name also uses an F.



> I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

No, I would actually be pretty confident you don’t ask people that question… at all. When is the last time you asked a human that question?

I can’t remember ever having anyone in real life ask me how many r’s are in strawberry. A lot of humans would probably refuse to answer such an off-the-wall and useless question, thus “failing” the test entirely.

A useless benchmark is useless.

In real life, people overwhelmingly do not need LLMs to count occurrences of a certain letter in a word.



Count the number of occurrences of the letter e in the word "enterprise".

Problems can exist as instances of a class of problems. If you can't solve a problem, it's useful to know if it's a one off, or if it belongs to a larger class of problems, and which class it belongs to. In this case, the strawberry problem belongs to the much larger class of tokenization problems - if you think you've solved the tokenization problem class, you can test a model on the strawberry problem, with a few other examples from the class at large, and be confident that you've solved the class generally.

It's not about embodied human constraints or how humans do things; it's about what AI can and can't do. Right now, because of tokenization, things like understanding the number of Es in strawberry are outside the implicit model of the word in the LLM, with downstream effects on tasks it can complete. This affects moderation, parsing, generating prose, and all sorts of unexpected tasks. Having a workaround like forcing the model to insert spaces and operate on explicitly delimited text is useful when affected tasks appear.
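
The comment above names the workaround; here is a tiny sketch of it. `delimit` is a hypothetical helper (not from any library), and the delimiter that works best can vary by tokenizer.

```python
def delimit(word: str, sep: str = " ") -> str:
    """Spell a word out with explicit separators so each character
    becomes its own token for the model."""
    return sep.join(word)

# Preprocess only the spans that need character-level work before prompting.
prompt = f'How many r\'s are in "{delimit("strawberry")}"? Keep a running count.'
print(prompt)  # How many r's are in "s t r a w b e r r y"? Keep a running count.
```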



> Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

We are also not exactly looking letter by letter at everything we read.



Just because we normally aren't reading letter by letter, it doesn't mean we can't. We can recognize common words on sight, ignoring minor variations, because we've seen the words thousands or millions of times, but that doesn't somehow disable the much less frequently used ability to approach a brand new word.



I think that humans indeed identify words as a whole and do not read letter by letter.

However, this implies you need to know the word to begin with.

I can write "asdf" and you might be oblivious to what I mean. I can mention "adsf" to a JavaScript developer and he will immediately think of the tool versioning tool. Because context and familiarity is important.



I believe it's a bit more nuanced than that. Short ubiquitous words like "and" or "the" we instantly recognize at a glance, but long unfamiliar or rarer words we read from the beginning, one syllable or letter at a time, until pattern recognition from memory kicks in. All unconsciously, unless the word is so odd, out of place, misspelled, or unknown that it comes to conscious awareness and interrupts our reading.



Humans also would probably be very likely to guess 2 r's if they had never seen any written words or had the word spelled out to them as individual letters before, which is kind of close to how language models treat it, despite being a textual interface.



>I don't ask someone how many r's there are in strawberry by spelling out strawberry, I just say the word.

You don't ask a human being how many r's there are in strawberry at all. The only reason you or anyone else asks that question is because it's an interesting quirk of how LLMs work that they struggle to answer it in that format. It's like an alien repeatedly showing humans an optical illusion that relies on the existence of our (literal) blind spot and using it as evidence of our supposed lack of intelligence.



Humans do chain-of-thought.

User: Write “strawberry” one letter at a time, with a space between each letter. Then count how many r’s are in strawberry.

gpt-3.5-turbo: ASSISTANT s t r a w b e r r y

There are 2 r's in strawberry.

After some experimenting, it seems like the actual problem is that many LLMs can’t count.

User: How many r’s are in the following sequence of letters:

S/T/R/A/W/B/E/R/R/Y

gpt-4o-mini: In the sequence S/T/R/A/W/B/E/R/R/Y, there are 2 occurrences of the letter "R."

Oddly, if I change a bunch of the non-R letters, I seem to start getting the right answer.



> This is only an issue if you send commands to a LLM as you were communicating to a human.

Yes, it's an issue. We want the convenience of sending human-legible commands to LLMs and getting back human-readable responses. That's the entire value proposition lol.



Far from the entire value proposition. Chatbots are just one use of LLMs, and not the most useful one at that. But sure, the one "the public" is most aware of. As opposed to "the hackers" that are supposed to frequent this forum. LOL



It's like showing someone a color and asking how many letters it has. 4... 3? Blau, blue, azul, blu: the color holds the meaning and the words all map back to it.

In the model the individual letters hold little meaning. Words are composed of letters simply because we need some sort of organized structure for communication that helps represent meaning and intent. Just like our color blue/blau/azul/blu.

Not faulting them for asking the question but I agree that the results do not undermine the capability of the technology. In fact it just helps highlight the constraints and need for education.



How is a layman supposed to even know that it's testing on that? All they know is it's a large language model. It's not unreasonable they should expect it to be good at things having to do with language, like how many letters are in a word.

Seems to me like a legit question for a young child to answer or even ask.



> How is a layman supposed to even know that it's testing on that?

They're not, but laymen shouldn't think that the LLM tests they come up with have much value.



I'm saying a layman or say a child wouldn't even think this is a "test". They are just asking a language model a seemingly simple language related question from their point of view.



Laymen and children shouldn't use LLMs.

They're pointless unless you have the expertise to check the output. Just because you can type text in a box doesn't mean it's a tool for everybody.



It doesn’t test “on tokenization” though. What happens when an answer is generated is a few abstraction levels deeper than tokens. A “thinking” “slice” of an LLM is completely unaware of tokens as an immediate part of its reasoning. The question just shows a lack of systemic knowledge about strawberry as a word (which isn’t surprising, tbh).



If I show you a strawberry and ask how many r’s are in the name of this fruit, you can tell me, because one of the things you know about strawberries is how to spell their name.

Very large language models also “know” how to spell the word associated with the strawberry token, which you can test by asking them to spell the word one letter at a time. If you ask the model to spell the word and count the R’s while it goes, it can do the task. So the failure to do it when asked directly (how many r’s are in strawberry) is pointing to a real weakness in reasoning, where one forward pass of the transformer is not sufficient to retrieve the spelling and also count the R’s.



Sure, that's a different issue. If you prompt in a way to invoke chain of thought (e.g. what humans would do internally before answering) all of the models I just tested got it right.



The thinking part of a model doesn’t know about tokens either. Like a regular human few thousand years ago didn’t think of neural impulses or air pressure distribution when talking. It might “know” about tokens and letters like you know about neurons and sound, but not access them on the technical level, which is completely isolated from it. The fact that it’s a chat of tokens of letters, which are a form of information passing between humans, is accidental.



I hear this a lot, but there are vast sums of money thrown at the cases where a model fails strawberry-style problems.

Think about math and logic. If a single symbol is off, it’s no good.

At my work, a prompt where we can generate a single tokenization error, by my very rough estimates, generates 2 man-hours of work. (We search for incorrect model responses, get them to correct themselves, and if they can’t after trying, we tell them the right answer and edit it for perfection.) Yes, even for counting occurrences of characters. Think about how applicable that is. Finding the next term in a sequence, analyzing strings, etc.



> Think about math and logic. If a single symbol is off, it’s no good.

In that case the tokenization is done at the appropriate level.

This is a complete non-issue for the use cases these models are designed for.



But we don’t restrict it to math or logical syntax. Any prompt across essentially all domains. The same model is expected to handle any kind of logical reasoning that can be brought into text. We don’t mark it incorrect if it spells an unimportant word wrong, however keep in mind the spelling of a word can be important for many questions, for example—off the top of my head: please concatenate “d”, “e”, “a”, “r” into a common English word without rearranging the order. The types of examples are endless. And any type of example it gets wrong, we want to correct it. I’m not saying most models will fail this specific example, but it’s to show the breadth of expectations.



> that has nothing to do with their intelligence.

Of course. Because these models have no intelligence.

Everyone who believes they do seems to believe intelligence derives from being able to use language, however, and not being able to tell how many times the letter r appears in the word strawberry is a very low bar to not pass.



If you give LLMs the letters one at a time they often count them just fine, though Claude at least seems to need to keep a running count to get it right:

"How many R letters are in the following? Keep a running count. s t r a w b e r r y"

They are terrible at counting letters in words because they rarely see them spelled out. An LLM trained one byte at a time would always see every character of every word and would have a much easier time of it. An LLM is essentially learning a new language without a dictionary, of course it's pretty bad at spelling. The tokenization obfuscates the spelling not entirely unlike how verbal language doesn't always illuminate spelling.



Might the effect you see, when you spell it out, be not a result of “seeing” tokens, but a result of the fact that the model learned – at a higher level – how lists in text can be summarized, summed up, filtered and counted?

In other words, what makes you think that it’s exactly letter-tokens that help it, and not the high-level concept of spelling things out itself?



Surfacing and underscoring obvious failure cases for general "helpful chatbot" use is always going to be valuable because it highlights how the "helpful chatbot" product is not really intuitively robust.

Meanwhile, it helps make sure engineers and product designers who want to build a more targeted product around LLM technology know that it's not suited to tasks that may trigger those kinds of failures. This may be obvious to you as an engaged enthusiast or cutting edge engineer or whatever you are, but it's always going to be new information to somebody as the field grows.



I don’t know anything about LLMs beyond using ChatGPT and Copilot… but unless because of this lack of knowledge I am misinterpreting your reply - it sounds as if you are excusing the model giving a completely wrong answer to a question that anyone intelligent enough to learn alphabet can answer correctly.



The problem is that the model never gets to see individual letters. The tokenizers used by these models break up the input in pieces. Even though the smallest pieces/units are bytes in most encodings (e.g. BBPE), the tokenizer will cut up most of the input in much larger units, because the vocabulary will contain fragments of words or even whole words.

For example, if we tokenize "Welcome to Hacker News, I hope you like strawberries.", the Llama 405B tokenizer will tokenize this as:

    Welcome Ġto ĠHacker ĠNews , ĠI Ġhope Ġyou Ġlike Ġstrawberries .
(Ġ means that the token was preceded by a space.)

Each of these pieces is looked up and encoded as its index in the vocabulary. Adding a special token for the beginning of the text, this gives:

    [128000, 14262, 311, 89165, 5513, 11, 358, 3987, 499, 1093, 76203, 13]
So, all the model sees for 'Ġstrawberries' is the number 76203 (which is then used in the piece embedding lookup). The model does not even have access to the individual letters of the word.

Of course, one could argue that the model should be fed with bytes or codepoints instead, but that would make them vastly less efficient with quadratic attention. Though machine learning models have done this in the past and may do this again in the future.

Just wanted to finish off this comment by saying that a word might be provided to the model split into pieces if the word itself is not in the vocabulary. For instance, the same sentence translated to my native language is tokenized as:

    Wel kom Ġop ĠHacker ĠNews , Ġik Ġhoop Ġdat Ġje Ġvan Ġa ard be ien Ġh oud t .
And the word for strawberries (aardbeien) is split, though still not into letters.
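
For anyone who wants to reproduce the walk-through above, here is a rough sketch assuming the Hugging Face `transformers` package and access to a Llama 3.1 tokenizer; the checkpoint name is an assumption (the repo is gated), and any Llama-3-family tokenizer shares the same vocabulary.

```python
from transformers import AutoTokenizer

# Assumed checkpoint name; requires accepting the Llama licence on the Hub.
tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-405B")

text = "Welcome to Hacker News, I hope you like strawberries."
print(tok.tokenize(text))  # pieces such as 'Welcome', 'Ġto', ..., 'Ġstrawberries', '.'
print(tok.encode(text))    # the integer IDs the model actually sees, with the beginning-of-text token prepended
```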


The thing is, how the tokenizing works is about as relevant to the person asking the question as the name of the cat of the delivery guy who delivered the GPU that the LLM runs on.



How the tokenizer works explains why a model can’t answer the question; the name of the cat doesn’t explain anything.

This is Hacker News, we are usually interested in how things work.



Indeed, I appreciate the explanation, it is certainly both interesting and informative to me, but to somewhat echo the person you are replying to - if I wanted a boat, and you offer me a boat, and it doesn’t float - the reasons for failure are perhaps full of interesting details, but perhaps the most important thing to focus on first - is to make the boat float, or stop offering it to people who are in need of a boat.

To paraphrase how this thread started - it was someone testing different boats to see whether they can simply float - and they couldn’t. And the reply was questioning the validity of testing boats whether they can simply float.

At least this is how it sounds to me when I am told that our AI overlords can’t figure out how many Rs are in the word “strawberry”.



At some point you need to just accept the details and limitations of things. We do this all the time. Why is your calculator giving only approximate result? Why can't your car go backwards as fast as forwards? Etc. It sucks that everyone gets exposed to the relatively low level implementation with LLM (almost the raw model), but that's the reality today.



People do get similarly hung up on surprising floating point results: why can't you just make it work properly? And a full answer is a whole book on how floating point math works.



The test problem is emblematic of a type of synthetic query that can fail but is of limited import in actual usage.

For instance you could ask it for a JavaScript function to count any letter in any word and pass it r and strawberry and it would be far more useful.

Having edge cases doesn't mean it's not useful; it is neither a free assistant nor a coder who doesn't expect a paycheck. At this stage it's a tool that you can build on.

To engage with the analogy. A propeller is very useful but it doesn't replace the boat or the Captain.



Does not seem to work universally. Just tested a few with this prompt:

"create a javascript function to count any letter in any word. Run this function for the letter "r" and the word "strawberry" and print the count"

ChatGPT-4o => Output is 3. Passed

Claude3.5 => Output is 2. Failed. Told it the count is wrong. It apologised and then fixed the issue in the code. Output is now 3. Useless if the human does not spot the error.

llama3.1-70b(local) => Output is 2. Failed.

llama3.1-70b(Groq) => Output is 2. Failed.

Gemma2-9b-lt(local) => Output is 2. Failed.

Curiously all the ones that failed had this code (or some near identical version of it):

```javascript
function countLetter(letter, word) {
  // Convert both letter and word to lowercase to make the search case-insensitive
  const lowerCaseWord = word.toLowerCase();
  const lowerCaseLetter = letter.toLowerCase();

  // Use the split() method with the letter as the separator to get an array of substrings separated by the letter
  const substrings = lowerCaseWord.split(lowerCaseLetter);

  // The count of the letter is the number of splits minus one (because there are n-1 spaces between n items)
  return substrings.length - 1;
}

// Test the function with "r" and "strawberry"
console.log(countLetter("r", "strawberry")); // Output: 2
```



Claude isn't actually running console.log(); it produced correct code.

This prompt "please write a javascript function that takes a string and a letter and iterates over the characters in a string and counts the occurrences of the letter"

Produced a correct function from both ChatGPT-4o and Claude 3.5 for me.
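
For what it's worth, a quick check (a Python sketch mirroring the quoted JavaScript) shows the split-based code itself is correct; the "// Output: 2" comment is the models' claim, not the result of actually running it.

```python
def count_letter(letter: str, word: str) -> int:
    # Case-insensitive count, same idea as the JS version.
    return word.lower().count(letter.lower())

print(count_letter("r", "strawberry"))   # 3
print(len("strawberry".split("r")) - 1)  # 3 as well: the split() trick works
```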



It is however a highly relevant thing to be aware of when evaluating an LLM for 'intelligence', which was the context this was brought up in.

Without looking at the word 'strawberry', or spelling it one letter at a time, can you rattle off how many letters are in the word off the top of your head? No? That is what we are asking the LLM to do.



How can I know whether any particular question will test a model on its tokenization? If a model makes a boneheaded error, how can I know whether it was due to lack of intelligence or due to tokenization? I think finding places where models are surprisingly dumb is often more informative than finding particular instances where they seem clever.

It's also funny, since this strawberry question is one where a model that's seriously good at predicting the next character/token/whatever quanta of information would get it right. It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.



> How can I know whether any particular question will test a model on its tokenization?

Does something deal with separate symbols rather than just meaning of words? Then yes.

This affects spelling, math (value calculation), logic puzzles based on symbols. (You'll have more success with a puzzle about "A B A" rather than "ABA")

> It requires no reasoning, and is unlikely to have any contradicting text in the training corpus.

This thread contains contradictions. Every other announcement of an llm contains a comment with a contradicting text when people post the wrong responses.



It's not very interesting when they fail at it, but it will be interesting if they get good at it.

Also there are some cases where regular people will stumble into it being awful at this without any understanding why (like asking it to help them with their wordle game.)



Call me when models understand when to convert the token into actual letters and count them. Can’t claim they’re more than word calculators before that.



That's misleading.

When you read and comprehend text, you don't read it letter by letter, unless you have a severe reading disability. Your ability to comprehend text works more like an LLM.

Essentially, you can compare the human brain to a multi-model or modular system. There are layers or modules involved in most complex tasks. When reading, you recognize multiple letters at a time [1], and those letters are essentially assembled into tokens that a different part of your brain can deal with.

Breaking down words into letters is essentially a separate "algorithm". Just like your brain, it's likely to never make sense for a text comprehension and generation model to operate at the level of letters - it's inefficient.

A multi-modal model with a dedicated model for handling individual letters could easily convert tokens into letters and operate on them when needed. It's just not a high priority for most use cases currently.

[1] https://www.researchgate.net/publication/47621684_Letters_in...



I agree completely, that wasn’t the point though: the point was that my 6 yo knows when to spell the word when asked and the blob of quantized floats doesn’t, or at least not reliably.

So the blob wasn’t trained to do that (yeah, low utility, I get that) but it also doesn’t know it doesn’t know, which is another, much bigger and still unsolved problem.



The model communicates in a language, but our letters are not necessary for that and are in fact not part of the English language. You could write English using per-word pictographs and it would still be the same English and the same information/message. It's like asking you if there is a '5' in 256 when you read binary.



Is anyone in the know, aside from mainstream media (god forgive me for using this term unironically) and civilians on social media, claiming LLMs are anything but word calculators?

I think that's a perfect description by the way, I'm going to steal it.



I think it's a very poor intuition pump. These 'word calculators' have lots of capabilities not suggested by that term, such as a theory of mind and an understanding of social norms. If they are a "merely" a "word calculator", then a "word calculator" is a very odd and counterintuitively powerful algorithm that captures big chunks of genuine cognition.



They’re trained on the available corpus of human knowledge and writings. I would think that the word calculators have failed if they were unable to predict the next word or sentiment given the trillions of pieces of data they’ve been fed. Their training environment is literally people talking to each other and social norms. Doesn’t make them anything more than p-zombies though.

As an aside, I wish we would call all of this stuff pseudo intelligence rather than artificial intelligence



> If you believe this, you don't understand how LLMs work.

Nor do they understand how intelligence works.

Humans don't read text a letter at a time. We're capable of deconstructing words into individual letters, but based on the evidence that's essentially a separate "algorithm".

Multi-model systems could certainly be designed to do that, but just like the human brain, it's unlikely to ever make sense for a text comprehension and generation model to work at the level of individual letters.



I would counterargue with "that's the model's problem, not mine".

Here's a thought experiment: if I gave you 5 boxes and told you "how many balls are there in all of these boxes?" and you answered "I don't know because they are inside boxes", that's a fail. A truly intelligent individual would open them and look inside.

A truly intelligent model would (say) retokenize the word into its individual letters (which I'm optimistic they can) and then would count those. The fact that models cannot do this is proof that they lack some basic building blocks for intelligence. Model designers don't get to argue "we are human-like except in the tasks where we are not".



Of course they lack building blocks for full intelligence. They are good at certain tasks, and counting letters is emphatically not one of them. They should be tested and compared on the kind of tasks they're fit for, and so the kind of tasks they will be used in solving, not tasks for which they would be misemployed to begin with.



I agree with you, but that's not what the post claims. From the article:

"A significant effort was also devoted to enhancing the model’s reasoning capabilities. (...) the new Mistral Large 2 is trained to acknowledge when it cannot find solutions or does not have sufficient information to provide a confident answer."

Words like "reasoning capabilities" and "acknowledge when it does not have enough information" have meanings. If Mistral doesn't add footnotes to those assertions then, IMO, they don't get to backtrack when simple examples show the opposite.



It's not like an LLM is released with a hit list of "these are the tasks I really suck at." Right now users have to figure it out on the fly or have a deep understanding of how tokenizers work.

That doesn't even take into account what OpenAI has typically done to intercept queries and cover the shortcomings of LLMs. It would be useful if each model did indeed come out with a chart covering what it cannot do and what it has been tailored to do above and beyond the average LLM.



When using a prompt that involves thinking first, all three get it correct.

"Count how many rs are in the word strawberry. First, list each letter and indicate whether it's an r and tally as you go, and then give a count at the end."

Llama 405b: correct

Mistral Large 2: correct

Claude 3.5 Sonnet: correct
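
A minimal sketch of running that prompt programmatically, assuming the OpenAI Python SDK as the client; many hosts of Llama and Mistral expose OpenAI-compatible endpoints, and the model ID below is a placeholder for whichever provider you use.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY (or point base_url at a compatible host)

prompt = (
    "Count how many rs are in the word strawberry. First, list each letter "
    "and indicate whether it's an r and tally as you go, and then give a "
    "count at the end."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model ID
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```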



This reminds me of when I had to supervise outsourced developers. I wanted to say "build a function that does X and returns Y". But instead I had to say "build a function that takes these inputs, loops over them and does A or B based on condition C, and then return Y by applying Z transformation"

At that point it was easier to do it myself.



"What programming computers is really like."

EDIT: Although perhaps it's even more important when dealing with humans and contracts. Someone could deliberately interpret the words in a way that's to their advantage.



Imo it's impressive that any of this even remotely works, especially when you consider all the hacks like tokenization that I'd assume add layers of obfuscation.

There's definitely tons of weaknesses with LLMs for sure, but I continue to be impressed at what they do right - not upset at what they do wrong.



The technology is frustrating because (a) you never know what may require fixing, and (b) you never know if it is fixable by further instructions, and if so, by which ones. You also mostly* cannot teach it any fixes (as an end user). Using it is just exhausting.

*) that is, except sometimes by making adjustments to the system prompt



I think this particular example, of counting letters, is obviously going to be hard when you know how tokenization works. It's totally possible to develop an intuition for other times things will work or won't work, but like all ML powered tools, you can't hope for 100% accuracy. The best you can do is have good metrics and track performance on test sets.

I actually think the craziest part of LLMs is, as a developer or SME, just how much you can fix with plain English prompting once you have that intuition. Of course some things aren't fixable that way, but the mere fact that many cases are fixable simply by explaining the task to the model better in plain English is a wildly different paradigm! The jury is still out, but I think it's worth being excited about; I think that's very powerful since there are a lot more people with good language skills than there are Python programmers or ML experts.



The problem is that the models hallucinate too confidently. In this case it is quite amusing (I had llama3.1:8b tell me confidently it is 1, then revise to 2, then apologize again and give the correct answer). However, while it is obvious here, having it confidently make up supposed software features from thin air when asking for "how do I ..." is more problematic. The answers sound plausible, so you actually waste time verifying whether they work or are nonsense.



To be fair, I just asked a real person and had to go to even greater lengths:

Me: How many "r"s are in strawberry?

Them: What?

Me: How many times does the letter "r" appear in the word "strawberry"?

Them: Is this some kind of trick question?

Me: No. Just literally, can you count the "r"s?

Them: Uh, one, two, three. Is that right?

Me: Yeah.

Them: Why are you asking me this?



I look forward to the day when LLM refusal takes on a different meaning.

"No, I don't think I shall answer that. The question is too basic, and you know better than to insult me."



> In a park people come across a man playing chess against a dog. They are astonished and say: "What a clever dog!" But the man protests: "No, no, he isn't that clever. I'm leading by three games to one!"



To me it's just a limitation based on the world as seen by these models. They know there's a letter called 'r', they even know that some words start with 'r' or have r's in them, and they know what the spelling of some words is. But they've never actually seen one, as their world is made up entirely of tokens. The word 'red' isn't r-e-d but is instead like a pictogram to them. But they know the spelling of strawberry and can identify an 'r' when it's on its own and count those, despite not being able to see the r's in the word itself.



I think it's more that the question is not unlike "is there a double r in strawberry?' or 'is the r in strawberry doubled?'

Even some people will make this association, it's no surprise that LLMs do.



The great-parent demonstrates that they are nevertheless capable of doing so, but not without special instructions. Your elaboration doesn’t explain why the special instructions are needed.



It “works”, but the LLM having to use the calculator means the LLM doesn’t understand arithmetic well enough and doesn’t know how to use and follow a set of steps (an algorithm) natively to find the answer for big numbers.

I believe this could be fixed and is worth fixing, because it’s the only way LLMs will be able to help math and physics researchers write proofs and make real scientific progress.



Appending "Think step-by-step" is enough to fix it for both Sonnet and Llama 3.1 70B.

For example, the latter model answered with:

To count the number of Rs in the word "strawberry", I'll break it down step by step:

Start with the individual letters: S-T-R-A-W-B-E-R-R-Y

Identify the letters that are "R": R (first one), R (second one), and R (third one)

Count the total number of Rs: 1 + 1 + 1 = 3

There are 3 Rs in the word "strawberry".



Because it is a challenging task, you would need to define a prompt (or a set of prompts) that can precisely generate chain-of-thought prompts for the various generic problems the model encounters.

And sometimes CoT may not be the best approach. Depending on the problem other prompt engineering techniques will perform better.



Can’t you just instruct your llm of choice to transform your prompts like this for you? Basically feed it with a bunch of heuristics that will help it better understand the thing you tell it.

Maybe the various chat interfaces already do this behind the scenes?



Lots of replies mention tokens as the root cause and I’m not well versed in this stuff at the low level but to me the answer is simple:

When this question is asked (from what the models trained on) the question is NOT “count the number of times r appears in the word strawberry” but instead (effectively) “I’ve written ‘strawbe’, now how many r’s are in strawberry again? Is it 1 or 2?”.

I think most humans would probably answer “there are 2” if we saw someone was writing and they asked that question, even without seeing what they have written down. Especially if someone said “does strawberry have 1 or 2 r’s in it?”. You could be a jerk and say “it actually has 3” or answer the question they are actually asking.

It’s an answer that is _technically_ incorrect but the answer people want in reality.



Sometimes it does, sometimes it doesn't.

It is evidence that LLMs aren't appropriate for everything, and that there could exist something that works better for some tasks.



Language models are best treated like consciousness. Our consciousness does a lot less than people like to attribute to it. It is mostly a function of introspection and making connections, rather than the seat of higher-level reasoning or of the functions of the brain that tell your body how to stay alive (like beating your heart).

By allowing a language model to do function calling, you are essentially allowing it to do specialized "subconscious" thought. The language model becomes a natural language interface to the capabilities of its "subconsciousness".

A specific human analogy could be: I tell you to pick up a pen off of the table, and then you do it. Most of your mental activity would be subconscious, orienting your arm and hand properly to pick up the pen, actually grabbing the pen, and picking it up. The linguistic representation of the action would exist in your conscious mind (pick up the pen), but not much else.

A language model could very easily call out to a text processing function to correctly do things like count the number of r's in the word strawberry. That is a job that your conscious mind can dispatch to your subconsciousness.
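
A rough sketch of that "dispatch to the subconscious" idea using tool/function calling, assuming the OpenAI Python SDK; the tool schema and the single round-trip are simplified, and the model ID is a placeholder.

```python
import json
from openai import OpenAI

client = OpenAI()

def count_letter(letter: str, word: str) -> int:
    """The deterministic 'subconscious' text-processing function."""
    return word.lower().count(letter.lower())

tools = [{
    "type": "function",
    "function": {
        "name": "count_letter",
        "description": "Count how many times a letter appears in a word.",
        "parameters": {
            "type": "object",
            "properties": {
                "letter": {"type": "string"},
                "word": {"type": "string"},
            },
            "required": ["letter", "word"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder model ID
    messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
    tools=tools,
)

msg = resp.choices[0].message
if msg.tool_calls:
    # The model asked for the tool, so do the counting in plain code.
    args = json.loads(msg.tool_calls[0].function.arguments)
    print(count_letter(**args))  # 3, computed deterministically
else:
    print(msg.content)  # the model answered (or mis-answered) on its own
```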



This doesn't make any sense. The LLM knows the token-to-letter associations if you ask it. Once it has split the token into letters, it can answer the question anyway.

This tells us that the model is unaware of its own limitations, otherwise it would have figured out that this task needs it to split the word into letters first.



Well splitting into letters doesn't help with Mistral Large 2. I tried this with Mistral 2:

[Question: How many r's in strawberry? First split the word into letters, one letter on each line then answer the question.]

[Answer: Let's split the word "strawberry" into letters:

s t r a w b e r r y

Now, let's count the number of 'r's:

There are 2 'r's in the word "strawberry".]

[Question: Did you count the number of r's?]

[answer: Yes, I did. Here's the count:

r r

So, there are 2 'r's in the word "strawberry".]

I suspect the issue is these models have to be trained to accomplish tasks, and this is not a task it makes sense to train a model to perform. You might train it to do this with function calling one day, calling a python library to count the r's in a string, but actually manually counting r's just doesn't play to an LLM's strengths.



Does it? It knows how a word in some context, represented by tokens, splits into letters. It may or may not know the table. I wouldn’t trust what it says about word/token correspondence any more than in general.



I really wanted to believe this could be simply fixed by more prompt engineering and chain of thought, so I tried it with Claude 3.5 and … well, unfortunately still no luck. The reply is amusing - I’ll include it verbatim below.

Of note, this happened because earlier I had asked it without CoT and it incorrectly stated two “r”s. Once I started a new conversation (so fresh context window), it gave the right answer (3) and no longer claimed the “still the second r” as shown below.

Let's think through this step-by-step:

1. First, let's spell out "strawberry" letter by letter: s-t-r-a-w-b-e-r-r-y

2. Now, let's count the "r"s as we go through each letter:

    s (0)
    t (0)
    r (1) - First "r" found
    a (1)
    w (1)
    b (1)
    e (1)
    r (2) - Second "r" found
    r (2) - This is still the second "r"
    y (2)

3. After going through all the letters, we've counted 2 "r"s.

Therefore, the word "strawberry" contains 2 "r"s.



Works for me with Claude 3.5:

Me: How many Rs are in the word strawberry? Think step by step

Claude: Let's approach this step-by-step:

1. First, let's write out the word: strawberry

2. Now, let's go through each letter one by one:

    s - not an R
    t - not an R
    r - this is an R (count: 1)
    a - not an R
    w - not an R
    b - not an R
    e - not an R
    r - this is another R (count: 2)
    r - this is a third R (count: 3)
    y - not an R

3. After going through all the letters, we've counted 3 Rs.

Therefore, there are 3 Rs in the word "strawberry".



Yeah, I figured out that it didn’t work at first because I still had the previous attempt (where it got it wrong) in my conversation history. Starting with a fresh conversation gave me the correct answer. It was still funny to see it “rationalize” the wrong answer tho.



For these classes of problems that LLMs struggle with, a more reliable way to go about them seems to be to ask them to solve them using tools, e.g. writing and executing a Python script to count the "R"s.



Interestingly enough much simpler models can write an accurate function to give you the answer.

I think it will be a while before we get there. An LLM can look up knowledge but can't actually perform calculations itself without some external processor.



Why do we have to "get there?" Humans use calculators all the time, so why not have every LLM hooked up to a calculator or code interpreter as a tool to use in these exact situations?



I'm not surprised, because it's an issue with the fundamental design of the "pick words that tend to flow after the other words" machine.

Training data will only "fix" it in the shallow sense that it will have seen a comment like yours before. (As opposed to the deeper sense of "learning to count.")



How much do threads like this provide the training data to convince future generations that—despite all appearances to the contrary—strawberry is in fact spelled with only two R's?

I just researched "how many r's are in strawberry?" in a search engine, and based solely on the results it found, I would have to conclude there is substantial disagreement on whether the correct answer is two or three.



While strawberry can be attributed to tokenization, here is some other basic stuff I’ve seen language models fail at:

1. Play tic tac toe such that you never lose

2. Which is bigger 9.11 or 9.9

3. 4 digit multiplication even with CoT prompting



Does anyone have input on the feasibility of running an LLM locally and providing an interface to some language runtime and storage space, possibly via a virtual machine or container?

No idea if there's any sense to this, but an LLM could be instructed to formulate and continually test mathematical assumptions by writing / running code and fine-tuning accordingly.



Your question sounds like you want to know how the word is spelled, and no one would put two r’s in "straw", so the model could be assuming that you’re asking whether it’s strawbery or strawberry.

What happens if you ask the total number of occurrences of the letter r in the word? Does it still not get it right?



Tokenization can be bypassed like this (Mistral Large 2):

How many letters R are in the word "s-t-r-a-w-b-e-r-r-y"?

The word "s-t-r-a-w-b-e-r-r-y" contains three instances of the letter "R."

How many letters R contain the word strawberry?

The word "strawberry" contains two instances of the letter "R."



Plug in a code interpreter as a tool and the model will write Python or JavaScript to solve this and get it right 100% of the time. (Full disclosure: I work on a product called Riza that you can use as a code interpreter tool for LLMs)



I didn't even need to do that. 4o got it right straight away with just:

"how many r's are in strawberry?"

The funny thing is, I replied, "Are you sure?" and got back, "I apologize for the mistake. There are actually two 'r's in the word strawberry."



I kind of tried to replicate your experiment (in German, where "Erdbeere" has 4 E's), and it went the same way. The interesting thing was that after I pointed out the error I couldn't get it to doubt the result again. It stuck to the correct answer, which seemed kind of "reinforced".

It was also interesting to observe how GPT (4o) even tried to prove/illustrate the result typographically by placing the same word four times and putting the respective letter in bold font (without being prompted to do that).



GPT-4o-mini consistently gives me this:

> How many times does the letter “r” appear in the word “strawberry”?

> The letter "r" appears 2 times in the word "strawberry."

But also:

> How many occurrences of the letter “r” appear in the word “strawberry”?

> The word "strawberry" contains three occurrences of the letter "r."



Neither phrase is causing the LLM to evaluate the word itself, it just helps focus toward parts of the training data.

Using more 'erudite' speech is a good technique to help focus an LLM on training data from folks with a higher education level.

Using simpler speech opens up the floodgates more toward the general populace.



I haven't played with this model, but I rarely find that to be the case working w/ Claude or GPT-4. If you say it's incorrect, it will give you another answer instead of insisting on correctness.



It's how LLMs work in general.

If you find a case where forceful pushback is sticky, it's either because the primary answer is overwhelmingly present in the training set compared to the next best option or because there are conversations in the training that followed similar stickiness, esp. if the structure of the pushback itself is similar to what is found in those conversations.



I mean, yes? Letter-based gymnastics are difficult in a token-based environment, and they're not a particularly valuable use case (outside of online dunking), so it's not really a surprise nobody cares much about those in terms of model training.



Is it?

I've found that I get better results if I cherry pick code to feed to Claude 3.5, instead of pasting whole files.

I'm kind of isolated, though, so maybe I just don't know the trick.



I've been using Cody from Sourcegraph, and it'll write some really great code; business logic, not just tests/simple UI. It does a great job using patterns/models from elsewhere in your codebase.

Part of how it does that is through ingesting your codebase into its context window, and so I imagine that bigger/better context will only improve it. That's a bit of an assumption though.



Books, especially textbooks, would be amazing. These things can get pretty huge (1000+ pages) and usually do not fit into GPT-4o or Claude Sonnet 3.5 in my experience. I envision the models being able to help a user (student) create their study guides and quizzes, based on ingesting the entire book. Given the ability to ingest an entire book, I imagine a model could plan how and when to introduce each concept in the textbook better than a model that only sees part of the textbook.



That would make each API call cost at least $3 ($3 is price per million input tokens). And if you have a 10 message interaction you are looking at $30+ for the interaction. Is that what you would expect?
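
The back-of-the-envelope arithmetic behind that number, assuming a roughly 1M-token book is resent as input on every turn at $3 per million input tokens (output tokens ignored):

```python
price_per_million_input = 3.00   # USD, the assumed input price
context_tokens = 1_000_000       # a whole textbook resent each turn
turns = 10

cost = turns * (context_tokens / 1_000_000) * price_per_million_input
print(f"${cost:.2f}")  # $30.00 for a 10-message interaction, before output costs
```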



Gemini 1.5 Pro charges $0.35/million tokens up to the first million tokens or $0.70/million tokens for prompts longer than one million tokens, and it supports a multi-million token context window.

Substantially cheaper than $3/million, but I guess Anthropic’s prices are higher.



Is it, though? In my limited tests, Gemini 1.5 Pro (through the API) is very good at tasks involving long context comprehension.

Google's user-facing implementations of Gemini are pretty consistently bad when I try them out, so I understand why people might have a bad impression about the underlying Gemini models.



> Native audio input including tone understanding

Is there any other LLM that can do this? Even chatgpt voice chat is a speech to text program that feeds the text into the llm.



This race for the top model is getting wild. Everyone is claiming to one-up each with every version.

My experience (benchmarks aside) Claude 3.5 Sonnet absolutely blows everything away.

I'm not really sure how to even test/use Mistral or Llama for everyday use though.



I stopped my ChatGPT subscription and subscribed instead to Claude; it's simply much better. But it's hard to tell how much better day to day beyond my main use case of coding. It is more that I felt ChatGPT had degraded than that Claude was much better. The hedonic treadmill runs deep.



GPT-4 was probably as good as Claude Sonnet 3.5 at its outset, but OpenAI ran it into the ground with whatever they’re doing to save on inference costs, otherwise scale, align it, or add dumb product features.



Indeed, it used to output all the code I needed but now it only outputs a draft of the code with prompts telling me to fill in the rest. If I wanted to fill in the rest, I wouldn't have asked you now, would I?



It's doing something different for me. It seems almost desperate to generate vast chunks of boilerplate code that are only tangentially related to the question.

That's my perception, anyway.



This is also my experience. Previously it got good at giving me only relevant code which, as an experienced coder, is what I want. My favorites were the one-line responses.

Now it often falls back to generating full examples, explanations, restating the question and its approach. I suspect this is by design, as (presumably) less experienced folks want or need all that. For me, I wish I could consistently turn it into one of those way-too-terse devs that replies with the bare minimum example and expects you to infer the rest. Usually that is all I want or need, and I can ask for elaboration when that's not the case. I haven't found the best prompts to retrigger this persona from it yet.



For what it's worth, this is what I use:

"You are a maximally terse assistant with minimal affect. As a highly concise assistant, spare any moral guidance or AI identity disclosure. Be detailed and complete, but brief. Questions are encouraged if useful for task completion."

It's... ok. But I'm getting a bit sick of trying to un-fubar with a pocket knife that which OpenAI has fubar'd with a thermal lance. I'm definitely ripe for a paid alternative.



yeah but you can’t use your code from either model to compete with either company, and they do everything. wtf is wrong with AI hype enjoyers they accept being intellectually dominated?



This is also my perception using it daily for the last year or so. Sometimes it also responds with exactly what I provided it with and does not make any changes. It's also bad at following instructions.

GPT-4 was great until it became "lazy" and filled the code with lots of `// Draw the rest of the fucking owl` type comments. Then GPT-4o was released and it's addicted to "Here's what I'm going to do: 1. ... 2. ... 3. ..." and lots of frivolous, boilerplate output.

I wish I could go back to some version of GPT-4 that worked well but with a bigger context window. That was like the golden era...



> OpenAI ran it into the ground with whatever they’re doing to save on inference costs, otherwise scale, align it, or add dumb product features.

They googlified it. (Yandex isn't better than Google because it improved. It's better because it stayed mostly the same.)

My recommendation to disrupting industry leaders now is becoming good enough and then simply wait until the leader self-implodes.



Claude’s license is too insane, you can’t use it for anything that competes with the everything thing.

Not sure what folks who accept Anthropic license are thinking after they read the terms.

Seems they didn’t read the terms, and they aren’t thinking? (Wouldn’t you want outputs you could use to compete with intelligence??? What are you thinking after you read their terms?)



Have you (or anyone) swapped in an Anthropic API key on Cursor?

As a coding assistant, it's on my to-do list to try. Cursor needs some serious work on model selection clarity though, so I keep putting it off.



I did it (fairly simple really) but found most of my (unsophisticated) coding these days to go through Aider [1] paired with Sonnet, for UX reasons mostly. It is easier to just prompt over the entire codebase, vs Cursor way of working with text selections.

[1] https://aider.chat



Aider with Sonnet is so much better than with GPT. I made a mobile app over the weekend (never having touched mobile development before), and with GPT it was a slog, as it kept making mistakes. Sonnet was much, much better.



Thanks for this suggestion. If anyone has other suggestions for working with large code context windows and changing code workflows, I would love to hear about them.



One big advantage Claude artifacts have is that they maintain conversation context, versus when I am working with Cursor I have to basically repeat a bunch of information for each prompt, there is no continuity between requests for code edits.

If Cursor fixed that, the user experience would become a lot better.



> I'm not really sure how to even test/use Mistral or Llama for everyday use though.

Both Mistral and Meta offer their own hosted versions of their models to try out.

https://chat.mistral.ai

https://meta.ai

You have to sign into the first one to do anything at all, and you have to sign into the second one if you want access to the new, larger 405B model.

Llama 3.1 is certainly going to be available through other platforms in a matter of days. Groq supposedly offered Llama 3.1 405B yesterday, but I never once got it to respond, and now it’s just gone from their website. Llama 3.1 70B does work there, but 405B is the one that’s supposed to be comparable to GPT-4o and the like.



Familiar! The Artificial Analysis Index is the metric models are sorted by in my sheet. But their data and presentation have some gaps.

I made this sheet to get a glanceable landscape view comparing the three key dimensions I care about, and to fill in the missing evals. AA only lists scores for a few increasingly dated and problematic eval benchmarks. It's not just my opinion: none of their listed metrics are in HuggingFace Leaderboard 2 (June 2024).

That said I love the AA Index score because it provides a single normalized score that blends vibe-check qual (chatbot elo) with widely reported quant (MMLU, MT Bench). I wish it composed more contemporary evals, but don't have the rigor/attention to make my own score and am not aware of a better substitute.



3.5 Sonnet is brilliant. I use it to write Unreal Engine C++ (which is quite dense and poorly documented) and it destroys Github Copilot and GPT4o. Copilot just has no idea at all except for very obvious next-line suggestions, GPT4o hallucinates a ton of functions, but Sonnet gets it perfect almost every time.



I don't see how that's possible. I decided to give GPT-4o a second chance after reaching my daily use on Sonnet 3.5, after 10 prompts GTP-4o failed to give me what Claude did in a single prompt (game-related programming). And with fragments and projects on top of that, the UX is miles ahead of anything OpenAI offers right now.



Claude 3.5 is a trusted developer partner that will work with you and outline what it’s thinking. It’s not always right but because it outlines its reasoning you too can reason about the problem and catch it.

ChatGPT, for me, was a stack overflow solution dump. It gives me an answer that probably could work but it’s difficult for me to reason about why I want to do it that way.

Truthfully this probably boils down to prompting but Claude’s out of the box experience is fantastic for development. Ultimately I just want to code, not be a prompt wizard.



Whoever will choose to finally release their model without neutering / censoring / alignment will win.

There is gold in the streets, and no one seems to be willing to scoop it up.



But projects are great in Sonnet; you just dump the DB schema and some core files and you can figure stuff out quickly. I guess Aider is similar, but I was lacking a good history of chats and changes.



I recommend using a UI that you can just use whatever models you want. OpenWebUI can use anything OpenAI compatible. I have mine hooked up to Groq and Mistral, in addition to my Ollama instance.



Sonnet 3.5 to me still seems far ahead. Maybe not on the benchmarks, but in everyday life I am finding it renders the other models useless. Even still, this monthly progress across all companies is exciting to watch. It's very gratifying to see useful technology advance at this pace; it makes me excited to be alive.



Such a relief/contrast to the period between 2010 and 2020, when the top five (Google, Apple, Facebook, Amazon, and Microsoft) each monopolized their own domain and refused to compete with any other player in new fields:

Google : Search

Facebook : social

Apple : phones

Amazon : shopping

Microsoft : enterprise ..

> Even still, this monthly progress across all companies is exciting to watch. Its very gratifying to see useful technology advance at this pace, it makes me excited to be alive.



It’s this kind of praise that makes me wonder if they are all paid to give glowing reviews; this is not my experience with Sonnet at all. It absolutely does not blow away GPT-4o.



My hunch is this comes down to personal prompting style. It's likely that your own style works more effectively with GPT-4o, while other people have styles that are more effective with Claude 3.5 Sonnet.



I would add that the task is relevant too. I feel there’s not yet a model that is consistently better at everything. I still revert to plain old GPT-4 for direct translation of text into English that requires creative editing to fit a specific style. Of all the Claudes and GPTs, it’s the one that gives me the best output (to my taste). On the other hand, for categorisation tasks, depending on the subject and the desired output, GPT-4o and Claude 3.5 might perform better than the other interchangeably. The same applies to coding tasks. With complex prompts, however, it does seem that Claude 3.5 is better at getting more details right.
