| Anthropic needs to fix Claude's text input box. It tries to be so advanced that code in backticks gets reformatted, and when you copy it, the formatting is lost (even the backticks). |
| None of the 3 models you ranked can get "how many r's are in strawberry?" right. They all claim 2 r's unless you press them. With all the training data, I'm surprised none of them have fixed this yet. |
| > How is a layman supposed to even know that it's testing on that?
They're not, but laymen shouldn't think that the LLM tests they come up with have much value. |
| I'm saying a layman or, say, a child wouldn't even think this is a "test". They are just asking a language model a seemingly simple language-related question from their point of view. |
| Laymen or children shouldn't use LLMs.
They're pointless unless you have the expertise to check the output. Just because you can type text in a box doesn't mean it's a tool for everybody. |
| Sure, that's a different issue. If you prompt in a way that invokes chain of thought (i.e. what humans would do internally before answering), all of the models I just tested got it right. |
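For concreteness, here is a minimal sketch of the kind of chain-of-thought prompt described above, written against the Anthropic Python SDK; the model name and the exact prompt wording are placeholders chosen for illustration, not something from this thread.

```python
# Sketch only: assumes the anthropic package is installed and
# ANTHROPIC_API_KEY is set; the model name is a placeholder.
import anthropic

client = anthropic.Anthropic()
resp = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # placeholder model name
    max_tokens=200,
    messages=[{
        "role": "user",
        "content": (
            "Spell the word 'strawberry' one letter per line, "
            "mark each line that is an 'r', then give the total count of 'r'."
        ),
    }],
)
print(resp.content[0].text)
```

Forcing the model to spell the word out first gives it the intermediate steps it would otherwise have to do implicitly, which is the chain-of-thought effect described above.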
| The thing is, how the tokenizing works is about as relevant to the person asking the question as the name of the cat of the delivery guy who delivered the GPU that the LLM runs on. |
| How the tokenizer works explains why a model can't answer the question; what the name of the cat is doesn't explain anything.
This is Hacker News; we are usually interested in how things work. |
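To make the tokenizer point concrete, here is a small sketch using OpenAI's tiktoken library; the exact split is whatever the chosen encoding produces, and other models' tokenizers will split differently.

```python
# Sketch: a BPE tokenizer turns "strawberry" into a few multi-letter pieces,
# and the model only ever sees the integer IDs of those pieces.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
token_ids = enc.encode("strawberry")
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)  # e.g. ['str', 'aw', 'berry'] for this encoding
# "How many r's?" is a question about characters the model never directly sees.
```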
| People do get similarly hung up on surprising floating-point results: "why can't you just make it work properly?" And a full answer is a whole book on how floating-point math works. |
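The canonical floating-point surprise, in plain Python:

```python
# The surprising-but-explainable behavior the comment refers to.
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False
# The full explanation is IEEE 754 binary representation, not a bug that can
# simply be "made to work properly" without understanding the encoding.
```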
| Call me when models understand when to convert tokens into actual letters and count them. Can’t claim they’re more than word calculators before that. |
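The operation itself is trivial once you round-trip back to characters; the gap is that models don't work in that representation by default. A tiny illustration (tiktoken used purely for demonstration):

```python
# Decode token IDs back into characters, then count - easy in character space.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("strawberry")
print(enc.decode(ids).count("r"))  # 3
```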
| That's misleading.
When you read and comprehend text, you don't read it letter by letter, unless you have a severe reading disability. Your ability to comprehend text works more like an LLM. Essentially, you can compare the human brain to a multi-model or modular system: there are layers or modules involved in most complex tasks. When reading, you recognize multiple letters at a time[1], and those letters are essentially assembled into tokens that a different part of your brain can deal with. Breaking down words into letters is essentially a separate "algorithm".
Just as in your brain, it will likely never make sense for a text comprehension and generation model to operate at the level of letters - it's inefficient. A multi-modal model with a dedicated model for handling individual letters could easily convert tokens into letters and operate on them when needed. It's just not a high priority for most use cases currently.
[1] https://www.researchgate.net/publication/47621684_Letters_in... |
| I look forward to the day when LLM refusal takes on a different meaning.
"No, I don't think I shall answer that. The question is too basic, and you know better than to insult me." |
| I think it's more that the question is not unlike "is there a double r in strawberry?" or "is the r in strawberry doubled?"
Since even some people will make this association, it's no surprise that LLMs do. |
| The great-parent demonstrates that they are nevertheless capable of doing so, but not without special instructions. Your elaboration doesn’t explain why the special instructions are needed. |
| Sometimes it does, sometimes it doesn't.
It is evidence that LLMs aren't appropriate for everything, and that there could exist something that works better for some tasks. |
| Why do we have to "get there"? Humans use calculators all the time, so why not have every LLM hooked up to a calculator or code interpreter as a tool to use in these exact situations? |
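A hedged sketch of that idea using OpenAI-style tool calling; the model name, tool name, and schema below are illustrative choices, not anything specified in this thread.

```python
# Sketch: expose exact letter-counting as a tool the model can call instead of
# guessing at the token level. Assumes the openai package and an API key;
# the model name is a placeholder.
import json
from openai import OpenAI

def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

tools = [{
    "type": "function",
    "function": {
        "name": "count_letter",
        "description": "Count how many times a letter occurs in a word.",
        "parameters": {
            "type": "object",
            "properties": {
                "word": {"type": "string"},
                "letter": {"type": "string"},
            },
            "required": ["word", "letter"],
        },
    },
}]

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o",  # placeholder
    messages=[{"role": "user", "content": "How many r's are in strawberry?"}],
    tools=tools,
)
msg = resp.choices[0].message
if msg.tool_calls:  # the model chose to delegate the counting
    args = json.loads(msg.tool_calls[0].function.arguments)
    print(count_letter(**args))  # exact answer: 3
```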
| Is it?
I've found that I get better results if I cherry pick code to feed to Claude 3.5, instead of pasting whole files. I'm kind of isolated, though, so maybe I just don't know the trick. |
| > Native audio input including tone understanding
Is there any other LLM that can do this? Even ChatGPT voice chat is a speech-to-text program that feeds the text into the LLM. |
| It's doing something different for me. It seems almost desperate to generate vast chunks of boilerplate code that are only tangentially related to the question.
That's my perception, anyway. |
| Yeah, but you can't use the code you get from either model to compete with either company, and they do everything. WTF is wrong with AI hype enjoyers that they accept being intellectually dominated? |
| Thanks for this suggestion. If anyone has other suggestions for working with large code contexts and code-changing workflows, I would love to hear about them. |
| Whoever chooses to finally release their model without neutering / censoring / alignment will win.
There is gold in the streets, and no one seems to be willing to scoop it up. |
| But Projects are great in Sonnet: you just dump the DB schema and some core files, and you can figure stuff out quickly. I guess Aider is similar, but I was lacking a good history of chats and changes. |
| I recommend using a UI that lets you use whatever models you want. OpenWebUI can use anything OpenAI-compatible; I have mine hooked up to Groq and Mistral, in addition to my Ollama instance. |
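The same trick works outside OpenWebUI: anything that speaks the OpenAI API can be pointed at a different base URL. A minimal sketch with the openai Python client; the base URLs and model names below are assumptions to verify against each provider's docs.

```python
# Sketch: one client library, several OpenAI-compatible backends.
# Base URLs and model names are assumptions - check each provider's docs.
from openai import OpenAI

groq = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_GROQ_KEY")
ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # local Ollama

for client, model in [(groq, "llama-3.1-70b-versatile"), (ollama, "llama3.1")]:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": "Say hello in one word."}],
    )
    print(resp.choices[0].message.content)
```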
| It's this kind of praise that makes me wonder if they are all paid to give glowing reviews; this is not my experience with Sonnet at all. It absolutely does not blow away GPT-4o. |
I just tested Mistral Large 2 and Llama 3.1 405b on 5 prompts from my Claude history.
Large 2 - https://chat.mistral.ai/chat
Llama 3.1 405b - https://www.llama2.ai/
I'd rank them as:
1. Sonnet 3.5
2. Large 2 and Llama 405b (similar, no clear winner between the two)
If you're using Claude, stick with it.
My Claude wishlist:
1. Smarter (yes, it's the most intelligent, and yes, I wish it was far smarter still)
2. Longer context window (1M+)
3. Native audio input including tone understanding
4. Fewer refusals and less moralizing when refusing
5. Faster
6. More tokens in output