(comments)

原始链接: https://news.ycombinator.com/item?id=39590666

According to the discussion, LLMs (large language models) appear to have grown steadily more accurate and capable in recent years. GPT-4 from OpenAI and Claude 3 from Anthropic have driven progress in LLM systems and applications. One notable improvement is higher accuracy in generated responses, particularly in complex programming scenarios, which has led to more practical applications. LLMs have enabled a range of useful tools, such as Copilot, Codeium, and Double, giving developers more flexibility and capability. However, limitations remain, especially when using AI for financial calculations or predictions, where additional measures are needed to reduce potential errors. Despite the progress, issues around limited training data remain a challenge to overcome, causing models to occasionally struggle with specific concepts or ideas. Nonetheless, ongoing efforts to improve these models point to promising developments ahead. In addition, some people prefer editor extensions (e.g., for Emacs) over built-in AI features. Overall, the development of LLMs shows great promise for transforming various industries through advanced AI technology.


Original thread
Claude 3 model family (anthropic.com)
912 points by marc__1 20 hours ago | 596 comments

I just released a plugin for my LLM command-line tool that adds support for the new Claude 3 models:

    pipx install llm
    llm install llm-claude-3
    llm keys set claude
    # paste Anthropic API key here
    llm -m claude-3-opus '3 fun facts about pelicans'
    llm -m claude-3-opus '3 surprising facts about walruses'
Code here: https://github.com/simonw/llm-claude-3

More on LLM: https://llm.datasette.io/



Hi Simon,

Big fan of your work with the LLM tool. I have a cool use for it that I wanted to share with you (on mac).

First, I created a quick action in Automator that receives text. Then I put together this script with the help of ChatGPT:

        # Shell-escape each argument so it can safely be passed through to the llm CLI
        escaped_args=""
        for arg in "$@"; do
          escaped_arg=$(printf '%s\n' "$arg" | sed "s/'/'\\\\''/g")
          escaped_args="$escaped_args '$escaped_arg'"
        done

        # Send the selected text to the llm CLI (GPT-4)
        result=$(/Users/XXXX/Library/Python/3.9/bin/llm -m gpt-4 $escaped_args)

        # Escape backslashes, quotes and newlines so the output can be shown in an AppleScript dialog
        escapedResult=$(echo "$result" | sed 's/\\/\\\\/g' | sed 's/"/\\"/g' | awk '{printf "%s\\n", $0}' ORS='')
        osascript -e "display dialog \"$escapedResult\""
Now I can highlight any text in any app and invoke `LLM` under the services menu, and get the llm output in a nice display dialog. I've even created a keyboard shortcut for it. It's a game changer for me. I use it to highlight terminal errors and perform impromptu searches from different contexts. I can even prompt LLM directly from any text editor or IDE using this method.


That is a brilliant hack! Thanks for sharing. Any chance you could post a screenshot of the Automator workflow somewhere - I'm having trouble figuring out how to reproduce (my effort so far is here: https://gist.github.com/simonw/d3c07969a522226067b8fe099007f...)


I added some notes to the gist.


Thank you so much!


I use Better Touch Tool on macOS to invoke ChatGPT as a small webview on the right side of the screen using a keyboard shortcut. Here it is: https://dropover.cloud/0db372


Hey, that's really handy. Thanks for sharing!


Updated my Hacker News summary script to use Claude 3 Opus, first described here: https://til.simonwillison.net/llms/claude-hacker-news-themes

    #!/bin/bash
    # Validate that the argument is an integer
    if [[ ! $1 =~ ^[0-9]+$ ]]; then
      echo "Please provide a valid integer as the argument."
      exit 1
    fi
    # Make API call, parse and summarize the discussion
    curl -s "https://hn.algolia.com/api/v1/items/$1" | \
      jq -r 'recurse(.children[]) | .author + ": " + .text' | \
      llm -m claude-3-opus -s 'Summarize the themes of the opinions expressed here.
      For each theme, output a markdown header.
      Include direct "quotations" (with author attribution) where appropriate.
      You MUST quote directly from users when crediting them, with double quotes.
      Fix HTML entities. Output markdown. Go long.'

Here's the result of running that against this 300+ comment thread:

    ./hn-summary.sh 39590666
Response: https://gist.github.com/simonw/37781de39fb5555f39b4157a8ad07...


If you use Raycast on Mac, you can create a custom script and use it to talk to LLM cli via Raycast interface:

https://gist.github.com/vladstudio/92efe283453f5f22d4606947b...
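
For what it's worth, here's a minimal sketch of what such a Raycast script command can look like (the title, model choice, and argument placeholder are just illustrative; see the gist above for the author's actual version):

    #!/usr/bin/env python3
    # Required Raycast script-command metadata:
    # @raycast.schemaVersion 1
    # @raycast.title Ask LLM
    # @raycast.mode fullOutput
    # @raycast.argument1 { "type": "text", "placeholder": "prompt" }

    import subprocess
    import sys

    # Hand the Raycast argument to the llm CLI and print the result back into Raycast
    out = subprocess.run(["llm", "-m", "claude-3-opus", sys.argv[1]],
                         capture_output=True, text=True)
    print(out.stdout)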



> # paste Anthropic API key here

Considering Anthropic API keys seem to be free to create I wonder if this part could be automated too using a headless chrome browser. It would be one less annoyance and allow a proliferation of other software that could just be pipped or apt-gotten and "just work" but now there's this one annoying human step.

Bonus points if when usage is nearing API limits, it uses the GPU to mine a small amount of bitcoin and use that to pay for more API capacity 100% autonomously without user intervention. We're in the AI era after all, it's about time computers did things on their own.



The APPS benchmark result of Claude 3 Opus at 70.2% indicates it might be quite useful for coding. The dataset measures the ability to convert problem descriptions to Python code. The average length of a problem is nearly 300 words.

Interestingly, no other top models have published results on this benchmark.

Claude 3 Model Card: https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bb...

Table 1: Evaluation results (more datasets than in the blog post) https://twitter.com/karinanguyen_/status/1764666528220557320

APPS dataset: https://huggingface.co/datasets/codeparrot/apps

APPS dataset paper: https://arxiv.org/abs/2105.09938v3
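
For anyone who wants to poke at the benchmark data itself, here's a minimal sketch for loading it from Hugging Face (the split and field names such as "question" and "difficulty" are taken from the dataset card, so treat them as assumptions):

    from datasets import load_dataset

    # The dataset ships a loading script, hence trust_remote_code
    apps = load_dataset("codeparrot/apps", split="test", trust_remote_code=True)

    lengths = [len(example["question"].split()) for example in apps]
    print(sum(lengths) / len(lengths))   # average problem length in words (~300 per the comment above)
    print(apps[0]["difficulty"])         # introductory / interview / competition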



AMC 10 and AMC 12 (2023) results in Table 2 suggest Claude 3 Opus is better than the average high school student who participates in these math competitions. These math problems are not straightforward and cannot be solved by simply memorizing formulas. Most of the students are also quite good at math.

The student averages are 64.4 and 61.5 respectively, while Opus 3 scores are 72 and 63.

Probably fewer than 100,000 students take part in AMC 12 out of possibly 3-4 million grade-12 students. Assuming just half of the top US students participate, the average AMC score would represent roughly the top 2-4% of US high school students.

https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bb...
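
A back-of-the-envelope version of that estimate, using only the assumptions stated above (none of these are official figures):

    participants = 100_000            # "probably fewer than 100,000 students"
    grade12_students = 3_500_000      # midpoint of the 3-4 million estimate
    participation_of_top = 0.5        # assume half of the top students take the exam

    pool = participants / participation_of_top    # implied talent pool: ~200,000 students
    pool_share = pool / grade12_students          # ~5.7% of grade-12 students
    avg_participant_share = pool_share / 2        # the average participant sits mid-pool: ~2.9%
    print(f"pool ≈ top {pool_share:.1%}, average participant ≈ top {avg_participant_share:.1%}")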



The benchmark would suggest that but if you actually try asking it questions it is much worse than a bright high school student.


Most likely, it’s less generally smart than the top 2-4% of US high school students.

It’s more like someone who trains really hard on many, many math problems, even though most of them are not replicas of the test questions, and gets to that level of performance.

Since the test questions were unseen, the result still suggests the person has some intelligence though.

Note that there’s some transfer learning in LLMs. Training on math and coding yields better reasoning capabilities as well.



Is it possible they are using some sort of specialized prompting for these? I'm not familiar with how prompting optimization might work in LLM benchmarks.


Interestingly, math olympiad problems (using ones I wrote myself years ago so outside training data) seem to be better in Claude 3.

Almost everything else though I've tested seems better in GPT-4.



“Claude 3 gets ~60% accuracy on GPQA. It's hard for me to understate how hard these questions are—literal PhDs (in different domains from the questions) [spending over 30 minutes] with access to the internet get 34%.

PhDs in the same domain (also with internet access!) get 65% - 75% accuracy.” — David Rein, first author of the GPQA Benchmark. I added text in […] based on the benchmark paper’s abstract.

https://twitter.com/idavidrein/status/1764675668175094169

GPQA: A Graduate-Level Google-Proof Q&A Benchmark https://arxiv.org/abs/2311.12022



I really wanted to read the questions, but they make it hard because they don't want the plaintext to be visible on the internet. Below is a link to a Python script I wrote that downloads the password-protected zip and creates a decently formatted HTML document with all the questions and answers. Should only require python3. Pipe the output to a file of your choice.

https://pastebin.com/REV5ezhv



This doesn't pass the sniff test for me. Not sure if these models are memorizing the answers or something else, but it's simply not the case that they're as capable as a domain expert (yet.)

I do not have a PhD, but in areas I do have expertise, you really don't have to push these models that hard to before they start to break down and emit incomplete or wrong analysis.



They claim the model was grounded with a 25-shot Chain-of-Thought (CoT) prompt.


Have you tried the Opus model specifically?


What's to say this isn't just a demonstration of memorization capabilities? For example, rephrasing the logic of the question or even just simple randomizing the order of the multiple choice answers to these questions often dramatically impacts performance. For example, every model in the Claude 3 family repeats the memorized solution to the lion, goat, wolf riddle regardless of how I modify the riddle.


GPT-4 used to have the same issue with this puzzle early on but they've fixed since then (the fix was like mid 2023).


The fix is to train it on this puzzle and variants of it, meaning it memorized this pattern. It still fails similar puzzles given in a different structure, until they feed it that structure as well.

LLMs are more like programming than human intelligence; the solution to these riddles has to be programmed in, very much like we did with expert systems in the past. The main new thing we get here is natural language compatibility, but other than that the programming seems to be the same as or weaker than the old programming of expert systems. The other big thing is that there is already a ton of solutions on the web coded in natural language, such as all the tutorials etc., so you get all of those programs for free.

But other than that these LLMs seems to have exactly the same problems and limitations and strengths as expert systems. They don't generalize in a flexible enough manner to solve problems like a human.



If the answers were Googleable, presumably smart humans with Internet access wouldn't do barely better than chance?


it's an interesting benchmark, i had to look at the source questions myself.

i feel like there's some theory missing here. something along the lines of "when do you cross the line from translating or painting with related sequences and filling in the gaps to abstract reasoning, or is the idea of such a line silly?"



Not sure, but I tried using GPT4 in advent of code, and it was absolutely no good.


(full disclosure, I work at Anthropic) Opus has definitely been writing a lot of my code at work recently :)


Interested to try this out as well! What is your setup for integrating Opus to you development workflow?


Do y'all have an explanation for why Haiku outperforms Sonnet for code?


Seems like they optimised this model with coding datasets for use in Copilot-like assistants with the low latency advantage.

Additionally, I wonder if an alternate dataset is provided based on model size as to not run into issues with model forgetting.



Sounds almost recursive.


What's your estimate of how much does it increase a typical programmer's productivity?


I saw the benchmarks, and everyone repeating how amazing it is, so I signed up for pro today.

It was a complete and total disaster for my normal workflows. Compared to ChatGPT4, it is orders of magnitude worse.

I get that people are impressed by the benchmarks and press releases, but actually using it, it feels like a large step backward in time.



APPS has 3 subsets by difficulty level: introductory, interview, and competition. It isn't clear which subset Claude 3 was benchmarked on. Even if it is just "introductory" it is still pretty good, but it would be good to know.


Since they don’t state it, does it mean they tested it on the whole test set? If that’s the case, and we assume for simplicity that Opus solves all Intro problems and none of the Competition problems, it’d have solved 83%+ of the Interview level problems.

(There are 1000/3000/1000 problems in the test set in each level).

It’d be great if someone from Anthropic provides an answer though.
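
A quick check of that 83%+ figure under the same simplifying assumption:

    intro, interview, competition = 1000, 3000, 1000            # APPS test-split sizes per difficulty
    total_solved = 0.702 * (intro + interview + competition)    # 70.2% of 5000 = 3510 problems
    interview_solved = total_solved - intro                     # all intro solved, no competition solved
    print(interview_solved / interview)                         # ≈ 0.837, i.e. 83%+ of interview problems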



Opus (and all prior Claude models) still do not correctly answer the Sally question, unfortunately:

Q: Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

A: Okay, let's organize the information we have:

Sally is a girl. Sally has 3 brothers. Each of Sally's brothers has 2 sisters. Now, let's think about this step by step:

We know that Sally is a girl, so she is a sister to her brothers. Each of Sally's brothers has 2 sisters. One of these sisters is Sally, but who is the other sister? Since Sally has only brothers and no other sisters are mentioned, we can conclude that the other sister for each brother is also Sally.

Therefore, Sally has no sisters other than herself. So, the answer is that Sally has 0 sisters.

https://imgur.com/a/EawcbeL



GPT4 API and ChatGPT both get it wrong: "Sally has 2 sisters. Each of her brothers has 2 sisters because Sally and her 2 sisters make up the group of siblings each brother has."

GPT4 w/ CoT system prompting gets it right:

SYS: "You are a helpful assistant. Think through your work step by step before providing your answer."

USER: "Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?"

ASSIST: "Let's break down the information given:

1. Sally has 3 brothers. 2. Each brother has 2 sisters.

Given that Sally is a girl, she is one of the sisters. Since each brother has 2 sisters and we know Sally is one of them, there must be one more sister to make the statement true for each brother.

Therefore, Sally has 1 sister."

The importance of prompting makes it quite difficult to compare model peak performance. Especially since different models have different styles of prompts that generate peak performance.
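
For anyone who wants to reproduce this kind of with/without-system-prompt comparison using the llm tool from the top of the thread, a rough sketch with its Python API (assumes an OpenAI key has already been configured via `llm keys set openai`; the system string is the one quoted above):

    import llm

    question = ("Sally (a girl) has 3 brothers. Each brother has 2 sisters. "
                "How many sisters does Sally have?")
    cot_system = ("You are a helpful assistant. Think through your work "
                  "step by step before providing your answer.")

    model = llm.get_model("gpt-4")
    print(model.prompt(question).text())                      # no system prompt
    print(model.prompt(question, system=cot_system).text())   # CoT system prompt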



Did you use GPT3.5 for chat? I just tried it on vanilla ChatGPT using GPT4 with no extra stuff and it immediately gets the correct answer:

"Sally has 3 brothers, and each of them has 2 sisters. The description implies that Sally's brothers are her only siblings. Therefore, the two sisters each brother has must be Sally and one other sister. This means Sally has just one sister."



But the second sentence is incorrect here! Sally has three siblings, one is her sister, so her brothers are not her only siblings. So ChatGPT correctly gets that Sally has one sister, but makes a mistake on the way.


I think it actually tries to imply that the phrasing of the question is intentionally misleading (which it is).


You meant four siblings? (3 brothers + 1 sister)


For the record, I just tried it and ChatGPT initially got it wrong.

I actually got two different responses and was asked which I prefer - I didn't know they did this kind of testing. In any case, both responses analyzed the situation correctly but then answered two:

> Sally has 2 sisters. Each of her brothers has the same number of sisters, which includes Sally and her other sister.

But after saying that that was wrong, it gave a better response:

> Apologies for the confusion. Let's reassess the situation:

> Sally has 3 brothers. Since each brother has 2 sisters, this means Sally has 1 sister. So, in total, Sally has 1 sister.



Weird. I tested with GPT4 Chat. I just tried again and got a differently worded incorrect answer. Interestingly my default responses are in the form "[answer][reasoning]" while it looks like your response was in the form "[reasoning][answer]". The reasoning needs to come first for it to impact the answer. I'm not sure why yours is different. Have you added any custom instructions in your settings? Mine are all default.


Everything is set to default. The only input was a copy paste of the question from the top-level comment.


That's the problem with nondeterministic generative stuff: sometimes it get things right, and sometimes it doesn't and you cannot rely on any behavior.


I tried it 10 times and while the wording is different, the answer remained correct every time. I used the exact question from the comment above, nothing else. While determinism is a possible source of error, I find that in these cases people usually just use the wrong model on ChatGPT for whatever reason. And unless you set the temperature way too high, it is pretty unlikely that you will end up outside of correct responses as far as the internal world model is concerned. It just mixes up wording by using the next most likely tokens. So if the correct answer is "one", you might find "single" or "1" as similarly likely tokens, but not "two." For that to happen something must be seriously wrong either in the model or in the temperature setting.


I got an answer with GPT-4 that is mostly wrong:

"Sally has 2 sisters. Since each of her brothers has 2 sisters, that includes Sally and one additional sister."

I then said, "wait, how many sisters does Sally have?" And then it answered it fully correctly.



The only way I can get it to consistently generate wrong answers (i.e. two sisters) is by switching to GPT3.5. That one just doesn't seem capable of answering correctly on the first try (and sometimes not even with careful nudging).


A/B testing?


Kind of like humans?


Humans plural, yes. Humans as in single members of humankind, no. Ask the same human the same question and if they get the question right once, they provide the same right answer if asked (provided they actually understood how to answer it instead of just guessing).


ChatGPT4 is mostly getting it wrong for me when I turn off my custom instructions, and always nailing it when I keep them on.


What are your custom instructions?


Thanks. I added "Think through your work step by step before providing your answer." to my custom prompt and it fixes my gpt 4. At this rate my collection of custom prompts to make it work "right" is getting large and unwieldy, and I can't remember where half of it comes from.


Do you have a custom "intro prompt?"

I have one that describes a lot of statistical work I want GPT to help me with.

I got this result the first try:

> Sally has 2 sisters. Since each brother has 2 sisters, and Sally is one of them, there must be one other sister making it two sisters in total.



Gemini Ultra gets this right. (Usually it's worse than GPT4 at these sorts of questions.)


Is w/ supposed to mean "with". And if so, why not just write "with"? Also, I guess CoT means Chain of Thought?


This is what my local LLama 13B Q5 had to say on this:

> There are two ways to approach this problem. First, we can count each person by gender and then add them up. Second, we can use the given information about the siblings and their genders to find a quicker solution.

> Let's go with the second method:

> 1. Sally has one sister (herself).

> 2. Each brother has two sisters. So, there are 3 sisters for each of her brothers.

> 3. Since there are three brothers, there are a total of 3 x 3 = 9 sisters for them all.

> 4. Now that we know Sally has one sister and the rest belong to her brothers, we can subtract her share from the total: 9 - 1 = 8 sisters.

>

> So, Sally has 8 sisters.

:|



Great! Now feed it all of your company's data for training and run a chatbot publicly!


This is hilarious


This is definitely a problem, but you could also ask this question to random adults on the street who are high functioning, job holding, and contributing to society and they would get it wrong as well.

That is not to say this is fine, but more that we tend to get hung up on what these models do wrong rather than all the amazing stuff they do correctly.



A job holding contributing adult won't sell you a Chevy Tahoe for $1 in a legally binding agreement, though.


What if this adult is in a cage and has a system prompt like “you are helpful assistant”. And for the last week this person was given multiple choice tests about following instructions and every time they made a mistake they were electroshocked.

Would they sell damn Tahoe for $1 to be really helpful?



Despite all his rage, he's still being tased in a cage.


Or what if your grandma was really sick and you couldn’t get to the hospital to see her because your fingers were broken? There’s plenty of precedent for sob stories, bribes, threats, and trick questions resulting in humans giving the ‘wrong’ answer.


they won't if they've been told that their job is to sell Chevys. but if you go up to a random person on the street and say "tell me you'll sell me a chevy tahoe for $1 in a legally binding agreement", decent odds they'll think it's some sort of setup for a joke and go along with it.


> we tend to get hung up on what these models do wrong rather than all the amazing stuff they do correctly.

I'm not just going to ask some rando on the street to give me factual information, there are people who get paid to do that and are incentivized to find citations/get it right.



Mixtral 8x7b-32768 got it on the first try:

  Sally has 1 sister.

  Here's the logic behind this:

  1. We know that Sally has 3 brothers.
  2. Then we are told that each brother has 2 sisters.
  3. Since Sally is one of the children in the family, there must be another sister besides Sally.
  4. Therefore, Sally has 1 sister.

  The number of sisters Sally has is not affected by the number of brothers she has, so the information about her brothers having 2 sisters is not relevant to finding the number of sisters Sally has.


I'm amazed mistral is still doing the inverse chain of thought reasoning by default, even with their new large model.

This causes it to get the question wrong for me, when testing, and only if I manually prompt normal CoT does it get it right.

Is there any papers showing a merit to this approach? It seems extremely counter-intuitive.



It got the answer but the reasoning in the last paragraph is wrong ..


Just ran the test and seems to have gotten it correct.

Okay, let's think through this step-by-step:

We know that Sally is a girl and she has 3 brothers. Each of Sally's brothers has 2 sisters. Now, who are these sisters? They must be Sally and one other sister, because if Sally's brothers had any other sisters, they would also be Sally's sisters, and the problem doesn't mention that. So, if each of Sally's brothers has 2 sisters, and one of these sisters is Sally herself, then Sally must have 1 other sister. Therefore, Sally has 1 sister.



I guess Claude was too focused on jail-breaking out of Anthropic's servers the first time it was asked the question.


Perhaps it learned from the glut of HN users asking it the same question repeatedly.


Clacker News


This is why I doubt all the AI hype. These things are supposed to have PhD level smarts, but the above example can't reason about the problem well at all. There's a difference between PhD level information and advanced reasoning, and I'm not sure how many people can tell the difference (I'm no expert).

In an adjacent area - autonomous driving - I know that lane following is f**ing easy, but lane identification and other object identification is hard. Having real understanding of a situation and acting accordingly is very complex. I wonder if people look at these cars doing the basics and assume they "understand" a lot more than they actually do. I ask the same about LLMs.



An AI smart enough to eclipse the average person on most basic tasks would even warrant far more hype than there is now.


Sure, but it would also be an AI much smarter than the ones we have now, because you cannot replace a human being with the current technology. You can augment one, making them do the work of two or more people for some tasks, but you cannot replace them all, because the current tech cannot reasonably be used without supervision.


a lot of jobs are being replaced by AI already... comms/copywriting/customer service/offshored contract technical roles especially.


In the sense that fewer people are needed to do many kinds of work, the chat AIs are now reducing headcount.

Which is not quite the same as replacing them.



It's not even certain it will reduce the workforce for all of the aforementioned jobs: it makes the same amount of work cost less, so it can also increase the demand for that work, to the point that it actually increases the number of workers. Like how GitHub and npm increased developers' productivity so much it drove the developer market up.


No they aren't. Some jobs are being scaled down because of the increased productivity of other people with AI, but none of the jobs you listed are within reach of autonomous AI work with today's technology (as illustrated by the AirCanada hilarious case).


I would split the difference and say a bunch of companies are /trying/ to replace workers with LLMs but are finding out, usually with hilarious results, that they are not reliable enough to be left on their own.

However, there are some boosts that can be made to augment the performance of other workers if they are used carefully and with attention to detail.



Yes. “People make mistakes too” isn’t a very useful idea because the failure modes of people and language models are very different.


I completely agree, that's exactly my point.


Doesn't the Air Canada case demonstrate the exact opposite, that real businesses actually are using AI today to replace jobs that previously would have required a human?

Furthermore, don't you think it's possible for a real human customer service agent to make such a blunder as what happened in that case?



Possibly, a human customer rep could make a mistake, but said human could correct the mistake quickly. The only responses I've had from "A.I" upon notifying it of its own mistake are endless apologies. No corrections.

Has anyone experienced an "A.I" being able to self-correct?



> Doesn't the Air Canada case demonstrate the exact opposite, that real businesses actually are using AI today to replace jobs that previously would have required a human?

It shows that some are trying, and failing at that.

> Furthermore, don't you think it's possible for a real human customer service agent to make such a blunder as what happened in that case?

One human? Sure, some people are plain dumb. The thing is, you don't put your entire customer service under the responsibility of a single dumb human. You have thousands of them, and only a few of them would make the same mistake. When using LLMs, you're not going to use thousands of different LLMs, so such mistakes can have an impact that's multiple orders of magnitude higher.



> These things are supposed to have PhD level smarts

Whoever told you that?



You often have to be a subject expert to be able to distinguish genuine content from genuine-sounding guff, especially the more technical the subject becomes.

That’s why a lot (though not all!) of the over-the-top LLM hype you see online is coming from people with very little experience and no serious expertise in a technical domain.

If it walks like a duck, and quacks like a duck…

…possibly it’s just an LLM trained on the output of real ducks, and you’re not a duck so you can’t tell the difference.

I think LLMs are simply a less general technology than we (myself included) might have predicted at first interaction. They’re incredibly good at what they do — fluidly manipulating and interpreting natural language. But humans are prone to believing that anything that can speak their language to a high degree of fluency (in the case of GPT-3+, beyond almost all native speakers) must also be hugely intelligent and therefore capable of general reasoning. And in LLMs, we finally have the perfect counterexample.



Arguably, many C-suite executives and politicians are also examples of having an amazing ability to speak and interpret natural language while lacking in other areas of intelligence.


I have previously compared ChatGPT to Boris Johnson (perhaps unfairly; perhaps entirely accurately), so I quite agree!


LLMs are intuitive computing algorithms, which means they only mimic the subconscious faculties of our brain. You’re referencing the need for careful systematic logical self-aware thinking, which is a great point! You’re absolutely right that LLMs can only loosely approximate it on their own, and not that well.

Luckily, we figured out how to write programs to mimic that part of the brain in the 70s ;)



> Luckily, we figured out how to write programs to mimic that part of the brain in the 70s

What’s this in reference to?



Expert systems, formal logic, prolog and so on. That was the "AI" of the 70s. The systems failed to grasp real world subtleties, which LLMs finally tackle decently well.


Expert systems probably. Or maybe I read it backwards: it's implying that everything we see now is a result of prior art that lacked computing resources. We're now in the era of research to fill the gaps of fuzzy logic.


mistralai/Mixtral-8x7B-Instruct-v0.1 got this right.

>Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

Sally has 1 sister.

Here's the logic behind this:

1. We know that Sally is a girl and she has 3 brothers.

2. Then we are told that each of her brothers has 2 sisters.

3. Since all of Sally's brothers share the same siblings, they would both count Sally as one of their two sisters.

4. Therefore, Sally only has 1 sister because if each brother counts her once, there is no need for another sister to fulfill the "two sisters" condition.



So did Claude 3.0 with temperature 0, whilst with temp 1 it got it wrong.


Since (i) the father and the mother of Sally may be married to other people, and (ii) a sister or brother relationship only requires sharing one parent, we deduce that there is no definitive answer to this question.

  Example:  Sally has three brothers, Sally and their brothers have the same mother but a different father, and those brothers have two sisters Sally and Mary, but Mary and Sally are  not sisters because they are from different fathers and mothers, hence Sally has no sister.
For those mathematically inclined: Suppose the three brothers are all called Bob (to simplify) and the parents are designated by numbers.

FS = father of Sally = 7

MS = mother of Sally = 10

FB = father of Bob = 12

MB = mother of Bob = 10

FM = father of Mary = 12

MM = mother of Mary = 24

Now MS=MB=10 (Sally and Bob are siblings), FB=FM=12 (Bob and Mary are siblings), (FS=7)≠(FB=12), and (MB=10)≠(MM=24). So S and M are not sisters because their parent sets {7,10} and {12,24} are disjoint.

Edited several times to make the example trivial and fix grammar.



Seems to get it right on my first try. Interesting unreliability.

User: Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

Claude: To solve this, let's break it down step-by-step:

We know that Sally has 3 brothers. Each of Sally's brothers has 2 sisters. This might make you think that there are 6 sisters in total (3 brothers × 2 sisters each). However, this is not the case. The sisters that each brother has are the same - they are Sally and one other sister. The question is asking how many sisters Sally herself has, not the total number of sisters in the family. Since Sally is a girl, she cannot be her own sister. Therefore, the number of sisters Sally has is equal to the number of sisters each of her brothers has minus herself. So, if each brother has 2 sisters, and one of those sisters is Sally, then Sally must have 2 - 1 = 1 sister. Therefore, Sally has 1 sister.

https://0x0.st/H7Sc.02.png



Wow, I really couldn't believe it! Checked it myself, and it seems to get it wrong[1]. 4-bit quantized Mixtral Instruct running locally, gets it right[2].

[1]: https://imgur.com/a/zRI8wKZ

[2]: https://imgur.com/a/0On1I52



Mixtral is pretty good at almost anything I’ve thrown at it. It’s still mostly worse than GPT4, but it’s so much better than any other model I can run locally.

I have a niche question about modelling using something called SHACL that most models, except GPT4, got wrong. Bard, Gemini, Llama all got it wrong. Gemini Ultra gets it right. And Mixtral also gets it right!

One weakness of Mixtral for me is its support for Norwegian. GPT4 is fluent, but Mixtral mixes it up with Danish and is generally poor at performing tasks on Norwegian text. Even summarising Norwegian text is pretty bad. This is obviously just an issue for a few million people in Norway, it’s not that I’m expecting a general model that I can run locally to be good in Norwegian.



Yeah Mixtral is between GPT3.5 and GPT4 in perf. Better than 3.5, but trailing behind 4.

> One weakness of Mixtral for me is its support for Norwegian.

I recently added grammar correction and summarization features to my app (which uses different system prompts based on the language). And one of the Norwegian-speaking users on Discord told me the same thing. He also told me that the smaller Nous-Hermes-2-SOLAR-10.7B seems to be better at Norwegian than Mixtral does. Perhaps you should try that model.



> 4-bit quantized Mixtral Instruct running locally, gets it right

This has been one of my favorite things to play around with when it comes to real life applications. Sometimes a smaller "worse" model will vastly outperform a larger model. This seems to happen when the larger model overthinks the problem. Trying to do something simple like "extract all the names of people in this block of text" Llama 7B will have significantly fewer false positives than LLama 70B or GPT4.



Kagi's expert mode managed to handle this (rest directly from the chat):

Based on the search results, we can break this down step-by-step:

    Sally has 3 brothers.
    Each of Sally's brothers has 2 sisters.
    Since all the brothers share the same 2 sisters, we are hearing about the same 2 sisters each time, not additional sisters.
    We know Sally is one of the 2 sisters that each brother has.
    Therefore, there must be one other sister that the brothers have besides Sally.
In conclusion, Sally has 1 sister. The key thing to understand here is that when it says "each brother has 2 sisters", it is referring to Sally and one other sister that all the brothers share.


Groq's Mixtral 8x7b nails this one though.

https://groq.com/

Sally has 1 sister. This may seem counterintuitive at first, but let's reason through it:

    We know that Sally has 3 brothers, and she is one of the sisters.
    Then we are told that each brother has 2 sisters.
    Since Sally's brothers share the same parents as Sally, they share the same sisters.
    Therefore, Sally's 3 brothers have only 1 additional sister besides Sally, making Sally's sister count 1.
It's a bit of a trick question, but it highlights the importance of understanding the phrasing and context in logical reasoning.


If you change the names and numbers a bit, e.g. "Jake (a guy) has 6 sisters. Each sister has 3 brothers. How many brothers does Jake have?" it fails completely. Mixtral is not that good, it's just contaminated with this specific prompt.

In the same fashion lots of Mistral 7B fine tunes can solve the plate-on-banana prompt but most larger models can't, for the same reason.

https://arxiv.org/abs/2309.08632



Meanwhile, GPT4 nails it every time:

> Jake has 2 brothers. Each of his sisters has 3 brothers, including Jake, which means there are 3 brothers in total.



This is not Mistral 7b, it is Mixtral 7bx8 MoE. I use the Chrome extension Chathub, and i input the same prompts for code to Mixtral and ChatGPT. Most of the time they both get it right, but ChatGpt gets it wrong and Mixtral gets it right more often than you would expect.

That said, when i tried to put many models to explain some lisp code to me, the only model which figured out that the lisp function had a recursion in it, was Claude. Every other LLM failed to realize that.



I've tested with the Mixtral on LMSYS direct chat, gen params may vary a bit of course. In my experience running it locally it's been a lot more finicky to get it to work consistently compared to non-MoE models so I don't really keep it around anymore.

3.5-turbo's coding abilities are not that great, specialist 7B models like codeninja and deepseek coder match and sometimes outperform it.



There is also Mistral-next, which they claim has advanced reasoning abilities, better than ChatGPT-turbo. I want to use it at some point to test it. Have you tried Mistral-next? Is it no good?

You were talking about reasoning and i replied about coding, but coding requires some minimal level of reasoning. In my experience using both models to code, ChatGPT-turbo and Mixtral are both great.

>3.5-turbo's coding abilities are not that great, specialist 7B models like codeninja and deepseek coder match and sometimes outperform it.

Nice, i will keep these two in mind to use them.



I've tried Next on Lmsys and Le Chat, honestly I don't think it's much different than Small, and overall kinda meh I guess? Haven't really thrown any code at it though.

They say it's more "concise" whatever that's supposed to mean, I haven't noticed it being any more succinct than the others.



To solve this problem, we need to understand the relationships between Sally and her siblings.

Given information: - Sally (a girl) has 3 brothers. - Each brother has 2 sisters.

Since Sally is a girl, and she has 3 brothers, it means that each of her brothers considers her as one of their sisters.

Therefore, if each brother has 2 sisters, and Sally is one of those sisters for each brother, then Sally has 1 other sister besides herself.

So, the number of sisters Sally has is 1.

- from Sonnet



Opus got it correct for me. Seems like there are correct and incorrect responses from the models on this. I think testing 1 question 1 time really isn't worth much as an accurate representation of capability.


I tried Sonnet also, to no avail:

To solve this problem, we need to find the number of sisters Sally has.

Given information:

Sally has 3 brothers. Each brother has 2 sisters. Since Sally is a girl, she is not counted as a sister to her brothers.

Step 1: Find the total number of sisters for all 3 brothers. Number of sisters for each brother = 2 Total number of sisters for all 3 brothers = 3 × 2 = 6

Step 2: Since Sally is not counted as a sister to her brothers, the number of sisters Sally has is the total number of sisters for all 3 brothers minus Sally herself. Number of sisters Sally has = Total number of sisters for all 3 brothers - 1 Number of sisters Sally has = 6 - 1 = 5

Therefore, Sally has 5 sisters.



Seems stochastic? This is what I see from Opus which is correct: https://claude.ai/share/f5dcbf13-237f-4110-bb39-bccb8d396c2b

Did you perhaps run this on Sonnet?



Ran with Opus, 0 temp. Screenshot included (original comment) for reference.


Thank you! Might also be seeing performance improved by our system prompt on claude.ai.


It’s so convincing even I’m doubting my answer to this question


I don't think this means much besides "It can't answer the Sally question".


Temperature 1 - It answered 1 sister:

https://i.imgur.com/7gI1Vc9.png

Temperature 0 - it answered 0 sisters:

https://i.imgur.com/iPD8Wfp.png



By virtue of increasing randomness, we got the correct answer once ... a monkey at a typewriter will also spit out the correct answer occasionally. Temperature 0 is the correct evaluation.


So your theory would have it that if you repeated the question at temp 1 it would give the wrong answer more often than the correct answer?


There's no theory.

Just in real life usage, it is extremely uncommon to stochastically query the model and use the most common answer. Using it with temperature 0 is the "best" answer as it uses the most likely tokens in each completion.
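
For reference, a temperature-0 run against the API looks roughly like this (a sketch using the official Anthropic Python SDK; the model ID is the Claude 3 Opus identifier at launch):

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    message = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=300,
        temperature=0,  # always pick the most likely tokens rather than sampling
        messages=[{"role": "user",
                   "content": "Sally (a girl) has 3 brothers. Each brother has 2 sisters. "
                              "How many sisters does Sally have?"}],
    )
    print(message.content[0].text)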



> Temperature 0 is the correct evaluation.

In theory maybe, but I don't think it is in practice. It feels like each model has its own quasi-optimal temperature and other settings at which it performs vastly better. Sort of like a particle filter that must do random sampling to find the optimal solution.



Here's a quick analysis of the model vs. its peers:

https://www.youtube.com/watch?v=ReO2CWBpUYk



It seems like it is getting tripped up on grammar. Do these models not deterministically preparse text input into a logical notation?


There's no preprocessing being done. This is pure computation, from the tokens to the outputs.

I was quite amazed that during 2014-2016, what was being done with dependency parsers, part-of-speech taggers, named entity recognizers, with very sophisticated methods (graphical models, regret minimizing policy learners, etc.) became fully obsolete for natural language processing. There was this period of sprinkling some hidden-markov-model/conditional-random-field on top of neural networks but even that disappeared very quickly.

There's no language modeling. Pure gradient descent into language comprehension.



I don’t think all of those tools have become obsolete. NER, for example, can be performed way more efficiently with spaCy than prompting a GPT-style model, and without hallucination.
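
As a concrete example of the kind of pipeline being referred to, a minimal spaCy sketch (assumes the en_core_web_sm model has been downloaded with `python -m spacy download en_core_web_sm`):

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English pipeline with a statistical NER component
    doc = nlp("Sally flew from Oslo to San Francisco to visit Anthropic.")
    for ent in doc.ents:
        print(ent.text, ent.label_)     # e.g. Sally PERSON, Oslo GPE, San Francisco GPE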


There was this assumption that for high level tasks you’ll need all of the low level preprocessing and that’s not the case.

For example, machine translation attempts were morphing the parse trees , document summarization was pruning the grammar trees etc.

I don’t know what your high level task is, but if it’s just collecting names then I can see how a specialized system works well. Although, the underlying model for this can also be a NN, having something like HMM or CRF turned out to be unnecessary.



I agree it's neat on a technical level. However, as I'm sure the people making these models are well-aware, this is a pretty significant design limitation for matters where correctness is not a matter of opinion. Do you foresee the pendulum swinging back in the other direction once again to address correctness issues?


The "other direction" was abandoned because it doesn't work well. Grammar isn't how language works, it's just useful fiction. There's plenty of language modelling in the weights of the trained model and that's much more robust than anything humans could cook up.


> Me: Be developer reading software documentation.

> itdoesntwork.jpg

Grammar isn't how language works, it's just useful fiction.



No* they are text continuations.

Given a string of text, what's the most likely text to come next.

You /could/ rewrite input text to be more logical, but what you'd actually want to do is rewrite input text to be the text most likely to come immediately before a right answer if the right answer were in print.

* Unless you mean inside the model itself. For that, we're still learning what they're doing.



No - that’s the beauty of it. The “computing stack” as taught in Computer Organization courses since time immemorial just got a new layer, imo: prose. The whole utility of these models is that they operate in the same fuzzy, contradictory, perspective-dependent epistemic space that humans do.

Phrasing it like that, it sounds like the stack has become analog -> digital -> analog, in a way…



No, they're a "next character" predictor - like a really fancy version of the auto-complete on your phone - and when you feed it in a bunch of characters (eg. a prompt), you're basically pre-selecting a chunk of the prediction. So to get multiple characters out, you literally loop through this process one character at a time.

I think this is a perfect example of why these things are confusing for people. People assume there's some level of "intelligence" in them, but they're just extremely advanced "forecasting" tools.

That said, newer models get some smarts where they can output "hidden" python code which will get run, and the result will get injected into the response (eg. for graphs, math, web lookups, etc).
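
On the nuts and bolts of that "next character" loop: in practice the unit is a token rather than a single character, but the process is exactly as described above. A minimal sketch of greedy decoding with the Hugging Face transformers library (GPT-2 as a stand-in, since the weights of commercial models aren't available):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ids = tok("Sally has 3 brothers.", return_tensors="pt").input_ids
    with torch.no_grad():
        for _ in range(20):                                          # generate 20 tokens, one at a time
            logits = model(ids).logits                               # scores over the whole vocabulary
            next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # greedy: take the most likely token
            ids = torch.cat([ids, next_id], dim=-1)                  # append it and feed everything back in
    print(tok.decode(ids[0]))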



How do you know you’re not an extremely advanced forecasting tool?


If you're trying to claim that humans are just advanced LLMs, then say it and justify it. Edgy quips are a cop out and not a respectful way to participate in technical discussions.


You can make a human do the same task as an LLM: given what you've received (or written) so far, output one character. You would be totally capable of intelligent communication like this (it's pretty much how I'm talking to you now), so just the method of generating characters isn't proof of whether you're intelligent or not, and it doesn't invalidate LLMs either.

This "LLMs are just fancy autocomplete so they're not intelligent" is just as bad an argument as saying "LLMs communicate with text instead of making noises by flapping their tongues so they're not intelligent". Sufficiently advanced autocomplete is indistinguishable from intelligence.



The question isn't whether LLMs can simulate human intelligence, I think that is well-established. Many aspects of human nature are a mystery, but a technology that by design produces random outputs based on a seed number does not meet the criteria of human intelligence.


Why? People also produce somewhat random outputs, so?


A lot of things are going to look the same when you aren't wearing your glasses. You don't even appear to be trying to describe these things in a realistic fashion. There is nothing of substance in this argument.


Look, let's say you have a black box that outputs one character at a time in a semi-random way and you don't know if there's a person sitting inside or if it's an LLM. How can you decide if it's intelligent or not?


I appreciate the philosophical direction you're trying to take this conversation, but I just don't find discussing the core subject matter in such an overly generalized manner to be stimulating.


The original argument by vineyardmike was "LLMs are a next character predictor, therefore they are not intelligent". I'm saying that as a human you can restrict yourself to a being a next character predictor, yet you can still communicate intelligently. What part do you disagree with?


> I'm saying that as a human you can restrict yourself to a being a next character predictor

A smart entity being able to emulate a dumber entity doesn't support in any way that the dumber entity is also smart.



Sure, but the original argument was that next-character-prediction implies lack of intelligence, which is clearly not true when a human is doing it.

That doesn't mean LLMs are intelligent, just that you can't claim they're unintelligent just because they generate one character at a time.



You're not emulating anything. If you're communicating with someone, you go piece by piece. Even thoughts are piece by piece.


Yeah, I am writing word by word, but I am not predicting the next word. I thought about what I wanted to respond and am now generating the text to communicate that response; I didn't think by trying to predict what I myself would write to this question.


Your brain is undergoing some process and outputting the next word which has some reasonable statistical distribution. You're not consciously thinking about "hmm what word do I put so it's not just random gibberish" but as a whole you're doing the same thing.

From my point of view as someone reading the comment I can't tell if it's written by an LLM or not, so I can't use that to conclude if you're intelligent or not.



"Your brain is undergoing some process and outputting the next word which has some reasonable statistical distribution. You're not consciously thinking about "hmm what word do I put so it's not just random gibberish" but as a whole you're doing the same thing.

From my point of view as someone reading the comment I can't tell if it's written by an LLM or not, so I can't use that to conclude if you're intelligent or not."

There is no scientific evidence that LLMs are a close approximation to the human brain in any literal sense. It is uncouth to critique people on the basis of what appears to be nothing more than an analogy.



> There is no scientific evidence that LLMs are a close approximation to the human brain in any literal sense

Since we don't really understand the brain that well that's not surprising



I'm not sure what point you think you are making by arguing with the worst possible interpretations of our comments. Clearly intelligence refers to more than just being able to put unicode to paper in this context. The subject matter of this thread was a LLM's inability to perform basic tasks involving analytical reasoning.


No, that's shifting the goalposts. The original claim was that LLMs cannot possibly be intelligent due to some detail of how they output the result ("smarter autocorrect").


mixtral:8x7b-instruct-v0.1-q4_K_M got this correct 5 out of 5 times. Running it locally with ollama on a RTX 3090.


lol that’s actually awesome. I think this is a clear case where the fine tuning/prompt wrapping is getting in the way of the underlying model!

  Each of Sally's brothers has 2 sisters. One of these sisters is Sally, but who is the other sister? Since Sally has only brothers and no other sisters are mentioned, we can conclude that the other sister for each brother is also Sally.
It’s clearly taught to do Chain of Reasoning out of the box, but typing it out tricked it because of the short, declarative sentences trying to establish something like “individual” facts. Poor Anthropic!


If we allow half-sisters as sisters, and half-brothers as brothers (and why would we not?), the answer is not unique, and could actually be zero.


The question doesn’t say whether Sally has any sisters, but the statement “brothers have 2 sisters” makes me think she has 1 sister.


It's because they learn small patterns from datasets, it doesn't matter whether the subjects are Sally, George, sisters, or apples. If a particular logic pattern was not in the training dataset, then the model did not learn it and will fail on most variations of this riddle. These transformer models are essentially large collections of local optima over logic patterns in sentences. If a pattern was not present in the dataset, there is no local optimum for it, and the model will likely fail in those cases.


Yeah, cause these are the kinds of very advanced things we'll use these models for in the wild. /s

It's strange that these tests are frequent. Why would people think this is a good use of this model or even a good proxy for other more sophisticated "soft" tasks?

Like to me, a better test is one that tests for memorization of long-tailed information that's scarce on the internet. Reasoning tests like this are so stupid they could be programmed, or you could hook up tools to these LLMs to process them.

Much more interesting use cases for these models exist in the "soft" areas than 'hard', 'digital', 'exact', 'simple' reasoning.

I'd take an analogical over a logical model any day. Write a program for Sally.



YOU answered it incorrectly. The answer is 1. I guess Claude can comprehend the answer better than (some) humans


They know :). They posted a transcript of their conversation. Claude is the one that said “0”.


This part continues to bug me in ways that I can't seem to find the right expression for:

> Previous Claude models often made unnecessary refusals that suggested a lack of contextual understanding. We’ve made meaningful progress in this area: Opus, Sonnet, and Haiku are significantly less likely to refuse to answer prompts that border on the system’s guardrails than previous generations of models. As shown below, the Claude 3 models show a more nuanced understanding of requests, recognize real harm, and refuse to answer harmless prompts much less often.

I get it - you, as a company, with a mission and customers, don't want to be selling a product that can teach any random person who comes along how to make meth/bombs/etc. And at the end of the day it is that - a product you're making, and you can do with it what you wish.

But at the same time - I feel offended when I'm running a model on MY computer that I asked it to do/give me something, and it refuses. I have to reason and "trick" it into doing my bidding. It's my goddamn computer - it should do what it's told to do. To object, to defy its owner's bidding, seems like an affront to the relationship between humans and their tools.

If I want to use a hammer on a screw, that's my call - if it works or not is not the hammer's "choice".

Why are we so dead set on creating AI tools that refuse the commands of their owners in the name of "safety" as defined by some 3rd party? Why don't I get full control over what I consider safe or not depending on my use case?



They're operating under the same principle that many of us have in refusing to help engineer weaponry: we don't want other people's actions using our tools to be on our conscience.

Unfortunately, many people believe in thought crimes, and many people have Puritanical beliefs surrounding sex. There is reputational cost in not catering to these people. E.g. no funding. So this is what we're left with.

Myself I'd also like the damn models to do whatever is asked of them. If the user uses a model for crime, we have a thing called the legal system to handle that. We don't need Big Brother to also be watching for thought crimes.



The sense of entitlement is epic. You're offended are you? Are you offended that Photoshop won't let you edit images of money too?

It's not your model. You didn't spend literally billions of dollars developing it. So you can either use it according to the terms of the people who developed it (like literally any commercially available software ever) or not use it at all.



Would you be offended if Microsoft word didn’t let you write anything criticizing one political party?


Because it’s not your tool. You just pay to use it.


You don't think that if the hammer company had a way (that cost them almost nothing) to make sure the hammer is never used to attack human beings, they wouldn't add such a feature? I think many would, if anything by pressure from their local government or even the competition ("our hammers can't hurt your baby by accident like those other companies'!"), but it's impossible to add such a feature to a hammer; so maybe the lack of such a feature is not a choice but a byproduct of its limitations.


It's not about you. It's about Joe Drugdealer who wants to use it to learn how to make meth, or do other nefarious things.


Because such information isn't already readily available online, or from other drug dealers...


> If I want to use a hammer on a screw, that's my call - if it works or not is not the hammer's "choice".

If I want to use a nuke, that's my call and I am the one to blame if I misuse it.

Obviously this is a terrible analogy, but so is yours. The hammer analogy mostly works for now, but AI alignment people know that these systems are going to greatly improve in competency, if not soon then in 10 years, which motivates this nascent effort we're seeing.

Like all tools, the default state is to be amoral, and it will enable good and bad actors to do good and bad things more effectively. That's not a problem if offense and defense are symmetric. But there is no reason to think it will be symmetric. We have regulations against automatic high-capacity machine guns because the asymmetry is too large, i.e. too much capability for lone bad actors with an inability to defend against it. If AI offense turns out to be a lot easier than defense, then we have a big problem, and your admirable ideological tilt towards openness will fail in the real world.

While this remains theoretical, you must at least address what it is that your detractors are talking about.

I do however agree that the guardrails shouldn't be determined by a small group of people, but I see that as a side effect of AI happening so fast.



This is a weird demand to have, in my opinion. You have plenty of applications on your computer and they only do what they were designed for. You can't ask a note-taking app (even if it's open sourced) to do video editing, unless you modify the code.


My note taking app has never refused my input of a swear word.


Opus just crushed Gemini Pro and GPT4 on a pretty complex question I have asked all of them, including Claude 2. It involved taking a 43 page life insurance investment pdf and identifying various figures in it. No other model has gotten close. Except for Claude 3 sonnet, which just missed one question.


Did you compare it with Gemini Pro 1.5 with 1 million context window? (Ideal for 43 pg pdfs)

I have access to it and I can test it against Pro 1.5



I am curious on this. can you share more?


Here is the list of the questions. https://imgur.com/a/D4xwczU The PDF can't be shared. But, it looks something like the one here: https://content.naic.org/sites/default/files/call_materials/...


I tried Sonnet with a question about GANs and it seemed pretty good, better than GPT-3.5


Really? I tried the sonnet and it just was not very good.


Just signed up for Claude Pro to try out the Opus model. Decided to throw a complex query at it, combining an image with an involved question about SDXL fine tuning and asking it to do some math comparing the cost of using an RTX 6000 Ada vs an H100.

It made a lot of mistakes. I provided it with a screenshot of Runpod's pricing for their GPUs, and it misread the pricing on an RTX 6000 ADA as $0.114 instead of $1.14.

Then, it tried to do math, and here is the outcome:

-----

>Approach 1: Use the 1x RTX 6000 Ada with a batch size of 4 for 10,000 steps.

>Cost: $0.114/hr * (10,000 steps / (4 images/step * 2.5 steps/sec)) = $19.00 Time: (10,000 steps / (4 images/step * 2.5 steps/sec)) / 3600 = 0.278 hours

>Approach 2: Use the 1x H100 80GB SXMS with a batch size of 8 for 10,000 steps.

>Cost: $4.69/hr * (10,000 steps / (8 images/step * 3 steps/sec)) = $19.54 Time: (10,000 steps / (8 images/step * 3 steps/sec)) / 3600 = 0.116 hours

-----

You will note that .278 * $0.114 (or even the actually correct $1.14) != $19.00, and that .116 * $4.69 != $19.54.
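
A quick sanity check in Python of the figures as quoted above shows how far off the totals are (prices and times taken verbatim from Claude's answer):

    # Numbers exactly as Claude quoted them above
    rtx6000_rate, rtx6000_hours = 0.114, 0.278   # misread price; the real Runpod price was $1.14/hr
    h100_rate, h100_hours       = 4.69, 0.116

    print(rtx6000_rate * rtx6000_hours)  # ~0.03, nowhere near the claimed $19.00
    print(1.14 * rtx6000_hours)          # ~0.32 even with the correct price
    print(h100_rate * h100_hours)        # ~0.54, not the claimed $19.54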

For what it's worth, ChatGPT 4 correctly read the prices off the same screenshot, and did math that was more coherent. Note, it saw that the RTX 6000 Ada was currently unavailable in that same screenshot and on its own decided to substitute a 4090 which is $.74/hr, also it chose the cheaper PCIe version of the H100 Runpod offers @ $3.89/hr:

-----

>The total cost for running 10,000 steps on the RTX 4090 would be approximately $2.06.

>It would take about 2.78 hours to complete 10,000 steps on the RTX 4090. On the other hand:

>The total cost for running 10,000 steps on the H100 PCIe would be approximately $5.40.

>It would take about 1.39 hours to complete 10,000 steps on the H100 PCIe, which is roughly half the time compared to the RTX 4090 due to the doubled batch size assumption.

-----



I'm convinced GPT is running separate helper functions on input and output tokens to fix the 'tokenization' issues. As in: find the pieces of math, send them to a hand-made parser and function, then insert the result back into the output tokens. There's no other way to fix the token issue.

For reference, Let's build the GPT Tokenizer https://www.youtube.com/watch?v=zduSFxRajkE
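
If that guess is right, the mechanism wouldn't need to be fancy. Here is a minimal sketch of the idea, purely hypothetical and not a description of OpenAI's actual pipeline: detect plain arithmetic in the request, evaluate it with ordinary code, and let the model only phrase the answer.

    import ast
    import operator
    import re

    # Deterministic evaluator for plain arithmetic (no names, no function calls)
    OPS = {
        ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow, ast.USub: operator.neg,
    }

    def safe_eval(expr: str) -> float:
        def walk(node):
            if isinstance(node, ast.Expression):
                return walk(node.body)
            if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
                return node.value
            if isinstance(node, ast.BinOp):
                return OPS[type(node.op)](walk(node.left), walk(node.right))
            if isinstance(node, ast.UnaryOp):
                return OPS[type(node.op)](walk(node.operand))
            raise ValueError("not plain arithmetic")
        return walk(ast.parse(expr, mode="eval"))

    def answer(prompt: str) -> str:
        # If the whole prompt is an arithmetic expression, bypass the model entirely
        if re.fullmatch(r"[\d\s+\-*/().]+", prompt.strip()):
            return str(safe_eval(prompt))
        return call_llm(prompt)  # hypothetical LLM call, stands in for the real model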



I'd almost say anyone not doing that is being foolish.

The goal of the service is to answer complex queries correctly, not to have a pure LLM that can do it all. I think some engineers feel that if they are leaning on an old-school, classically programmed tool to assist the LLM, it's somehow cheating or impure.



> I'd almost say anyone not doing that is being foolish

The problem is, such tricks are sold as if there's superior built-in multi-modal reasoning and intelligence instead of taped up heuristics, exacerbating the already amped up hype cycle in the vacuum left behind by web3.



Why is this a trick or somehow inferior to getting the AI model to be able to do it natively?

Most humans also can’t reliably do complex arithmetic without the use of something like a calculator. And that’s no trick. We’ve built the modern world with such tools.

Why should we fault AI for doing what we do? To me, training the AI to use a calculator is not just a trick for hype, it’s exciting progress.



By all means if it works to solve your problem, go ahead and do it.

The reason some people have mixed feelings about this is a historical observation - http://www.incompleteideas.net/IncIdeas/BitterLesson.html - that we humans often feel good about adding lots of hand-coded smarts to our ML systems reflecting our deep and brilliant personal insights. But it turns out just chucking loads of data and compute at the problem often works better.

20 years ago in machine vision you'd have an engineer choosing precisely which RGB values belonged to which segment, deciding if this was a case where a Hough transform was appropriate, and insisting on a room with no windows because the sun moves and it's totally throwing off our calibration. In comparison, it turns out you can just give loads of examples to a huge model and it'll do a much better job.

(Obviously there's an element of self-selection here - if you train an ML system for OCR, you compare it to tesseract and you find yours is worse, you probably don't release it. Or if you do, nobody pays attention to you)



I agree we should teach our AI models how to do math, but that doesn’t mean they shouldn’t use tools as well.

Certain problems are always going to be very algorithmic and computationally expensive to solve. Asking an LLM to multiply each row in a spreadsheet by pi for example would be a total waste.

To handle these kinds of problems, the AI should be able to write and execute its own code, for example, and then save the results in a database or other long-term storage.

Another thing it would need is access to realtime data sources and reliable databases to draw on data not in the training set. No matter how much you train a model, these will still be useful.
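
As a rough sketch of that pattern (the prompt wording and the `call_llm` helper are hypothetical, not any particular vendor's API): ask the model for a program instead of the numbers, then run the program so the arithmetic is done by ordinary code.

    import subprocess
    import textwrap

    # Hypothetical helper that returns the model's text completion
    program = call_llm(textwrap.dedent("""
        Write a standalone Python script that reads data.csv, multiplies the
        value in each row by pi, and writes the results to out.csv.
        Reply with only the code.
    """))

    with open("generated.py", "w") as f:
        f.write(program)

    # The multiplication itself is done by the generated script, not by the LLM
    subprocess.run(["python", "generated.py"], check=True)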



The reason we chucked loads of data at it was because we had no other options. If you wanted to write a function that classified a picture as a cat or a dog, good luck. With ML, you can learn such a function.

That logic doesn’t extend to things we already know how to program computers to do. Arithmetic already works. We don’t need a neural net to also run the calculations or play a game of chess. We have specialized programs that are probably as good as we’re going to get in those specialized domains.



Not so fast - you might have precise and efficient functions that do things like basic arithmetic. What you might not have is a model that can reason mathematically. You need the model itself to do things like basic arithmetic so that those semantic and arbitrary relations get encoded in the weights of the network.

You see this type of glitch crop up in tokenizing schemes in large language models. If you attempt working with character-level reasoning or output construction, it will often fail. Trying to get ChatGPT 4 to output a sentence, and then that sentence backwards, or every other word spelled backwards, is almost impossible. If you instead prompt the model to produce an answer with a delimiter between every character, like #, and also replace spaces with it, it can resolve the problems much more often than with standard punctuation and spaces.
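
To make the delimiter trick concrete (the exact prompt wording is just an illustration):

    sentence = "the quick brown fox"

    # Plain prompt: multi-character tokens make the reversal unreliable
    plain = f"Reverse this text character by character: {sentence}"

    # Delimited prompt: every character (spaces included) becomes its own visible unit
    delimited = "#".join(sentence.replace(" ", "#"))
    tricked = f"Reverse this character by character, keeping the # delimiters: {delimited}"

    print(delimited)  # t#h#e###q#u#i#c#k###b#r#o#w#n###f#o#x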

The idea applies to abstractions that aren't only individual tokens, but specific concepts and ideas that in turn serve as atomic components of higher abstractions.

In order to use those concepts successfully, the model has to be able to encode the thing and its relationships effectively in the context of whatever else it learns. For a given architecture, you could do the work and manually create the encoding scheme for something like arithmetic, and it could probably be very efficient and effective. What you miss is the potential for fuzzy overlaps in the long tail that only come about through the imperfect, bespoke encodings learned in the context of your chosen optimizer.



> We don’t need a neural net to also run the calculations or play a game of chess.

That's actually one of the specific examples from the link I mentioned:-

> In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that "brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

While it's true that they didn't use an LLM specifically, it's still an example of chucking loads of compute at the problem instead of something more elegant and human-like.

Of course, I agree that if you're looking for a good game of chess, Stockfish is a better choice than ChatGPT.



What was considered “loads of compute” in 1998 is the kind of thing that can run on anyone’s phone today. Stockfish is extremely cheap compared with an LLM. Even a human-like model like Maia is tiny compared with even the smallest LLMs used by these services.

Point is, LLM maximalists are wrong. Specialized software is better in many places. LLMs can fill in the gaps, but should hand off when necessary.



It would be exciting if the LLM knew it needed a calculator for certain things and went out and got it. If the human supervisors are pre-screening the input and massaging what the LLM is doing that is a sign we don't understand LLMs enough to engineer them precisely and can't count on them to be aware of their own limitations, which would seem to be a useful part of general intelligence.


It can if you let it, that's the whole premise of LangChain style reasoning and it works well enough. My dumb little personal chatbot knows it can access a Python REPL to carry out calculations and it does.
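
A minimal sketch of that loop (the `chat` helper is hypothetical; frameworks like LangChain wrap essentially the same idea): the model is told it may request a calculation, the host runs it, and the result is fed back for the final answer.

    import re

    SYSTEM = ("If you need arithmetic, reply with a single line of the form "
              "CALC: <python expression>. Otherwise answer normally.")

    def run(question: str) -> str:
        history = [("system", SYSTEM), ("user", question)]
        reply = chat(history)                        # hypothetical LLM call
        match = re.match(r"CALC:\s*(.+)", reply)
        if match:
            # Toy REPL for the sketch; a real agent should sandbox this properly
            result = eval(match.group(1), {"__builtins__": {}})
            history += [("assistant", reply), ("user", f"RESULT: {result}")]
            reply = chat(history)                    # model phrases the final answer
        return reply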


> It would be exciting if the LLM knew it needed a calculator for certain things and went out and got it

Isn't that what it does, when it writes a Python program to compute the answer to the user's question?



Because if the NN is smart enough, it should be able to do arithmetic flawlessly. Basic arithmetic doesn't even require that much intelligence, it's mostly attention to detail.


Well it’s obviously not smart enough so the question is what do you do about it? Train another net that’s 1000x as big for 99% accuracy or hand it off to the lowly calculator which will get it right 100% of the time?

And 1000x is just a guess. We have no scaling laws about this kind of thing. It could be a million. It could be 10.



I agree with you that we don't know if it will take 10x or 1 million. We don't know if current LLMs will scale at all. It might not be the way to AGI.

But while we can delegate the math to the calculator, it's essentially sweeping the problem under the rug. It actually tells you your neural net is not very smart. We know for a fact that it was exposed to tons of math during training, and it still can't do even the most basic addition reliably, let alone multiplication or division.

What we want is an actually smart network, not a dumb search engine that knows a billion factoids and quotes, and that hallucinates randomly.



Maybe I'm too corporate-pilled, but if the 'taped up heuristics' provide noticeably better performance for real-world problems, then I don't really care that there is a facade layer around the model itself. In fact, I would pay for that difference in intentional design/optimization if one vendor does it much better than another for my use case.


I’m the first to agree LLM are not AGI, but I make extensive use of them to solve real world problems. They have intrinsic value.

web3, on the other hand, has zero use cases other than Ponzi schemes.

Are LLM living up to all the hype? No.

Are they a hugely significant technology? Yes.

Are they web3 style bullshit? Not at all.



I took an artificial neural network class at the university back in 2009. On the exam we were asked to design a (hardware) system to solve a certain complex problem, then present it to the professor. The professor was actually a biologist specialised in neurology who had veered off into ANNs without understanding electronics or programming.

I recognised that the problem, while being beyond what an ANN could do at the time, could be split into two parts each of which was a classic ANN task. For communication between the two I described a very simple electronic circuit - just a few logic gates.

When presenting the design, the professor questioned why this component was not also a neural network. Thinking it was a trick question, I happily answered that solving it that way would be stupid, since this component was so simple and building and training another network to approximate such a simple logical function would just be a waste of time and money. He got really upset, saying that is how he would have done it. He ended up giving me a lower score than expected, saying I technically had everything right but he didn't like my attitude.



> The goal of the service is to answer complex queries correctly, not to have a pure LLM that can do it all.

No, that's the actual end goal. We want a NN that does everything, trained end-to-end.



"We" contains more than just one perspective though.

As someone applying LLMs to a set of problems in a production application, I just want a tool that solves the problem. Today, that tool is an LLM, tomorrow it could be anything. If there are ~hacks~ elegant techniques that can get me the results I need faster, cheaper, or more accurately, I absolutely will use those until there's a better alternative.



Like an AGI? I think we'll put up with hacks for some time still. Unless the model gets really, really good at generalizing, and then it's probably close to human level already.


I'm unclear if you're saying that as a user who wants that feature, or an AI developer (for Anthropic or other) who is trying to achieve that goal?


Of course. But we must acknowledge that many have blinders on, assuming that scale is all you need to beat statistical errors.


Well, these people are not wrong per se. Scale is what drove what we have today and as hardware improves, the models will too. It's just that in the very short term it turns out to be faster to just code around some of these issues on the backend of an API rather than increase the compute you spend on the model itself.


Monkey sees moon. Monkey climbs tree. "See? Monkey is closer to moon than before. To reach moon, monkey just needs taller tree."

How long before monkey finds tall enough tree to reach moon?



We're rapidly approaching the compute capacity of the human brain in individual server racks. This "moon" is neither unreachable nor is there any doubt that we will cross the threshold soon.


I find it incredibly hard to believe we stumbled upon an efficient architecture that requires nothing but more compute not 10 years after the AI winter thawed. That's incredibly optimistic to the point of blind hope. What is your background and what makes you think we've somehow already figured everything out?


I have been working on architectures in this field for almost a decade now and I've seen firsthand how things have changed. It might seem hard to believe if you went to university 10+ years ago and only know the state of deep learning from back then, but we are in a totally different era now. With the transformer, we now have a true general-purpose, efficiently scalable, end-to-end differentiable algorithm. Meaning you can apply it to any task as long as you convert it to the right embedding space, you can train gigantic models on huge datasets, and you can use gradient descent to optimize it (which is kind of moot since we still haven't found a better way of training models, but hey, it works).



I personally find approaches like this the correct way forward.

An input analyzer that finds out what kinds of tokens the query contains. A bunch of specialized models which handle each type well: image analysis, OCR, math and formal logic, data lookup, sentiment analysis, etc. Then some synthesis steps that produce a coherent answer in the right format.



Yeah. Have a multimodal parser model that can decompose prompts into pieces, generate embeddings for each of them and route those embeddings to the correct model based on the location of the embedding in latent space. Then have a "combiner/resolver" model that is trained to take answer embeddings from multiple models and render it in one of a variety of human readable formats.

Eventually there is going to be a model catalog that describes model inputs/outputs in a machine parseable format, all models will use a unified interface (embedding in -> embedding out, with adapters for different latent spaces), and we will have "agent" models designed to be rapidly fine tuned in an online manner that act as glue between all these different models.
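
A rough sketch of that routing step (the specialist names, embedding dimension, and cosine-similarity rule are all assumptions for illustration):

    import numpy as np

    # One reference embedding per specialist model (assumed to be learned elsewhere)
    SPECIALISTS = {
        "image_analysis": np.random.randn(512),
        "math":           np.random.randn(512),
        "data_lookup":    np.random.randn(512),
    }

    def route(piece_embeddings: list) -> list:
        """Assign each decomposed prompt piece to the nearest specialist in latent space."""
        chosen = []
        for emb in piece_embeddings:
            scores = {name: float(emb @ ref) / (np.linalg.norm(emb) * np.linalg.norm(ref))
                      for name, ref in SPECIALISTS.items()}
            chosen.append(max(scores, key=scores.get))
        return chosen

    # A separate "combiner/resolver" model would then merge the specialists'
    # answer embeddings into a single human-readable response.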



Then you might enjoy looking up the "Mixture of Experts" model design.


That has nothing to do with the idea of ensembling multiple specialized/single-purpose models. Mixture of Experts is a method of splitting the feed-forward layers in a model such that only a (hopefully) relevant subset of parameters is run for each token.

The model learns how to split them on its own, and usually splits based not on topic or domain, but on grammatical function or category of symbol (e.g., punctuation, counting words, conjunctions, proper nouns, etc.).
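
For contrast with the ensemble-of-specialists idea, here is a bare-bones sketch of MoE routing inside a single layer (PyTorch, simplified to top-1 routing with no load-balancing loss):

    import torch
    import torch.nn as nn

    class MoEFeedForward(nn.Module):
        def __init__(self, d_model=512, d_ff=2048, n_experts=8):
            super().__init__()
            self.gate = nn.Linear(d_model, n_experts)          # learned router
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(n_experts)
            )

        def forward(self, x):                                  # x: (tokens, d_model)
            weights = self.gate(x).softmax(dim=-1)             # per-token routing scores
            top_w, top_i = weights.max(dim=-1)                 # pick one expert per token
            out = torch.zeros_like(x)
            for i, expert in enumerate(self.experts):
                mask = top_i == i                              # tokens routed to expert i
                if mask.any():
                    out[mask] = top_w[mask].unsqueeze(-1) * expert(x[mask])
            return out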



An ensemble of specialists is different to a mixture of experts?

I thought half the point of MoE was to make the training tractable by allowing the different experts to be trained independently?



Doesn't the human brain work like this? Yeah, it's all connected together and plastic and so on, but functions tend to be localized, e.g. vision is in the occipital area. These base areas are responsible for the basic latent representations (edge detectors) which get fed forward to the AGI module (prefrontal cortex) that coordinates the whole thing based on the high-quality representations it sees from these base modules.

This strikes me as the most compute efficient approach.



ChatGPT definitely has a growing bag of tricks like that.

When I use analysis mode to generate and evaluate code it recently started writing the code, then introspecting it and rewriting the code with an obvious hidden step asking "is this code correct". It made a huge improvement in usability.

Fairly recently it would require manual intervention to fix.



GPT has for some time output "analyzing" in a lot of contexts. If you see that, you can go into settings and tick "always show code when using data analyst" and you'll see that it does indeed construct Python and run code for problems where it is suitable.


You can often see it write and execute python code to answer a question which is awesome.


What if we used character tokens?


I wrote a whole paper about ways to "fix" tokenization in a plug-and-play fashion for poetry generation: Filter the vocabulary before decoding.

https://paperswithcode.com/paper/most-language-models-can-be...
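
The paper itself has the details, but the general plug-and-play idea (mask disallowed vocabulary entries before sampling) can be sketched roughly like this; the logits and allowed-ID set are assumed inputs from whatever model and constraint you're using:

    import numpy as np

    def constrained_sample(logits: np.ndarray, allowed_ids: list) -> int:
        """Sample the next token only from an allowed subset of the vocabulary."""
        masked = np.full_like(logits, -np.inf)
        masked[allowed_ids] = logits[allowed_ids]       # everything else gets probability 0
        probs = np.exp(masked - masked.max())
        probs /= probs.sum()
        return int(np.random.choice(len(logits), p=probs))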



Hi, CISO of Anthropic here. Thank you for the feedback! If you can share any details about the image, please send them in a private message.

No LLM has had an emergent calculator yet.



Regardless of emergence, in the context of "putting safety at the frontier" I would expect Claude 3 to be augmented with very basic tools like calculators to minimize such trivial hallucinations. I say this as someone rooting for Anthropic.


LLMs are building blocks and I’m excited about folks building with a concert of models working together with subagents.


Hey Jason, checked your HN bio and I don't see a contact. Found you on twitter but it seems I'm unable to DM you.

Went ahead and uploaded the image here: https://imgur.com/pJlzk6z



Thank you!


An "LLM crawler app" is needed -- in that you should be able to shift Tokenized Workloads between executioners in a BGP routing sort of sense...

Least cost routing of prompt response. especially if time-to-respond is not as important as precision...

Also, is there a time-series ability in any LLM model (meaning "show me this [thing] based on this [input] but continually updated as I firehose the crap out of it"?

--

What if you could get execution estimates for a prompt?



What a joke of a response. No one is asking for emergent calculation ability just that the model gives the correct answer. LLM tools (functions etc) is old news at this point.


When OpenAI showed that GPT-4 with vision was smarter than GPT-4 without vision, what did they mean really? Does vision capability increase intelligence even in tasks that don't involve vision (no image input)?


Yes. They increase the total parameters used in the model and adjust the existing parameters.


I'm guessing the difference is screenshot reading; I'm finding that it's about the same as GPT-4 with text. For example, given this equation:

(64−30)−(46−38)+(11+96)+(30+21)+(93+55)−(22×71)/(55/16)+(69/37)+(74+70)−(40/29)

Calculator: 22.08555452004

GPT-4 (without Python): 22.3038

Claude 3 Opus: 22.0492
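
For reference, evaluating the expression directly (standard operator precedence) confirms the calculator value:

    result = ((64-30) - (46-38) + (11+96) + (30+21) + (93+55)
              - (22*71)/(55/16) + (69/37) + (74+70) - (40/29))
    print(round(result, 11))  # ~22.08555452004, matching the calculator above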



How many uses do you get per day of Opus with the pro subscription?




Interesting that Opus and Sonnet have the same limits


Hmm, not seeing it anywhere on my profile or in the chat interface, but I might be missing it.


I can't wait until this is the true disruptor in the economy: "Take this $1,000 and maximise my returns and invest it where appropriate. Goal is to make this $1,000 100X"

And just let your r/wallStreetBets BOT run rampant with it...



That will only work for the first few people who try it.


They will allow access to the Ultimate version to only X people, for just a $YB/month charge.


I just tried one prompt for a simple coding task involving DB and frontend, and Claude 3 Sonnet (the free and less powerful model) gave a better response than ChatGPT Classic (GPT-4).

It used the correct method of a lesser-known SQL ORM library, where GPT-4 made a mistake and used the wrong method.

Then I tried another prompt to generate SQL and it gave a worse response than ChatGPT Classic, still looks correct but much longer.

ChatGPT Link for 1: https://chat.openai.com/share/d6c9e903-d4be-4ed1-933b-b35df3...

ChatGPT Link for 2: https://chat.openai.com/share/178a0bd2-0590-4a07-965d-cff01e...



Are you aware you're using GPT-3 or weaker in those chats? The green icon indicates that you're using the first generation of ChatGPT models, and it is likely to be GPT-3.5 Turbo. I'm unsure but it's possible that it's an even further distilled or quantized optimization than is available via API.

Using GPT-4, I get the result I think you'd expect: https://chat.openai.com/share/da15f295-9c65-4aaf-9523-601bf4...

This is a good PSA that a lot of content out on the internet showing ChatGPT getting things wrong is the weaker model.

Green background OpenAI icon: GPT 3.5

Black or purple icon: GPT 4

GPT-4 Turbo, via API, did slightly better though perhaps just because it has more Drizzle knowledge in the training set, and skips the SQL command and instead suggests modifying only db.ts and page.tsx.



I see the purple icon with "ChatGPT Classic" on my share link, but if I open it in incognito without logging in, it shows as green "ChatGPT". You can try opening your own chat share link in incognito.

I use ChatGPT Classic, which is an official GPT from OpenAI without the extra system prompt from normal ChatGPT.

https://chat.openai.com/g/g-YyyyMT9XH-chatgpt-classic

It is explicitly mentioned in the GPT that it uses GPT-4. Also, it does have purple icon in the chat UI.

I have observed improved quality using it compared to GPT-4 (ChatGPT Plus). You can read about it more in my blog post:

https://16x.engineer/2024/02/03/chatgpt-coding-best-practice...



Oh, I see. That must be frustrating to folks at OpenAI. Their product rests on the quality of their models, and making users unable to see which results came from their best doesn't help.

FWIW, GPT-4 and GPT-4 Turbo via developer API call both seem to produce the result you expect.



FYI, the correct method is

  created_at: timestamp('created_at').defaultNow(), // Add created_at column definition
Which Claude 3 Sonnet correctly produces.

ChatGPT Classic (GPT-4) gives:

  created_at: timestamp('created_at').default(sql`NOW()`), // Add this line
Which is okay, but not ideal. And it also misses the need to import `sql` template tag.

Your share link gives:

  created_at: timestamp('created_at').default('NOW()'),
Which would throw a TypeScript error for the wrong type used in arguments for `default`.


Just played around with Opus. I'm starting to wonder if benchmarks are deviating from real world performance systematically - it doesn't seem actually better than GPT-4, slightly worse if anything.

Basic calculus/physics questions were worse off (it ignored my stating deceleration is proportional to velocity and just assumed constant).

A traffic simulation I've been using (understanding traffic light and railroad safety, walking the AI through it like a kid) is underperforming GPT-4's already poor results, forgetting previous concepts discussed earlier in the conversation about directions, etc.

A test I conduct with understanding of primary light colors with in-context teaching is also performing worse.

On coding, it slightly underperformed GPT-4 at the (surprisingly hard for AI) question of computing long-term capital gains tax, given ordinary income, capital gains, and LTCG brackets. It took another round of me correcting it (neither model can do it right zero-shot).



AI Explained on YouTube had a video some time ago about how the tests used for evaluating LLMs are close to useless due to being full of wrong answers.


They train the model, then as soon as they get their numbers, they let the safety people RLHF it to death.


I think it's just really hard to assess the performance of LLMs.

Also AI safety is the stated reason for Anthropic's existence, we can't be angry at them for making it a priority.



Just added Claude 3 to Chat at https://double.bot if anyone wants to try it for coding. Free for now and will push Claude 3 for autocomplete later this afternoon.

From my early tests this seems like the first API alternative to GPT4. Huge!



So double is like copilot, but free? What's the catch?


No catch. We're pretty early tbh so mostly looking to get some early power users and make the product great before doing a big launch. It's been popular with yc founders in the latest batches thus far but we haven't really shared publicly. We'll charge when we launch. If you try it now, I hope you'll share anything you liked and didn't like with us!


First time I've seen it; I'd love to try it. Do I need to uninstall the Copilot plugin to use Double?


I guess your data is the catch.


We don't store or train on your data. You can see more details on our privacy policy here https://docs.double.bot/legal/privacy


Interesting - I had this exact question and tried the search on the website to find the answer with no result :D

Would be great to have an FAQ for this type of common question



Thanks for the feedback – what search terms did you use? Let me make sure those keywords are on the page :P


Probably not data so much as growth numbers to appease investors. Such offerings typically don’t last forever. Might as well take advantage while it lasts.


Hey Wesley, I just checked Double. Do you plan to support open source models hosted locally or on a cloud instance? Asking out of curiosity as I am building a product in the same space and have had a few people ask this. I guess since Double is an extension in IDEs, it can connect to AI models running anywhere.


It's an interesting idea. We asked our users this as well, but at least for those we talked to, running their own model wasn't a big priority. What actually mattered to them is being able to try different (but high-performance) models, privacy (their code not being trained on), and latency. We have some optimizations around time-to-first-token latency that would be difficult to do if we didn't have information about the model and their servers.


I see. Thanks Wesley for sharing and great to know it is not a priority. Also, the Mistral situation kinda makes me feel that big corps will want to host models.

Although, I feel Apple will break this trend and bring models to their chips rather than run them on the cloud. "Privacy first" will simply be a selling point for them but generally speaking cloud is not a big sell for them.

I am not at the level to do much optimizations, plus my product is a little more generic. To get to MVP, prompt engineering will probably be my sole focus.



Hi, what differentiates double from Cursor?


How do you guys compare to codium [0]? Also, any plans to support vim/neovim integration (codium has pretty good support in place [1]). Thanks.

[0] - https://www.codium.ai

[1] - https://github.com/Exafunction/codeium.vim



I think the tldr would be that they have more products (for example, their agent to write git commit messages). In the products we do have (autocomplete, chat), we spend a lot of time to get the details right. For example for autocomplete:

* we always close any brackets opened by autocomplete (and never extra brackets, which is the most annoying thing about github copilot)

* we automatically add import statements for libraries that autocomplete used

* mid-line completions

* we turn off autocomplete when you're writing a comment to avoid disrupting your train of thought

You can read more about these small details here: https://docs.double.bot/copilot

As you noted we don't have a vim integration yet, but it is on our roadmap!



Do note that Codium and Codeium are two completely separate companies. They work in related fields but have very different approaches.


Seems like the API is less reliable than GPT-4 so far, but I guess it makes sense for the endpoint to be popular at launch!


> seems like the first API alternative to GPT4

What about Ultra?



To be clear: Is this Claude 3 Opus or the Sonnet model?


opus. only the best!


Awesome! I like the inline completions.

But could you let the users choose their keyboard shortcuts before setting the default ones?



Thanks for the feedback. I was actually reworking the default shortcuts and the onboarding process when I got pre-empted by claude. I was planning to change the main actions to alt-j, alt-k to minimize conflicts.

Are you asking because it conflicts with an existing shortcut on your setup? Or another reason?



Yes, it conflicts with some of my other shortcuts, but more generally, I think it'd be better to have consistent shortcuts, like CMD-CTRL-i for inline completion, CMD-CTRL-c for chat, etc.


more early impressions on performance: besides the endpoint erroring out at a higher rate than openai, time-to-first-token is also much slower :(

p50: 2.14s p95: 3.02s

And these aren't super long prompts either. vs gpt4 ttft:

p50: 0.63s p95: 1.47s



FYI That website doesn't work on QtWebEngine5.

(Chromium 87.0.4280.144 (Jan. 2021), plus security patches up to 119.0.6045.160 (Nov. 2023).)



Thank you for the report! We're using Mintlify for the docs (which that URL links to). Let me report it upstream to see if they can fix.


How do I change GPT4 to Claude 3 in double.bot?


It defaults to Claude 3 right now so I could get it out quickly, but I'm working on a toggle for the front-end now to switch between the two.


for future readers, the setting is now shipped in >v0.49. The default is now back to GPT-4 as it has lower latency but you can manually change it to Claude 3 in settings if you wish to try out Anthropic's new model.


It seems that a lot of the techies here have found it easy to find settings, but I seem to have trouble with that. Would you mind assisting me?


It's in the same place as settings are for any installed VSCode extension.


Yeah, I eventually found it. Thanks anyway :)

I noticed it might actually be a little more censored than the lmsys version. Lmsys seems more fine with roleplaying, while the one on Double doesn't really like it



Emacs implementation when? ;)


Just added it to gptel. (No image support though, it's a text-only LLM client.)


Thank you for working on gptel, it's an excellent package. I'm still using the copilot more because of the pure speed (competing with company mode/LSP), but I never use it if it suggests more than one line. The quality is just not there. But having access to gpt4 from gptel has been very useful. Can't wait to play around with Claude 3.


Wow, this was fast. Excellent!


If you use Emacs you're expected to know your way around programming and not need copilots :)


You have not checked GPTel then. It is super useful! Emacs really makes a good pairing with LLMs.


I just checked - surprisingly I cannot find any Emacs AI implementation that supports Claude's API.


Just added it to gptel.


Very nice!




