Lessons after a Half-billion GPT Tokens

Original link: https://kenkantzer.com/lessons-after-a-half-billion-gpt-tokens/

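The author's startup, gettruss.io, shipped several LLM-heavy features over the past six months. Because the narrative around LLMs on Hacker News differs from that experience, the author shares some insights gained from processing roughly 500 million tokens. The main lessons include:

1. Keep prompts short: concise prompts perform better than detailed instructions. Over-specified instructions confuse the LLM, while trusting the model's grasp of common knowledge improves results.
2. Prefer plain chat: OpenAI's basic chat interface, without extra layers such as Langchain, delivers good results. Keep it simple and stick to the essentials.
3. Improve latency with the streaming API: ChatGPT's streaming API improves the user experience by showing output as it is typed and reducing perceived latency.
4. The null hypothesis problem: avoiding requests that ask the LLM for an "empty" result reduces errors, especially in ambiguous cases. Only send the model a request when there is relevant material to work with.
5. The context window misconception: the term "context window" is somewhat misleading. GPT-4 has a large input window, but its output window remains limited. Adjust expectations accordingly and prioritize efficient request handling.

These findings have let gettruss.io use LLMs effectively for its specific use cases while achieving excellent results. Continued exploration and progress in this technology is expected to bring significant advances.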

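The team processes more than 5 billion tokens per month, and the writer oversees that operation. Based on their experience, they identify three main points:

1. Heavy abstractions such as Langchain may look promising, but in practice handling large language models directly in code is simpler and more reliable. Despite progress, hallucination still poses a significant challenge, especially in summarization and reasoning.
2. The writer acknowledges the transformative potential of these language models but remains cautiously optimistic, seeing them as a force multiplier rather than a revolutionary breakthrough on the scale of the internet. Certain roles may be significantly affected, and a difficult transition period may follow as bots appear across platforms.
3. Suggested improvements include asking for facts rather than making open-ended requests, questioning the reliability of rankings, writing queries in English, and providing clear JSON examples. Trying alternative models and services such as OpenRouter is also recommended, given cost and potential limitations.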

Original text

My startup (gettruss.io) released a few LLM-heavy features in the last six months, and the narrative around LLMs that I read on Hacker News is now starting to diverge from my reality, so I thought I’d share some of the more “surprising” lessons after churning through just north of 500 million tokens, by my estimate.

Some details first:

– we’re using the OpenAI models, see the Q&A at the bottom if you want my opinion of the others

– our usage is 85% GPT-4, and 15% GPT-3.5

– we deal exclusively with text, so no gpt-4-vision, Sora, whisper, etc.

– we have a B2B use case – strongly focused on summarize/analyze-extract, so YMMV

– 500M tokens actually isn’t as much as it seems – it’s about 750,000 pages of text, to put it in perspective
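
As a quick sanity check on that figure, here is a small sketch of the arithmetic. The 0.75 words-per-token ratio is an assumption (a common rule of thumb for English text), not something stated in the post, and the method names are made up for illustration.

```ruby
# Back-of-the-envelope check of "500M tokens is about 750,000 pages".
# WORDS_PER_TOKEN is an assumed rule of thumb, not a figure from the post.
WORDS_PER_TOKEN = 0.75

def tokens_per_page(total_tokens, pages)
  total_tokens / pages.to_f
end

def words_per_page(total_tokens, pages, words_per_token = WORDS_PER_TOKEN)
  tokens_per_page(total_tokens, pages) * words_per_token
end

puts format("%.0f tokens per page, roughly %.0f words per page",
            tokens_per_page(500_000_000, 750_000),
            words_per_page(500_000_000, 750_000))
```

Under that assumption the estimate works out to a fairly ordinary page of text, which is consistent with the claim.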

Lesson 1: When it comes to prompts, less is more

We consistently found that not enumerating an exact list or instructions in the prompt produced better results, if that thing was already common knowledge. GPT is not dumb, and it actually gets confused if you over-specify.

This is fundamentally different than coding, where everything has to be explicit.

Here’s an example where this bit us:

One part of our pipeline reads some block of text and asks GPT to classify it as relating to one of the 50 US states, or the Federal government. This is not a hard task – we probably could have used string/regex, but there are enough weird corner cases that it would’ve taken longer. So our first attempt was (roughly) something like this:

Here's a block of text. One field should be "locality_id", and it should be the ID of one of the 50 states, or federal, using this list:
[{"locality": "Alabama", "locality_id": 1}, {"locality": "Alaska", "locality_id": 2} ... ]

This worked most of the time (I’d estimate >98%), but failed often enough that we had to dig deeper.

While we were investigating, we noticed that another field, name, was consistently returning the full name of the state…the correct state – even though we hadn’t explicitly asked it to do that.

So we switched to a simple string search on the name to find the state, and it’s been working beautifully ever since.

I think in summary, a better approach would’ve been “You obviously know the 50 states, GPT, so just give me the full name of the state this pertains to, or Federal if this pertains to the US government.”
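
The string search on the returned name could look something like the sketch below. The table is deliberately partial, the IDs (including 51 for Federal) are illustrative rather than the real ones, and `locality_id_for` is a hypothetical helper name. Longer names are checked first so that "West Virginia" is not mistaken for "Virginia", or "Arkansas" for "Kansas".

```ruby
# Partial, illustrative lookup table; the real IDs are not given in the post.
LOCALITY_IDS = {
  "Alabama" => 1,
  "Alaska" => 2,
  "Arkansas" => 4,
  "Kansas" => 16,
  "Virginia" => 46,
  "West Virginia" => 48,
  "Federal" => 51 # assumed ID for the federal government
}.freeze

# Check longer names first so substrings of other names do not win.
NAMES_LONGEST_FIRST = LOCALITY_IDS.keys.sort_by { |name| -name.length }.freeze

# Returns the locality ID for the name GPT produced, or nil if nothing matches.
def locality_id_for(name)
  text = name.to_s.downcase
  hit = NAMES_LONGEST_FIRST.find { |locality| text.include?(locality.downcase) }
  hit && LOCALITY_IDS[hit]
end

p locality_id_for("State of West Virginia")
p locality_id_for("Atlantis")
```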

Why is this crazy? Well, it’s crazy that GPT’s quality and generalization can improve when you’re more vague – this is a quintessential marker of higher-order delegation / thinking.

(Random side note one: GPT was failing most often with the M states – Maryland, Maine, Massachusetts, Michigan – which you might expect of a fundamentally stochastic model.)

(Random side note two: when we asked GPT to choose an ID from a list of items, it got confused a lot less when we sent the list as prettified JSON, where each state was on its own line. I think \n is a stronger separator than a comma.)
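
One way to get "each state on its own line" is sketched below. Note that Ruby's `JSON.pretty_generate` puts every key on its own line, which is not quite the same thing, so this hypothetical helper (`one_item_per_line`) serializes each item compactly and joins them with newlines. The sample data is illustrative.

```ruby
require "json"

# Serialize an array so that each element sits on its own line.
# The result is still valid JSON.
def one_item_per_line(items)
  lines = items.map { |item| "  #{JSON.generate(item)}" }
  "[\n#{lines.join(",\n")}\n]"
end

states = [
  { "locality" => "Alabama", "locality_id" => 1 },
  { "locality" => "Alaska", "locality_id" => 2 }
]
puts one_item_per_line(states)
```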

Lesson 2: You don’t need langchain. You probably don’t even need anything else OpenAI has released in their API in the last year. Just chat. That’s it.

Langchain is the perfect example of premature abstraction. We started out thinking we had to use it because the internet said so. Instead, millions of tokens later, and probably 3-4 very diverse LLM features in production, and our openai_service file still has only one, 40-line function in it:

def extract_json(prompt, variable_length_input, number_retries)

The only API we use is chat. We always extract JSON. We don’t need JSON mode, or function calling, or assistants (though we do all that). Heck, we don’t even use system prompts (maybe we should…). When gpt-4-turbo was released, we updated one string in the codebase.

This is the beauty of a powerful generalized model – less is more.

Most of the 40 lines in that function handle errors from the OpenAI API’s regular 500s and closed sockets (though it’s gotten better, and given their load, it’s not surprising).
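
The post does not show the body of that function, so the following is only a guess at its general shape: call chat, pull the JSON out of the reply, and retry on transient failures. The `chat:` argument is a hypothetical callable standing in for the real API client, and the set of rescued errors is an assumption.

```ruby
require "json"

# Sketch of a chat-only helper. `chat` is a hypothetical callable that takes
# the full prompt text and returns the model's reply as a String.
def extract_json(prompt, variable_length_input, number_retries, chat:)
  attempts = 0
  begin
    attempts += 1
    reply = chat.call("#{prompt}\n\n#{variable_length_input}")
    # Keep only the outermost JSON object/array in case the model adds chatter.
    JSON.parse(reply[/[\[{].*[\]}]/m] || reply)
  rescue JSON::ParserError, IOError, SystemCallError
    # Retry on unparseable replies and socket-level failures, then give up.
    retry if attempts <= number_retries
    raise
  end
end

# Stand-in client for demonstration; a real one would call the chat API.
fake_chat = ->(_text) { 'Here you go: {"companies": ["Acme"]}' }
p extract_json("Extract the companies as JSON.", "Acme shipped a product.", 2, chat: fake_chat)
```

Passing the client in as a callable keeps the sketch testable without network access; the real function presumably talks to the API directly.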

There’s some auto-truncating we built in, so we don’t have to worry about context length limits. We have our own proprietary token-length estimator. Here it is:

if s.length > model_context_size * 3
  s = s[0, model_context_size * 3] # truncate it!
end

It fails in corner cases when there are a LOT of periods, or numbers (the token ratio is < 3 characters / token for those). So there’s another very proprietary try/catch retry logic:

if response_error_code == "context_length_exceeded"
  # assumes ActiveSupport's String#truncate, which returns a new string
  s = s.truncate((model_context_size * 3 / 1.3).to_i)
end

We’ve gotten quite far with this approach, and it’s been flexible enough for our needs.
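
Put together, the estimator and the retry could be sketched in plain Ruby as follows. `ContextLengthExceeded` is a hypothetical exception standing in for the API's "context_length_exceeded" error code, and the helper names are made up for illustration.

```ruby
CHARS_PER_TOKEN = 3 # the post's rough estimate of characters per token

# Hypothetical error a client could raise when the API reports
# "context_length_exceeded".
class ContextLengthExceeded < StandardError; end

# Cut the input down to the estimated character budget for the model.
def truncate_for_model(s, model_context_size, shrink: 1.0)
  limit = (model_context_size * CHARS_PER_TOKEN / shrink).floor
  s.length > limit ? s[0, limit] : s
end

# Try once with the normal budget; on overflow, retry once with a budget
# that is 1.3x smaller, as in the post.
def call_with_truncation(s, model_context_size)
  yield truncate_for_model(s, model_context_size)
rescue ContextLengthExceeded
  yield truncate_for_model(s, model_context_size, shrink: 1.3)
end

puts call_with_truncation("x" * 100, 10) { |text| text.length }
```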

Lesson 3: improving the latency with streaming API and showing users variable-speed typed words is actually a big UX innovation with ChatGPT.

We thought this was a gimmick, but users react very positively to variable-speed “typed” characters – this feels like the mouse/cursor UX moment for AI.
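
A minimal sketch of the consuming side, assuming the streaming client hands back text fragments one at a time. Here a plain array stands in for that stream, and `render_stream` is a hypothetical helper: it shows each fragment as soon as it arrives while also accumulating the full reply.

```ruby
# Print each streamed fragment immediately and return the full text.
# `chunks` is anything that responds to #each and yields Strings.
def render_stream(chunks, out: $stdout)
  full = +""
  chunks.each do |chunk|
    out.print(chunk)
    out.flush # push the fragment to the user right away
    full << chunk
  end
  full
end

render_stream(["Sun", "shine ", "Bakery\n"])
```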

Lesson 4: GPT is really bad at producing the null hypothesis

“Return an empty output if you don’t find anything” – is probably the most error-prone prompting language we came across. Not only does GPT often choose to hallucinate rather than return nothing, it also causes it to just lack confidence a lot, returning blank more often than it should.

Most of our prompts are in the form:

“Here’s a block of text that’s making a statement about a company, I want you to output JSON that extracts these companies. If there’s nothing relevant, return a blank. Here’s the text: [block of text]”

For a time, we had a bug where [block of text] could be empty. The hallucinations were bad. Incidentally, GPT loves to hallucinate bakeries, here are some great ones:

Sunshine Bakery
Golden Grain Bakery
Bliss Bakery

Fortunately, the solution was to fix the bug and not send it a prompt at all if there was no text (duh!). But it’s harder when “it’s empty” is harder to define programmatically, and you actually do need GPT to weigh in.
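
The fix for the easy case can be sketched as a guard in front of the model call. The `chat:` callable and the prompt constant are hypothetical stand-ins, and the reply is assumed to be a JSON array of company names.

```ruby
require "json"

# Hypothetical prompt, paraphrasing the form described in the post.
COMPANY_PROMPT = "Here's a block of text that's making a statement about a company. " \
                 "Output a JSON array of the companies it mentions. Here's the text: "

# Skip the model entirely when there is no text, so it has nothing to
# hallucinate about. `chat` is a hypothetical callable returning the reply.
def extract_companies(text, chat:)
  return [] if text.nil? || text.strip.empty?

  JSON.parse(chat.call(COMPANY_PROMPT + text))
end

fake_chat = ->(_prompt) { '["Acme"]' }
p extract_companies("Acme announced a new product.", chat: fake_chat)
p extract_companies("   ", chat: fake_chat)
```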

Lesson 5: “Context windows” are a misnomer – and they are only growing larger for input, not output

Little known fact: GPT-4 may have a 128k token window for input, but its output window is still a measly 4k! Calling it a “context window” is confusing, clearly.

But the problem is even worse – we often ask GPT to give us back a list of JSON objects. Nothing complicated, mind you: think of a JSON array of tasks, where each task has a name and a label.

GPT really cannot give back more than 10 items. Trying to have it give you back 15 items? Maybe it does it 15% of the time.

We originally thought this was because of the 4k context window, but we were hitting 10 items, and it’d only be maybe 700-800 tokens, and GPT would just stop.

Now, you can of course trade in output for input by giving it a prompt, ask for a single task, then give it (prompt + task), ask for the next task, etc. But now you’re playing a game of telephone with GPT, and have to deal with things like Langchain.
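
That loop could be sketched like this. The `chat:` callable, the prompt wording, and the "DONE" sentinel are all assumptions made for illustration, not the post's actual pipeline.

```ruby
require "json"

# Ask for one task at a time, feeding the tasks collected so far back in
# as input. `chat` is a hypothetical callable returning the model's reply.
def collect_tasks(prompt, max_items, chat:)
  tasks = []
  while tasks.length < max_items
    reply = chat.call(
      "#{prompt}\nTasks so far: #{JSON.generate(tasks)}\n" \
      "Return the next task as a JSON object, or DONE if there are no more."
    )
    break if reply.strip == "DONE"

    tasks << JSON.parse(reply)
  end
  tasks
end

replies = ['{"name": "Draft summary", "label": "writing"}', "DONE"]
fake_chat = ->(_prompt) { replies.shift }
p collect_tasks("List the tasks in this document.", 15, chat: fake_chat)
```

Asking for "DONE" is itself a null-hypothesis request of the kind Lesson 4 warns about, which is why the sketch also caps the loop with `max_items`.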

Lesson 6: vector databases, and RAG/embeddings are mostly useless for us mere mortals

I tried. I really did. But every time I thought I had a killer use case for RAG / embeddings, I was confounded.

I think vector databases / RAG are really meant for Search. And only search. Not search as in “oh – retrieving chunks is kind of like search, so it’ll work!”, real google-and-bing search. Here are some reasons why:

– there’s no cutoff for relevancy. There are some solutions out there, and you can create your own cutoff heuristics for relevancy, but they’re going to be unreliable. This really kills RAG in my opinion – you always risk poisoning your retrieval with irrelevant results or, by being too conservative, missing important results.

– why would you put your vectors in a specialized, proprietary database, away from all your other data? Unless you are dealing at a google/bing scale, this loss of context absolutely isn’t worth the tradeoff.

– unless you are doing a very open-ended search, of say – the whole internet – users typically don’t like semantic searches that return things they didn’t directly type. For most applications of search within business apps, your users are domain experts – they don’t need you to guess what they might have meant – they’ll let you know!

It seems to me (this is untested) that a much better use of LLMs for most search cases is to use a normal completion prompt to convert a user’s search into a faceted search, or even a more complex query (or heck, even SQL!). But this is not RAG at all.
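
Since the post flags this idea as untested, the following is only a sketch of what it might look like: ask the model for a JSON object of facets, then keep only the keys the application actually supports. The facet names, the prompt wording, and the `chat:` callable are all hypothetical.

```ruby
require "json"

# Hypothetical facets a business app might support.
ALLOWED_FACETS = %w[state status year].freeze

# Turn a free-text search into a facet hash, dropping anything the model
# invents outside the allowed keys. Returns {} if the reply is unusable.
def facets_from_query(user_query, chat:)
  reply = chat.call(
    "Convert this search into a JSON object using only these keys: " \
    "#{ALLOWED_FACETS.join(', ')}.\nSearch: #{user_query}"
  )
  parsed = JSON.parse(reply)
  return {} unless parsed.is_a?(Hash)

  parsed.select { |key, _value| ALLOWED_FACETS.include?(key) }
rescue JSON::ParserError
  {}
end

fake_chat = ->(_prompt) { '{"state": "Ohio", "year": 2023, "mood": "urgent"}' }
p facets_from_query("ohio filings from 2023", chat: fake_chat)
```

Filtering against a whitelist keeps the model's output from reaching the query layer unchecked, which matters even more if the target is SQL.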

Lesson 7: Hallucination basically doesn’t happen.

Every use case we have is essentially “Here’s a block of text, extract something from it.” As a rule, if you ask GPT to give you the names of companies mentioned in a block of text, it will not give you a random company (unless there are no companies in the text – there’s that null hypothesis problem!).

Similarly — and I’m sure you’ve noticed this if you’re an engineer — GPT doesn’t really hallucinate code – in the sense that it doesn’t make up variables, or randomly introduce a typo in the middle of re-writing a block of code you sent it. It does hallucinate the existence of standard library functions when you ask it to give you something, but again, I see that more as the null hypothesis. It doesn’t know how to say “I don’t know”.

But if your use case is entirely, “here’s the full context of details, analyze / summarize / extract” – it’s extremely reliable. I think you can see a lot of product releases recently that emphasize this exact use case.

So it’s all about good data in, good GPT responses out.

Where do I think all this is heading?

Rather than responding with some long-form post, here’s a quick Q&A:

Are we going to achieve AGI?

No. Not with this transformers + the data of the internet + $XB infrastructure approach.

Is GPT-4 actually useful, or is it all marketing?

It is 100% useful. This is the early days of the internet still. Will it fire everyone? No. Primarily, I see this lowering the barrier of entry to ML/AI that was previously only available to Google.

Have you tried Claude, Gemini, etc?

Yeah, meh. Actually in all seriousness, we haven’t done any serious A/B testing, but I’ve tested these with my day to day coding, and it doesn’t feel even close. It’s the subtle things mostly, like intuiting intention.

How do I keep up to date with all the stuff happening with LLMs/AI these days?

You don’t need to. I’ve been thinking a lot about The Bitter Lesson – that general improvements to model performance outweigh niche improvements. If that’s true, all you need to worry about is when GPT-5 is coming out. Nothing else matters, and everything else being released by OpenAI in the meantime (not including Sora, etc, that’s a whooolle separate thing) is basically noise.

So when will GPT-5 come out, and how good will it be?

I’ve been trying to read the signs with OpenAI, as has everyone else. I think we’re going to see incremental improvement, sadly. I don’t have a lot of hope that GPT-5 is going to “change everything”. There are fundamental economic reasons for that: between GPT-3 and GPT-3.5, I thought we might be in a scenario where the models were getting hyper-linear improvement with training: train it 2x as hard, it gets 2.2x better.

But that’s not the case, apparently. Instead, what we’re seeing is logarithmic. And in fact, token speed and cost per token are growing exponentially for incremental improvements.

If that’s the case, there’s some Pareto-optimal curve we’re on, and GPT-4 might be optimal: whereas I was willing to pay 20x for GPT-4 over GPT-3.5, I honestly don’t think I’d pay 20x per token to go from GPT-4 to GPT-5, not for the set of tasks that GPT-4 is used for.

GPT-5 may break that. Or, it may be the iPhone 5 to the iPhone 4. I don’t think that’s a loss!
