GPT-4o's Memory Breakthrough – Needle in a Needlestack

Original link: http://nian.llmonpy.ai/

Tom Burns's "Needle in a Needlestack" is a new benchmark for evaluating how well large language models (LLMs) attend to specific details across a long context window. The benchmark builds prompts containing thousands of limericks and asks a question about a single limerick at a specific location. Until recently, none of the LLMs tested performed well: GPT-4 Turbo and Claude-3 Sonnet both attempted the task without success. GPT-4o, however, has made a major breakthrough, showing near-perfect results. Mistral's models struggled with the benchmark: the smaller model started at only about 50% accuracy, while the larger version improved to about 70%. The findings also indicate that LLMs do best with shorter inputs, and that repeating the target limerick produced a substantial improvement for GPT-3.5-turbo. For further details, see the linked methodology page, or contact the author with questions.

In this article, the author discusses various limitations of large language models (LLMs) when answering open-ended questions, particularly questions about limericks. They found that without the limerick present in the context, LLMs struggle to give a correct answer. The author raises concerns about whether training on such data affects model performance, suggesting that models may rely heavily on previously learned data rather than genuinely understanding the context. They also propose improvements, such as evaluating model performance with the limerick excluded from the context. In addition, they share examples of impressive performance, particularly in identifying relevant sentences within large amounts of text, and emphasize the importance of continuous learning and adaptation for enhancing LLM capabilities.

Original text

by Tom Burns

Needle in a Needlestack is a new benchmark to measure how well LLMs pay attention to the information in their context window. NIAN creates a prompt that includes thousands of limericks and the prompt asks a question about one limerick at a specific location. Here is an example prompt that includes 2500ish limericks. Until today, no LLM was very good at this benchmark. Here are GPT-4 Turbo and Claude-3 Sonnet’s attempts at this benchmark:

However, GPT-4o has made a breakthrough! Check out how well it does on this benchmark:

Wow! GPT-4o is almost perfect

I wonder when OpenAI will reveal what they did to make GPT-4o so much better than GPT-4 Turbo?

Mistral’s models are really nice to work with. Their API is very fast and consistent. However, Mistral’s new 8x22 model had a really hard time with this benchmark. Even at the beginning of the prompt it could only answer the question correctly 50% of the time. Mistral Large did better, but still only got up to 70% correct. Note: I used OpenAI’s tokenizer to estimate token counts. Mistral uses a different tokenizer that generates about 25% more tokens, so the token counts in the graphs are lower than the actual token counts.
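For reference, a token-count estimate like the one described above can be done with OpenAI's tiktoken library; this is a minimal sketch, and the 1.25 adjustment factor for Mistral is just the rough figure cited above, not an exact conversion:

```python
# Sketch of the token-count estimate described above, using OpenAI's tiktoken.
# The 1.25 factor is only the approximate "25% more tokens" figure cited
# above for Mistral's tokenizer, not an exact conversion.
import tiktoken

def estimate_tokens(text: str, model: str = "gpt-4") -> int:
    encoding = tiktoken.encoding_for_model(model)
    return len(encoding.encode(text))

sample_prompt = "There once was a model named Flo ..."  # stand-in for a full prompt
openai_tokens = estimate_tokens(sample_prompt)
mistral_tokens_approx = int(openai_tokens * 1.25)  # Mistral yields ~25% more tokens
print(openai_tokens, mistral_tokens_approx)
```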

Models do quite a bit better with shorter prompts. Here is Mistral 7B with a 16k-ish token prompt vs. a 32k-ish one:

Repeating information can make a very big difference on this test. GPT-3.5-turbo does dramatically better when the limerick the prompt asks about is repeated 10 times.
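To make the setup concrete, here is a minimal sketch of how a prompt like this can be assembled, including the repetition trick just described. This is illustrative only, not the actual NIAN code (which is linked below); the limericks, question, and function names are placeholders:

```python
# Illustrative sketch of a needle-in-a-needlestack prompt builder (not the
# actual NIAN code). The target limerick is inserted at a chosen position
# among distractor limericks, optionally repeated several times.

def build_prompt(distractors: list[str], needle: str, position: int,
                 question: str, repeats: int = 1) -> str:
    limericks = distractors[:position] + [needle] * repeats + distractors[position:]
    body = "\n\n".join(limericks)
    return (
        "Below are many limericks. Answer the question about one of them.\n\n"
        f"{body}\n\nQuestion: {question}\n"
    )

# Example with dummy data: ~2500 distractors, needle placed mid-prompt,
# repeated 10 times as in the GPT-3.5-turbo experiment above.
distractors = [f"(limerick #{i} text)" for i in range(2500)]
needle = "There once was a coder named Lou ..."
prompt = build_prompt(distractors, needle, position=1250,
                      question="What was the coder in the limerick named?",
                      repeats=10)
```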

The code for this benchmark is here. It should be easy to add support for additional models. You can read more about how answers are evaluated and questions are vetted on the methodology page. If you have any questions, please contact me.
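As a hint of what automated grading might look like, answers can be checked by a second LLM call. The actual evaluation and question-vetting procedure is described on the methodology page; the judge prompt and model choice below are my own assumptions, not the benchmark's:

```python
# Hypothetical sketch of LLM-based answer grading; the real evaluation
# procedure is on the methodology page linked above.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_answer(question: str, expected: str, actual: str) -> bool:
    """Ask a judge model whether the answer matches the expected one."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\nExpected answer: {expected}\n"
                f"Model answer: {actual}\n"
                "Does the model answer match the expected answer? Reply YES or NO."
            ),
        }],
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")
```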
