Notes on OpenAI's new o1 chain-of-thought models

Original link: https://simonwillison.net/2024/Sep/12/openai-o1/


On September 12, 2024, OpenAI announced the release of two new AI models called "o1-preview" and "o1-mini". These models were previously known by the codename "strawberry", and according to OpenAI's description, they focus on improving the models' ability to "think before responding". Training involved a reinforcement learning method that taught the models to use their chain of thought effectively in a highly data-efficient training process. In comparison to previous models, such as GPT-4o, the new models exhibit improved "reasoning" abilities through better handling of complex prompts requiring significant amounts of backtracking and thinking. However, the new models come with trade-offs in cost, performance, and application limits, such as slower response times and restrictions on the types of functions they can perform. Additionally, the use of reasoning tokens adds to the API bill, despite those tokens remaining hidden from the responses themselves. While the new models offer promising improvements for specific tasks demanding deep reasoning, they may not be ideal for applications requiring quicker response times, image input, or function calling. Access to the new API is limited to Tier 5 accounts that have spent a minimum of $1,000 on API credits. Furthermore, system prompts, streaming, tool usage, batch calls, and image inputs are not supported, making the new models less versatile than their predecessors. In conclusion, OpenAI has introduced new AI models – o1-preview and o1-mini – focused on enhancing reasoning capabilities through reinforcement learning. While offering improved handling of complex, multi-step problems, these new models introduce various constraints in terms of cost, functionality, and accessibility.


12th September 2024

OpenAI released two major new preview models today: o1-preview and o1-mini (that mini one is not a preview)—previously rumored as having the codename “strawberry”. There’s a lot to understand about these models—they’re not as simple as the next step up from GPT-4o, instead introducing some major trade-offs in terms of cost and performance in exchange for improved “reasoning” capabilities.

Trained for chain of thought

OpenAI’s elevator pitch is a good starting point:

We’ve developed a new series of AI models designed to spend more time thinking before they respond.

One way to think about these new models is as a specialized extension of the chain of thought prompting pattern—the “think step by step” trick that we’ve been exploring as a community for a couple of years now, first introduced in the paper Large Language Models are Zero-Shot Reasoners in May 2022.
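For context, the classic zero-shot version of that trick is just a suffix appended to the prompt. A minimal sketch, using the phrasing popularized by that paper (the helper function name is my own invention):

```python
# The zero-shot chain-of-thought trick from "Large Language Models are
# Zero-Shot Reasoners": append a nudge that elicits step-by-step reasoning
# before the model commits to a final answer.
def with_cot(prompt, nudge="Let's think step by step."):
    return f"{prompt}\n\n{nudge}"

print(with_cot(
    "A bat and a ball cost $1.10 in total. "
    "The bat costs $1.00 more than the ball. "
    "How much does the ball cost?"
))
```

The o1 models effectively bake a much more sophisticated version of this behavior into the model itself, rather than relying on the prompt to trigger it.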

OpenAI’s article Learning to Reason with LLMs explains how the new models were trained:

Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them.

[...]

Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses. It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working. This process dramatically improves the model’s ability to reason.

Effectively, this means the models can better handle significantly more complicated prompts where a good result requires backtracking and “thinking” beyond just next token prediction.

I don’t really like the term “reasoning” because I don’t think it has a robust definition in the context of LLMs, but OpenAI have committed to using it here and I think it does an adequate job of conveying the problem these new models are trying to solve.

Low-level details from the API documentation

Some of the most interesting details about the new models and their trade-offs can be found in their API documentation:

For applications that need image inputs, function calling, or consistently fast response times, the GPT-4o and GPT-4o mini models will continue to be the right choice. However, if you’re aiming to develop applications that demand deep reasoning and can accommodate longer response times, the o1 models could be an excellent choice.

Some key points I picked up from the docs:

  • API access to the new o1-preview and o1-mini models is currently reserved for tier 5 accounts—you’ll need to have spent at least $1,000 on API credits.
  • No system prompt support—the models use the existing chat completion API but you can only send user and assistant messages.
  • No streaming support, tool usage, batch calls or image inputs either.
  • “Depending on the amount of reasoning required by the model to solve the problem, these requests can take anywhere from a few seconds to several minutes.”
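Putting those constraints together, a request to the new models looks much like a regular chat completions call, minus the unsupported features. Here’s a minimal sketch of the request payload (the helper function is my own; the payload shape follows the standard chat completions API and the model name comes from the announcement):

```python
import json

# Build a chat completions payload for o1-preview. Per the docs, the o1
# models accept only "user" and "assistant" messages - no "system" role -
# and streaming is not supported.
def build_o1_request(prompt, history=None):
    messages = list(history or []) + [{"role": "user", "content": prompt}]
    for message in messages:
        if message["role"] not in ("user", "assistant"):
            raise ValueError(f"unsupported role for o1: {message['role']}")
    return {
        "model": "o1-preview",
        "messages": messages,
        # Deliberately no "stream": True and no tool definitions.
    }

payload = build_o1_request("How many words are in your response to this prompt?")
print(json.dumps(payload, indent=2))
```

The same payload shape works against the existing chat completions endpoint; the difference is entirely in what you are allowed to leave out.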

Most interesting is the introduction of “reasoning tokens”—tokens that are not visible in the API response but are still billed and counted as output tokens. These tokens are where the new magic happens.

Thanks to the importance of reasoning tokens—OpenAI suggests allocating a budget of around 25,000 of these for prompts that benefit from the new models—the output token allowance has been increased dramatically—to 32,768 for o1-preview and 65,536 for the supposedly smaller o1-mini! These are an increase from the gpt-4o and gpt-4o-mini models which both currently have a 16,384 output token limit.
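Since reasoning tokens bill as output tokens, the cost of a response depends heavily on how much hidden thinking it did. A rough sketch of the arithmetic, using hypothetical per-million-token prices in the comments (check OpenAI’s pricing page for the real numbers):

```python
# Estimate the bill for an o1 response. Reasoning tokens are invisible in
# the response body but are counted and billed as output tokens.
def estimate_cost(prompt_tokens, visible_tokens, reasoning_tokens,
                  input_price_per_m, output_price_per_m):
    billed_output = visible_tokens + reasoning_tokens
    return (prompt_tokens * input_price_per_m
            + billed_output * output_price_per_m) / 1_000_000

# Hypothetical prices in dollars per million tokens - placeholders only.
cost = estimate_cost(
    prompt_tokens=1_000,
    visible_tokens=500,
    reasoning_tokens=25_000,   # the suggested ~25,000 token budget
    input_price_per_m=15.0,
    output_price_per_m=60.0,
)
print(f"${cost:.4f}")
```

Note how the 25,000-token reasoning budget dominates: at these placeholder prices the hidden thinking costs fifty times more than the visible answer.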

One last interesting tip from that API documentation:

Limit additional context in retrieval-augmented generation (RAG): When providing additional context or documents, include only the most relevant information to prevent the model from overcomplicating its response.

This is a big change from how RAG is usually implemented, where the advice is often to cram as many potentially relevant documents as possible into the prompt.
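A minimal sketch of what that advice looks like in practice: instead of concatenating every retrieved document, keep only the top few by relevance score (the scoring here is a stand-in for whatever retriever you actually use, and the prompt template is my own):

```python
# Trim retrieved context to the top-k most relevant chunks before building
# the prompt, rather than cramming in every potentially relevant document.
def build_rag_prompt(question, scored_docs, k=3):
    # scored_docs: list of (relevance_score, text) pairs from a retriever.
    top = sorted(scored_docs, key=lambda pair: pair[0], reverse=True)[:k]
    context = "\n\n".join(text for _, text in top)
    return f"Context:\n{context}\n\nQuestion: {question}"

docs = [
    (0.2, "tangential aside"),
    (0.9, "key fact"),
    (0.6, "useful detail"),
    (0.1, "irrelevant"),
]
prompt = build_rag_prompt("What is the key fact?", docs, k=2)
print(prompt)
```

With `k=2` only the two highest-scoring chunks survive—exactly the inversion of the usual “stuff the context window” approach.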

Hidden reasoning tokens

A frustrating detail is that those reasoning tokens remain invisible in the API—you get billed for them, but you don’t get to see what they were. OpenAI explain why in Hiding the Chains of Thought:

Assuming it is faithful and legible, the hidden chain of thought allows us to “read the mind” of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.

Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users.

So two key reasons here: one is around safety and policy compliance: they want the model to be able to reason about how it’s obeying those policy rules without exposing intermediary steps that might include information that violates those policies. The second is what they call competitive advantage—which I interpret as wanting to avoid other models being able to train against the reasoning work that they have invested in.

I’m not at all happy about this policy decision. As someone who develops against LLMs, interpretability and transparency are everything to me—the idea that I can run a complex prompt and have key details of how that prompt was evaluated hidden from me feels like a big step backwards.

Examples

OpenAI provide some initial examples in the Chain of Thought section of their announcement, covering things like generating Bash scripts, solving crossword puzzles and calculating the pH of a moderately complex solution of chemicals.

These examples show that the ChatGPT UI version of these models does expose details of the chain of thought... but it doesn’t show the raw reasoning tokens, instead using a separate mechanism to summarize the steps into a more human-readable form.

OpenAI also have two new cookbooks with more sophisticated examples, which I found a little hard to follow.

I asked on Twitter for examples of prompts that people had found which failed on GPT-4o but worked on o1-preview. A couple of my favourites:

  • How many words are in your response to this prompt? by Matthew Berman—the model thinks for ten seconds across five visible turns before answering “There are seven words in this sentence.”
  • Explain this joke: “Two cows are standing in a field, one cow asks the other: ‘what do you think about the mad cow disease that’s going around?’. The other one says: ‘who cares, I’m a helicopter!’” by Fabian Stelzer—the explanation makes sense, apparently other models have failed here.

Great examples are still a bit thin on the ground though. Here’s a relevant note from OpenAI researcher Jason Wei, who worked on creating these new models:

Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.

Ethan Mollick has been previewing the models for a few weeks, and published his initial impressions. His crossword example is particularly interesting for the visible reasoning steps, which include notes like:

I noticed a mismatch between the first letters of 1 Across and 1 Down. Considering “CONS” instead of “LIES” for 1 Across to ensure alignment.

What’s new in all of this

It’s going to take a while for the community to shake out the best practices for when and where these models should be applied. I expect to continue mostly using GPT-4o (and Claude 3.5 Sonnet), but it’s going to be really interesting to see us collectively expand our mental model of what kind of tasks can be solved using LLMs given this new class of model.

I expect we’ll see other AI labs, including the open model weights community, start to replicate some of these results with their own versions of models that are specifically trained to apply this style of chain-of-thought reasoning.
