Source: https://news.ycombinator.com/item?id=41523070
In short, the commenter works through several factors to estimate the efficiency and cost of large language models, specifically OpenAI's new o1. Summary:
1. Scale: the benchmark plot spans roughly 2 orders of magnitude (~100-200x) of inference cost.
2. Efficiency: o1 plausibly costs ~1000x as much as the smallest model (gpt-4o-mini) in inference compute, driven by the raw number of chain-of-thought tokens rather than per-token model cost.
3. Length: the reasoning behind a response varies widely, from hundreds to tens of thousands of tokens, and the hidden chain of thought can be ~10x the size of the visible response.
4. Pricing: per OpenAI's pricing page, o1 is priced well above the smaller models on both input and output tokens.
Overall, the commenter concludes that o1 spends substantial compute producing long reasoning traces, which explains both the latency and the high cost of this powerful but expensive model.
First, from the original plot, we have roughly 2 orders of magnitude of inference cost to cover (~100-200x).
Next, from the cost plots: this is a super handwavy guess, but since 5.77 / 0.32 = ~18, and the relative cost of gpt-4o vs gpt-4o-mini is ~20-30x, this roughly lines up. It implies that o1 costs ~1000x as much as gpt-4o-mini for inference (not due to per-token model cost, just due to the raw number of chain-of-thought tokens it produces). So my first "statement" is that I trust the "Math performance vs Inference Cost" plot on the o1-mini page [1] to accurately represent the "cost" of inference for these benchmark tests. This gives a set of relative "cost" numbers between the o1 and 4o models.
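As a sketch, the back-of-envelope looks like this (every input is eyeballed from the plots [1] and pricing page [2], and composing the ratios this way is just my reading of the handwave):

    # All inputs below are eyeballed assumptions, not published figures.
    plot_ratio = 5.77 / 0.32      # cost spread read off the o1-mini plot [1]
    print(round(plot_ratio))      # ~18, same ballpark as the ~20-30x
                                  # gpt-4o vs gpt-4o-mini price gap [2]
    cot_multiplier = 100          # ~2 orders of magnitude more CoT tokens
    per_token_gap = 25            # 4o vs 4o-mini per-token price, ~20-30x
    print(cot_multiplier * per_token_gap)  # ~2500x vs 4o-mini, i.e. the
                                           # order-of-magnitude "~1000x" guess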
I'm also going to assume that o1 is inherently roughly the same size as 4o, and from that plus the SVG, roughly estimate that they did a "net" decoding of ~100x for the o1 benchmarks in total (5.77 vs 354.77-635).
Next, from the CoT examples they gave us: they show a CoT preview where (for the math example) it says "...more lines cut off...". A quick copy-paste of what they did include comes to ~10k tokens (not sure how reliable copy-paste is, though), and from the ciphertext example I got ~5k tokens of CoT, while the response itself is only ~800 tokens. So this implies a ~10x ratio of CoT (decoded tokens) to visible response in the examples shown. It's possible these are "middle of the pack" / "average quality" examples rather than the full CoT reasoning decoding they claim to use (e.g., on the log-scale plot these would sit in the middle, at roughly 5k-10k tokens of chain of thought). This also feels reasonable given that their API docs [3] show limits on "reasoning_tokens" (which they also count).
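If you want to redo the copy-paste counting less sloppily, here's a minimal sketch (assuming o1 shares gpt-4o's o200k_base tokenizer, which OpenAI hasn't confirmed; the file names are placeholders for whatever you pasted):

    import tiktoken  # pip install tiktoken

    # Assumption: o1 uses the same o200k_base encoding as gpt-4o.
    enc = tiktoken.get_encoding("o200k_base")

    def count_tokens(text: str) -> int:
        return len(enc.encode(text))

    # Hypothetical files holding the pasted CoT preview and visible response.
    cot = open("cipher_cot.txt").read()
    answer = open("cipher_answer.txt").read()
    print(count_tokens(cot), count_tokens(answer))  # I got ~5k vs ~800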
All together, the CoT examples, pricing page [2], and reasoning page [3] imply that the reasoning itself can vary in length by about ~100x (2 orders of magnitude): e.g., 500 or 5k tokens (from the examples) up to 65,536 tokens of reasoning output (directly called out as the maximum output token limit).
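That range is easy to sanity-check (the 500 and 65,536 endpoints come from the examples above and the documented max output cap [3]):

    import math

    # ~2 decades between the shortest example CoT and the documented cap.
    low, high = 500, 65_536
    print(math.log10(high / low))   # ~2.1, i.e. roughly 100x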
Taking them at their word that "pass@1" is honest and that they are not doing k-ensembles, I think the only reasonable assumption is that they're decoding their CoT for longer. Given the model's roughly ~128k context limit, I suspect the "top end" of this plot is ~100k tokens of "chain of thought" self-reflection.
Finally, at around 100 tokens per second (gpt-4o decoding speed), my guess for their "benchmark" decoding time lands between ~16 minutes at the top end (a full 100k-token CoT, 1 shot) for a single test prompt and ~10 seconds at the low end. So for the X axis on the log-scale plot, my estimate would be ~3-10 seconds at the bottom, and 100-200x that value at the top.
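The timing arithmetic, as a sketch (the 100 tokens/sec throughput is my assumed gpt-4o-class decoding speed, not a measured o1 number):

    # Decode time = CoT tokens / decoding throughput; all inputs are guesses.
    tokens_per_second = 100          # assumed gpt-4o-class decoding speed
    top_end_cot = 100_000            # near-full-context, single-shot CoT
    low_end_cot = 500                # short CoT like the small examples
    print(top_end_cot / tokens_per_second / 60)  # ~16.7 minutes at the top
    print(low_end_cot / tokens_per_second)       # ~5 s, within the ~3-10 s bottom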
All together, to answer your question: I think the 80% accuracy result took about ~10-15 minutes to complete. I also believe that the decoding cost of the o1 model is very close to the decoding cost of 4o; it just requires many more reasoning tokens to complete (and o1-mini is likewise comparable to 4o-mini, but requiring more reasoning tokens).
[1] https://openai.com/index/openai-o1-mini-advancing-cost-effic...
[2] https://openai.com/api/pricing/
[3] https://platform.openai.com/docs/guides/reasoning