(Comments)

Original link: https://news.ycombinator.com/item?id=43707719

A Hacker News thread discusses OpenAI's release of the o3 and o4-mini models, focusing on the growth in OpenAI's model count (now 17, excluding older and specialized models) compared with Anthropic's 7. Commenters question whether these are genuinely new models or merely tweaked versions of existing ones released to generate media attention. Some find it hard to keep track of the different models and their capabilities, suggesting arbitrary names instead of the confusing sequential numbering. The thread also compares OpenAI's models with others such as Claude 3.7 and Gemini Pro 2.5, particularly on coding benchmarks. Some users report that the new models are not yet available despite OpenAI's announcement, while others note the release of the open-source coding tool Codex CLI. The inconsistent naming and the lack of clear comparisons between models draw criticism, leading some users to cancel their subscriptions in favor of alternatives like Gemini Pro 2.5. One suggestion is a more consistent naming scheme based on model size, release date, and specialization.

Related Articles
  • (Comments) 2025-04-14
  • (Comments) 2025-03-25
  • (Comments) 2025-03-25
  • Notes on OpenAI's new o1 chain-of-thought models 2024-09-14
  • (Comments) 2025-04-06

  • Original article
    OpenAI o3 and o4-mini (openai.com)
    78 points by maheshrijal 26 minutes ago | 62 comments










    So at this point OpenAI has 6 reasoning models, 4 flagship chat models, and 7 cost-optimized models. That's 17 models in total, and that's not even counting their older and more specialized ones. Compare this with Anthropic, which has 7 models in total and 2 main ones that they promote.

    This is just getting to be a bit much; it seems like they are trying to cover for the fact that they haven't actually done much. All these models feel like they took the exact same base model, tweaked a few things, and released it as an entirely new model rather than updating the existing ones. In fact, based on some of the other comments here, it sounds like these are just updates to their existing models, released as new models to create more media buzz.



    The old Chinese strategy of having 7343 different phone models with almost the same specs to confuse the customer better


    This is getting beyond ridiculous. Didn't they release 4.1 like two days ago or so? Why use this model over the other? It's hard to keep up.

    Seriously, what is going on there at OpenAI? Especially considering that they are aware of this issue, then double down on ... whatever this is.

    I understand that every model is basically its own thing, but at this point I'd rather give them arbitrary random names like Harold instead of sequential numberings that have nothing to do with the model's capabilities. They aren't even sequential!



    Very impressive! But under arguably the most important benchmark -- SWE-bench verified for real-world coding tasks -- Claude 3.7 still remains the champion.[1]

    Incredible how resilient the Claude models are at holding the best-in-class spot for coding.

    [1] But by only about 1%, and inclusive of Claude's "custom scaffold" augmentation (which in practice I assume no one uses?). In practice, the new OpenAI models might be effectively best in class now!



    Gemini 2.5 Pro is widely considered superior to 3.7 Sonnet now by heavy users, but they don't have an SWE-bench score. Shows that looking at one such benchmark isn't very telling.


    Claude got 63.2% according to the swebench.com leaderboard (listed as "Tools + Claude 3.7 Sonnet (2025-02-24)").[0] OpenAI said they got 69.1% in their blog post.

    [0] swebench.com/#verified



    I have doubts whether the live stream was really live.

    During the live-stream the subtitles are shown line by line.

    When subtitles are auto-generated, they pop up word by word, which I assume would need to happen during a real live stream.

    Line-by-line subtitles appear when the uploader provides their own captions for an existing video; the only way OpenAI could have provided captions ahead of time is if the "live-stream" isn't actually live.



    As a consumer, it is so exhausting keeping up with what model I should or can be using for the task I want to accomplish.


    I think it can be confusing if you're just reading the news. If you use ChatGPT, the model selector has good brief explanations and teaches you about newly available options even if you don't visit the dropdown. Anthropic does similarly.


    It's becoming a bit like iPhone 3, 4... 13, 25...

    Ok they are all phones that run apps and have a camera. I'm not an "AI power user", but I do talk to ChatGPT + Grok for daily tasks and use copilot.

    The big step function happened when they could search the web but not much else has changed in my limited experience.



    This is a very apt analogy.

    It offers the speaker confirmation that they're absolutely right: names are arbitrary.

    While also politely, implicitly, pointing out that the core issue is it doesn't matter to you (which is fine!), but that being the 10th person to say as much may just be contributing to a dull conversation.



    This one seems to make it easier — if the promises here hold true, the multi-modal support probably makes o4-mini-high OpenAI's best model for most tasks unless you have time and money, in which case it's o3-pro.


    I asked OpenAI how to choose the right USB cable for my device. Now the objects around me are shimmering and winking out of existence, one by one. Help


    As a consumer, I read the list and 5 word description of each thing one time and now I know. Don't you guys ever get tired of pretending you can't read and remember like 4 simple things? How do you even do your job?


    I’m assuming when you say “read once”, that implies reading once every single release?

    It’s confusing. If I’m confused, it’s confusing. This is UX 101.



    "good at advanced reasoning", "fast at advanced reasoning", "slower at advanced reasoning but more advanced than the good one but not as fast but cant search the internet", "great at code and logic", "good for everyday tasks but awful at everything else", "faster for most questions but answers them incorrectly", "can draw but cant search", "can search but cant draw", "good for writing and doing creative things"


    Putting the actual list would have made it too clear that I'm right I see


    Some people don't blindly trust the marketing department of the publisher


    Then it doesn't even matter what they name the model since it's just marketing that they wouldn't trust anyway.


    Surprisingly, they didn't provide a comparison to Sonnet 3.7 or Gemini Pro 2.5—probably because, while both are impressive, they're only slightly better by comparison.

    Let's see what the pricing looks like.



    They didn't provide a comparison either in the GPT-4.1 release and quite a few past releases, which is telling of their attitude as an org.


    It's pretty frustrating to see a press release with "Try on ChatGPT" and then not see the models available even though I'm paying them $200/mo.


    They're supposed to be released today for everyone, and o3-pro for Pro users in a few weeks:

    "ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high."

    with rate limits unchanged



    I see o4-mini on the $20 tier but no o3.


    Same


    I'm starting to be reminded of the razor blade business.


    Where's the comparison with Gemini 2.5 Pro?


    For coding, I like the Aider polyglot benchmark, since it covers multiple programming languages.

    Gemini 2.5 Pro got 72.9%

    o3 high gets 81.3%, o4-mini high gets 68.9%



    Exactly.


    Maybe they should ask the new models to generate a better name for themselves. It's getting quite confusing.


    A suggestion for OpenAI to create more meaningful model names:

    {Size}-{Quarter/Year}-{Speed/Accuracy}-{Specialty}

    Where:

    * Size is XS/S/M/L/XL/XXL to indicate overall capability level

    * Quarter/Year like Q2-25

    * Speed/Accuracy indicated as Fast/Balanced/Precise

    * Optional specialty tag like Code/Vision/Science/etc

    Example model names:

    * L-Q2-25-Fast-Code (Large model from Q2 2025, optimized for speed, specializes in coding)

    * M-Q4-24-Balanced (Medium model from Q4 2024, balanced speed/accuracy)
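
    A minimal sketch of how names following that proposal could be composed; the helper function and its signature are purely illustrative, not anything OpenAI uses:

        # Toy helper for the proposed {Size}-{Quarter/Year}-{Speed/Accuracy}-{Specialty} scheme.
        def model_name(size, quarter, year, mode, specialty=None):
            parts = [size, f"Q{quarter}-{year % 100:02d}", mode]
            if specialty:
                parts.append(specialty)
            return "-".join(parts)

        print(model_name("L", 2, 2025, "Fast", "Code"))  # L-Q2-25-Fast-Code
        print(model_name("M", 4, 2024, "Balanced"))      # M-Q4-24-Balanced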



    Maybe OpenAI needs an easy mode for all these people saying 5 choices of models (and that's only if you pay) is simply too confusing for them.

    They even provide a description in the UI of each before you select it, and it defaults to a model for you.

    If you just want an answer of what you should use and can't be bothered to research them, just use o3(4)-mini and call it a day.



    I personally like being able to choose because I understand the tradeoffs and want to choose the best one for what I’m asking. So I hope this doesn’t go away.

    But I agree that they probably need some kind of basic mode to make things easier for the average person. The basic mode should decide automatically what model to use and hide this from the user.



      ChatGPT Plus, Pro, and Team users will see o3, o4-mini, and o4-mini-high in the model selector starting today, replacing o1, o3‑mini, and o3‑mini‑high.
    
    I subscribe to pro but don't yet see the new models (either in the Android app or on the web version).


    Same...


    The pace of notable releases across the industry right now is unlike anything I remember since I started doing this in the early 2000s. And it feels like it's accelerating.


    Not really. We’re definitely in the incremental improvement stage at this point. Certainly no indication that progress is “accelerating”.


    I wish companies would adhere to a consistent naming scheme, like --.


    Finally, a new SOTA model on SWE-bench. Love to see this progress, and nice to see OpenAI finally catching up in the coding domain.


    o3 is cheaper than o1. (per 1M tokens)

    • o3 Pricing:

      - Input: $10.00  
    
      - Cached Input: $2.50  
    
      - Output: $40.00
    
    • o1 Pricing:

      - Input: $15.00  
    
      - Cached Input: $7.50  
    
      - Output: $60.00
    
    o4-mini pricing remains the same as o3-mini.
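
    For a rough sense of what that difference means in practice, here is a back-of-the-envelope calculation using the per-1M-token prices quoted above; the workload numbers are made up for illustration:

        # Illustrative cost comparison; prices are USD per 1M tokens as quoted above.
        PRICES = {
            "o3": {"input": 10.00, "cached": 2.50, "output": 40.00},
            "o1": {"input": 15.00, "cached": 7.50, "output": 60.00},
        }

        def cost(model, input_tok, cached_tok, output_tok):
            p = PRICES[model]
            return (input_tok * p["input"] + cached_tok * p["cached"]
                    + output_tok * p["output"]) / 1_000_000

        # Hypothetical month: 2M fresh input, 1M cached input, 0.5M output tokens.
        for m in PRICES:
            print(m, round(cost(m, 2_000_000, 1_000_000, 500_000), 2))
        # o3 42.5, o1 67.5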


    Buried in the article, a new CLI for coding:

    > Codex CLI is fully open-source at https://github.com/openai/codex today.



    4o and o4 at the same time. Excellent work on the product naming, whoever did that.


    I’m not sure I fully understand the rationale of having newer mini versions (eg o3-mini, o4-mini) when previous thinking models (eg o1) and smart non-thinking models (eg gpt-4.1) exist. Does anyone here use these for anything?


    I use o3-mini-high in Aider, where I want a model to employ reasoning but not put up with the latency of the non-mini o1.


    o1 is a much larger model and more expensive for OpenAI to operate. Having a smaller "newer" (roughly equating newer to more capable) model means that you can match the performance of larger, older models while reducing inference and API costs.


    Is there a non-obvious reason using something like Python to solve queries requiring calculations was not used from day one with LLMs?


    Because it's not a feature of the LLM but of the product that is built around it (like ChatGPT).


    It's true that the product provides the tools, but the model still needs to be trained to use them, or it won't use them well or at the right times.
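
    For what it's worth, a minimal sketch of how a product layer wires a calculator-style tool into the model via the OpenAI Python SDK's function calling; the run_python tool name is hypothetical, and actually executing the returned code is the application's job, not the API's:

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

        # Describe a (hypothetical) tool the app promises to execute on the model's behalf.
        tools = [{
            "type": "function",
            "function": {
                "name": "run_python",
                "description": "Run a short Python snippet and return its stdout.",
                "parameters": {
                    "type": "object",
                    "properties": {"code": {"type": "string"}},
                    "required": ["code"],
                },
            },
        }]

        resp = client.chat.completions.create(
            model="gpt-4.1",  # any tool-capable model
            messages=[{"role": "user", "content": "What is 17.3% of 4281?"}],
            tools=tools,
        )
        # If the model decides a calculation is needed, it returns a tool call
        # instead of guessing at the arithmetic in plain text.
        print(resp.choices[0].message.tool_calls)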


    Not sure what the goal is with Codex CLI. It's not running a local LLM right, just a CLI to make API calls from the terminal?


    This might be their answer to Claude Code more than anything else.


    Looks more like a direct competitor to Aider.


    Are these available via the API? I'm getting back 'model_not_found' when testing.
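
    One way to check is to list the model IDs your key can actually see; new models often roll out gradually, so an ID missing from this list would explain the error. A quick sketch with the OpenAI Python SDK:

        from openai import OpenAI

        client = OpenAI()  # assumes OPENAI_API_KEY is set
        available = [m.id for m in client.models.list()]
        print("o3" in available, "o4-mini" in available)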


    If the AI is smart, why not have it choose the model for the user?


    That's what GPT-5 was supposed to be (instead of a new base or reasoning model), the last time Sam updated his plans, I thought. Did those change again?


    Underwhelming. Cancelled my subscription in favor of Gemini Pro 2.5


    What is wrong with OpenAI? The naming of their models seems like it is intentionally confusing - maybe to distract from lack of progress? Honestly, I have no idea which model to use for simply everyday tasks anymore.


    It really is bizarre. If you had asked me 2 days ago I would have said unequivocally that these models already existed. Surely, given the rate of change, a date-based numbering system would be more helpful?




    Seems to me like they're somewhat trying to simplify now.

    GPT-N.m -> Non-reasoning

    oN -> Reasoning

    oN+1-mini -> Reasoning but speedy; cut-down version of an upcoming oN model (unclear if true or marketing)

    It would be nice if they actually stick to this pattern.



    Are the oN models built on top of GPT-N.m models? It would be nice to know the lineage there.


    I tend to look at the lmarena leaderboard to see what to use (or the aider polyglot leaderboard for coding)


    OpenAI be like:

        o1, o1-mini,
        o1-pro, o3,
        o4-mini, gpt-4,
        gpt-4o, gpt-4-turbo,
        gpt-4.5, gpt-4.1,
        gpt-4o-mini, gpt-4.1-mini,
        gpt-4.1-nano, gpt-3.5-turbo










