(comments)

Original link: https://news.ycombinator.com/item?id=43473489

The Hacker News thread discusses Google's release of Gemini 2.5, its latest AI model. Commenters report a sense of déjà vu, noting how repetitive these AI model announcements have become, always touting state-of-the-art performance and improved reasoning. Some question what the ".5" version number actually signifies, suspecting it is more a marketing move than a major breakthrough. There is concern over the lack of pricing information, which makes it hard to judge the model's practical value. Others highlight its impressive long-context benchmark results and improved coding ability. Some users are frustrated by the lack of Canvas support, and there is some comparison between the different models. One view holds that Google is working hard to catch up with the latest advances.


Original
Gemini 2.5: Our most intelligent AI model (blog.google)
75 points by meetpateltech 28 minutes ago | 21 comments

These announcements have started to look like a template.

- Our state-of-the-art model.

- Benchmarks comparing to X,Y,Z.

- "Better" reasoning.

It might be an excellent model, but reading the exact text repeatedly is taking the excitement away.



I’m sure the AI helps write the announcements.


I wonder what about this one gets the +0.5 to the name. IIRC the 2.0 model isn’t particularly old yet. Is it purely marketing, does it represent new model structure, iteratively more training data over the base 2.0, new serving infrastructure, etc?

I’ve always found the use of the *.5 naming kinda silly when it became a thing. When OpenAI released 3.5, they said they already had 4 underway at the time; they were just tweaking 3 to be better for ChatGPT. It felt like a scrappy startup name, and now it’s spread across the industry. Anthropic naming their models Sonnet 3, 3.5, 3.5 (new), 3.7 felt like the worst offender of this naming scheme.

I’m a much bigger fan of semver (not skipping to .5 though), date based (“Gemini Pro 2025”), or number + meaningful letter (eg 4o - “Omni”) for model names.



I would consider this a case of "expectation management"-based versioning. This is a release designed to keep Gemini in the news cycle, but it isn't a significant enough improvement to justify calling it Gemini 3.0.


At least for OpenAI, a .5 increment indicates a 10x increase in training compute. This so far seems to track for 3.5, 4, 4.5.
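
Taking that rule at face value (a reading of the comment, not anything OpenAI has stated), the implied relation between version number v and training compute would be roughly compute(v) ≈ compute(3.5) × 10^(2·(v − 3.5)), i.e. GPT-4 at ~10x and GPT-4.5 at ~100x the training compute of GPT-3.5.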


I think it's because of the big jump in coding benchmarks. 74% on aider is just much, much better than before and worthy of a .5 upgrade.


Agreed, can't everyone just use semantic versioning, with 0.1 increments for regular updates?


I'm most impressed by the improvement on Aider Polyglot; I wasn't expecting it to get saturated so quickly.

I'll be looking to see whether Google will be able to use this model (or an adapted version) to tackle ARC-AGI 2.



I wish they’d mention pricing - it’s hard to seriously benchmark models when you have no idea what putting it in production would actually cost.
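
As a rough illustration of why pricing matters for that kind of comparison, here is a minimal cost-estimate sketch; the function and the per-million-token prices are placeholder assumptions, not announced Gemini 2.5 rates.

```python
# Minimal sketch of the production-cost estimate the commenter wants to make.
# The prices used below are hypothetical placeholders, not published Gemini 2.5 pricing.

def monthly_cost(requests_per_day: int,
                 input_tokens: int,
                 output_tokens: int,
                 price_in_per_million: float,
                 price_out_per_million: float) -> float:
    """Estimate monthly API spend from per-request token counts and $/1M-token prices."""
    daily = requests_per_day * (
        input_tokens / 1e6 * price_in_per_million
        + output_tokens / 1e6 * price_out_per_million
    )
    return daily * 30

# Example: 50k requests/day, 2k input + 500 output tokens per request,
# at an assumed $1.25 / $5.00 per million input/output tokens.
print(f"${monthly_cost(50_000, 2_000, 500, 1.25, 5.00):,.0f} per month")  # -> $7,500 per month
```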


The Long Context benchmark numbers seem super impressive. 91% vs 49% for GPT 4.5 at 128k context length.


Google has the upper hand here because they are not dependent on Nvidia for hardware. They make and use their own AI accelerators.


Isn't every new AI model the "most [something]"?

Nobody is going to say "Announcing Foobar 7.1 - not our best!"



GPT-4.5's announcement was the equivalent of that.

"It beats all the benchmarks...but you really really don't want to use it."



They even priced it so people would avoid using it. GPT-4.5's entire function was to keep OpenAI anchored in the news and keep up the perception of releasing quickly.


Except for GPT 4.5 and Claude 3.7 :/


gobble 2.0 - a bit of a turkey


> This will mark the first experimental model with higher rate limits + billing. Excited for this to land and for folks to really put the model through the paces!

From https://x.com/OfficialLoganK/status/1904583353954882046

The low rate-limit really hampered my usage of 2.0 Pro and the like. Interesting to see how this plays out.
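
On working around low rate limits in practice, a minimal retry-with-exponential-backoff sketch is below; call_model and RateLimitError are stand-ins for whatever client and 429 signal you actually use, not a specific Gemini SDK API.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for whatever exception your client raises on HTTP 429."""

def call_with_backoff(call_model, prompt, max_retries=5, base_delay=1.0):
    """Retry a rate-limited call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call_model(prompt)
        except RateLimitError:
            # Sleep 1s, 2s, 4s, ... plus up to 0.5s of jitter, then retry.
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    raise RuntimeError("still rate-limited after retries")
```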



Slight tangent: Interesting that they use o3-mini as the comparison rather than o1.

I've been using o1 almost exclusively for the past couple months and have been impressed to the point where I don't feel the need to "upgrade" for a better model.

Are there benchmarks showing o3-mini performing better than o1?



I noticed this too. I have used both o1 and o3-mini extensively, and I have run many tests on my own problems: o1 solves one of my hardest prompts quite reliably, but o3-mini is very inconsistent. So from my anecdotal experience, o1 is the superior model in terms of capability.

The fact that they would exclude it from their benchmarks seems biased/desperate and makes me trust them less. They probably thought it was clever to leave o1 out, something like "o3-mini is the newest model, let's just compare against that", but I think for anyone paying attention that decision will backfire.



I find o3-mini at least faster at getting to the response I care about, anecdotally.


Why not enable Canvas for this model on Gemini.google.com? Arguably the weakest link of Canvas is the terrible code that Gemini 2.0 Flash writes for Canvas to run.