I use AI coding tools every day. Claude Code for most of my actual work. I've tried the alternatives - Gemini, Codex, open-source models. I keep coming back. Not because of loyalty. Not because of marketing. Because the alternatives keep failing me in the same specific way.
A new model drops. It tops the benchmarks. Developers try it. Developers complain. They go back to Claude. This has happened three or four times now, and the pattern is consistent enough that it deserves an explanation.
Benchmarks Are Not Lying. But They're Not Telling You What You Think.
When a new AI model tops the coding benchmarks, the benchmarks are usually accurate. The model genuinely produces better code on isolated problems. Higher accuracy on HumanEval. Cleaner solutions on LeetCode-style tasks. The numbers are real.
Older benchmarks like HumanEval measure exactly that - you get an isolated function to write, and you're graded on whether it passes unit tests. Newer benchmarks like SWE-bench are more realistic. They give the model real GitHub issues from real repos and ask it to generate patches. That's closer to actual development.
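To make the contrast concrete, a HumanEval-style problem is roughly the size of the illustration below - one isolated function, graded purely on unit tests. This is a made-up example in that spirit, not an actual benchmark item.

```python
# Illustrative HumanEval-style task (invented, not from the benchmark):
# write one isolated function; grading is just whether the asserts pass.

def longest_common_prefix(words: list[str]) -> str:
    """Return the longest prefix shared by every string in `words`."""
    if not words:
        return ""
    prefix = words[0]
    for word in words[1:]:
        while not word.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""
    return prefix

assert longest_common_prefix(["flower", "flow", "flight"]) == "fl"
assert longest_common_prefix(["dog", "racecar", "car"]) == ""
```

SWE-bench, by contrast, hands the model an issue plus an entire repository and judges the generated patch by whether the project's own tests pass - no tidy function signature, no isolated grading harness.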
But even SWE-bench is still a controlled environment. Real coding work has more going on. You're managing a conversation with the user. You're deciding which files to read and which to skip. You're making targeted edits without breaking surrounding code. You're hitting unexpected errors and deciding whether to ask for help or try a different approach. You're staying on task across 20+ steps without drifting. That kind of sustained, interactive workflow is hard to capture in any benchmark.
The Process vs. Raw Intelligence Gap
The most useful frame I've found for understanding this: Anthropic appears to have trained Claude heavily on the process of coding, not just the output. The workflow. The sequence of decisions a competent developer actually makes when given a task in a real codebase.
To be clear - every major coding agent can read files, edit code, and run terminal commands. Codex, Antigravity, Gemini CLI - they all have these capabilities. The difference is in how consistently the model behind them executes the workflow. Reading the right files before making changes. Making targeted edits instead of rewriting entire files unnecessarily. Knowing when to act and when to stop and ask. Staying on the original task instead of getting distracted.
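Stripped to its skeleton, every one of these agents runs some version of the loop below. This is a deliberately simplified sketch with invented names (`call_model`, the action format, all of it) - not Codex's or Claude Code's actual implementation - but it shows where the model's judgment enters: which tool to call, with what arguments, and when to stop.

```python
import subprocess
from pathlib import Path

def run_agent(task: str, call_model) -> None:
    """Minimal coding-agent loop (hypothetical sketch, not any real product's code)."""
    history = [{"role": "user", "content": task}]
    while True:
        action = call_model(history)  # the model picks the next step
        if action["tool"] == "read_file":
            result = Path(action["path"]).read_text()
        elif action["tool"] == "edit_file":
            Path(action["path"]).write_text(action["content"])
            result = "ok"
        elif action["tool"] == "run_command":
            proc = subprocess.run(action["command"], shell=True,
                                  capture_output=True, text=True)
            result = proc.stdout + proc.stderr
        elif action["tool"] == "ask_user":
            result = input(action["question"])  # stop and ask instead of guessing
        else:  # "done"
            break
        history.append({"role": "tool", "content": result})
```

The loop is commodity plumbing. The argument of this post is that the differences live in the decisions the model makes inside it - which file to read, how narrow an edit to make, when to ask - and in how reliably it keeps making them over dozens of iterations.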
All these tools can do it. Claude does it more reliably. Other models produce excellent code - sometimes arguably better than Claude's on a per-snippet basis. The gap isn't in any individual output. It's in the consistency across a full task. They loop more often. They lose track of what they were doing mid-sequence. They make edits that break surrounding context. They need more steering to stay on track. Not always - but often enough that it changes how much you can trust the tool to work unsupervised.
The difference isn't raw intelligence. It's process discipline. And that's harder to train for than most people realize.
What "Good at Coding" Actually Requires
Generating correct code is maybe 40% of what an AI coding assistant needs to do well. The other 60% is everything around the code:
- Editing files without corrupting surrounding code (there's a small sketch of what that looks like after this list)
- Reading the right files before making changes
- Completing a multi-step task without losing the thread halfway through
- Communicating clearly about what it's doing and what it found
- Knowing when to ask instead of assuming
- Staying on task instead of making unrequested changes to unrelated files
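To make that first point concrete: the safe version of a file edit is anchored on the exact text being replaced, and it refuses to proceed if that anchor is missing or ambiguous. The helper below is hypothetical - real agents expose this as a structured edit tool rather than a Python function - but it captures the property that separates a targeted edit from regenerating the whole file and hoping nothing else changed.

```python
from pathlib import Path

def targeted_edit(path: str, old: str, new: str) -> None:
    """Replace one exact snippet in a file, leaving everything else untouched.

    Hypothetical helper for illustration. Requiring exactly one match is the
    safety property: if the anchor text is missing or ambiguous, fail loudly
    instead of guessing.
    """
    source = Path(path).read_text()
    matches = source.count(old)
    if matches != 1:
        raise ValueError(f"expected exactly one match in {path}, found {matches}")
    Path(path).write_text(source.replace(old, new, 1))
```

The failure mode this guards against is the one I keep hitting in practice: the model re-emits the entire file, and a section it was never asked to touch comes back subtly different.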
Every major coding agent attempts all of these. The question is how often they succeed at each one across a full task. In my experience using Claude Code daily - building API endpoints, debugging production issues, refactoring components - it hits these consistently. Not perfectly, but consistently enough that I don't feel like I need to watch every step.
With other tools, I find myself intervening more. The code they generate is often just as good. But somewhere in the middle of a multi-file task, something slips - a file gets partially overwritten, or the model goes off and starts "improving" something I didn't ask about. That's the gap. It's not about capability. It's about how often the tool stays on track without you having to course-correct.
Why Google Has a Structural Problem Here
I want to be fair. Gemini writes excellent code. The underlying model is clearly very capable. Give it a well-contained problem with a clear spec and it'll produce a good solution. Sometimes a great one.
The problem seems structural. Google is fundamentally a search and general-use company. Their models are optimized across a massive range of tasks - translation, summarization, multimodal understanding, general conversation. Agentic software development is a narrow, specific workflow that requires its own focused training.
Training for agentic workflows means the model needs to complete long sequences of tool calls successfully. Recover gracefully from errors mid-sequence. Maintain context across many steps without drifting. This takes focused reinforcement learning on exactly that scenario, not just scaling up the base model.
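Nobody outside these labs publishes the recipe, so treat the following as a conceptual sketch of what "focused reinforcement learning on exactly that scenario" might look like - every name below is invented. The idea it illustrates is that the unit being scored is the whole multi-step episode in a sandboxed repo, not an isolated completion.

```python
def score_episode(episode) -> float:
    """Illustrative reward shaping for an agentic coding rollout.

    Invented attributes, not any lab's actual pipeline - the point is that
    the reward covers the process, not just the final code.
    """
    reward = 1.0 if episode.tests_pass else 0.0       # did the task actually land?
    reward -= 0.1 * episode.repeated_steps            # looping on the same action
    reward -= 0.5 * episode.unrelated_files_touched   # drifting off the original task
    if episode.recovered_from_error:                  # hit an error mid-sequence, then recovered
        reward += 0.2
    return reward
```

A policy-gradient style update over whole trajectories scored this way is one plausible shape of "training on the process" - as opposed to supervised training on the final code alone.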
Anthropic published research on agent autonomy showing that software engineering accounts for nearly 50% of all agentic activity on their API. Half their agentic usage is coding. When that's your reality, you train for it. You optimize the tool use, the file editing, the multi-step workflows - because that's what your paying users are actually doing. Google doesn't have that same pressure. Their model serves search, translation, multimodal tasks, general chat. Coding is one use case among dozens. Anthropic's model lives or dies by how well it codes.
Where Things Actually Stand
My honest assessment, as someone who uses these tools for real work every day:
Claude is my primary tool. Claude Code handles everything from scaffolding new features to debugging tricky production issues. The workflow is reliable enough that I can trust it on tasks I don't want to babysit.
Codex has gotten meaningfully better at agentic tasks. The gap has closed more than I expected over the past few months. It's not as reliable as Claude yet, but it's worth keeping an eye on.
Gemini is capable on isolated tasks. I've had it produce genuinely impressive code for well-specified problems. As an agentic system that operates independently on multi-step tasks, it still struggles. The loops, the getting stuck, the needing constant redirection - those are real, consistent failure modes that I hit regularly.
I've seen people try the "plan in one model, execute in another" approach. Use Gemini for architectural thinking, then switch to Claude for the actual work. In practice it adds friction without adding value. You might as well just stay in Claude for the whole thing.
What This Means for the Next Few Months
The benchmark leaders will keep changing. A new model will top the leaderboard. Developers will try it. Some will switch. Most will drift back.
The gap will narrow. Google has the resources to fix the process discipline problem if they decide it's the priority. OpenAI is clearly taking agentic workflows seriously with Codex. The advantage Claude has today isn't permanent.
But what Anthropic figured out - training for the workflow, not just the output - is a meaningful insight. Other labs will have to explicitly replicate that focus to close the gap. Bigger models alone won't do it. You can have the smartest model in the world, and it won't matter if it can't edit a file without breaking the one next to it.
The benchmarks will tell you one thing. The developers who use these tools every day will tell you another. Usually, you should listen to the developers.