I use AI coding tools every day. Claude Code for most of my actual work. I've tried the alternatives - Gemini, Codex, open-source models. I keep coming back. Not because of loyalty. Not because of marketing. Because the alternatives keep failing me in the same specific way.
A new model drops. It tops the benchmarks. Developers try it. Developers complain. They go back to Claude. This has happened three or four times now, and the pattern is consistent enough that it deserves an explanation.
Benchmarks Are Not Lying. But They're Not Telling You What You Think.
When a new AI model tops the coding benchmarks, the benchmarks are usually accurate. The model genuinely produces better code on isolated problems. Higher accuracy on HumanEval. Cleaner solutions on LeetCode-style tasks. The numbers are real.
Older benchmarks like HumanEval measure exactly that - you get an isolated function to write, and you're graded on whether it passes unit tests. Newer benchmarks like SWE-bench are more realistic. They give the model real GitHub issues from real repos and ask it to generate patches. That's closer to actual development.
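To make the contrast concrete, a HumanEval-style problem is roughly the size of the illustration below - one isolated function, graded purely on unit tests. This is a made-up example in that spirit, not an actual benchmark item.

```python
# Illustrative HumanEval-style task (invented, not from the benchmark):
# write one isolated function; grading is just whether the asserts pass.

def longest_common_prefix(words: list[str]) -> str:
    """Return the longest prefix shared by every string in `words`."""
    if not words:
        return ""
    prefix = words[0]
    for word in words[1:]:
        while not word.startswith(prefix):
            prefix = prefix[:-1]
            if not prefix:
                return ""
    return prefix

assert longest_common_prefix(["flower", "flow", "flight"]) == "fl"
assert longest_common_prefix(["dog", "racecar", "car"]) == ""
```

SWE-bench, by contrast, hands the model an issue plus an entire repository and judges the generated patch by whether the project's own tests pass - no tidy function signature, no isolated grading harness.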
But even SWE-bench is still a controlled environment. Real coding work has more going on. You're managing a conversation with the user. You're deciding which files to read and which to skip. You're making targeted edits without breaking surrounding code. You're hitting unexpected errors and deciding whether to ask for help or try a different approach. You're staying on task across 20+ steps without drifting. That kind of sustained, interactive workflow is hard to capture in any benchmark.
The Process vs. Raw Intelligence Gap
The most useful frame I've found for understanding this: Anthropic appears to have trained Claude heavily on the process of coding, not just the output. The workflow. The sequence of decisions a competent developer actually makes when given a task in a real codebase.
To be clear - every major coding agent can read files, edit code, and run terminal commands. Codex, Antigravity, Gemini CLI - they all have these capabilities. The difference is in how consistently the model behind them executes the workflow. Reading the right files before making changes. Making targeted edits instead of rewriting entire files unnecessarily. Knowing when to act and when to stop and ask. Staying on the original task instead of getting distracted.
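Stripped to its skeleton, every one of these agents runs some version of the loop below. This is a deliberately simplified sketch with invented names (`call_model`, the action format, all of it) - not Codex's or Claude Code's actual implementation - but it shows where the model's judgment enters: which tool to call, with what arguments, and when to stop.

```python
import subprocess
from pathlib import Path

def run_agent(task: str, call_model) -> None:
    """Minimal coding-agent loop (hypothetical sketch, not any real product's code)."""
    history = [{"role": "user", "content": task}]
    while True:
        action = call_model(history)  # the model picks the next step
        if action["tool"] == "read_file":
            result = Path(action["path"]).read_text()
        elif action["tool"] == "edit_file":
            Path(action["path"]).write_text(action["content"])
            result = "ok"
        elif action["tool"] == "run_command":
            proc = subprocess.run(action["command"], shell=True,
                                  capture_output=True, text=True)
            result = proc.stdout + proc.stderr
        elif action["tool"] == "ask_user":
            result = input(action["question"])  # stop and ask instead of guessing
        else:  # "done"
            break
        history.append({"role": "tool", "content": result})
```

The loop is commodity plumbing. The argument of this post is that the differences live in the decisions the model makes inside it - which file to read, how narrow an edit to make, when to ask - and in how reliably it keeps making them over dozens of iterations.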
All these tools can do it. Claude does it more reliably. Other models produce excellent code - sometimes arguably better than Claude's on a per-snippet basis. The gap isn't in any individual output. It's in the consistency across a full task. They loop more often. They lose track of what they were doing mid-sequence. They make edits that break surrounding context. They need more steering to stay on track. Not always - but often enough that it changes how much you can trust the tool to work unsupervised.
The difference isn't raw intelligence. It's process discipline. And that's harder to train for than most people realize.
What "Good at Coding" Actually Requires
Generating correct code is maybe 40% of what an AI coding assistant needs to do well. The other 60% is everything around the code:
- Editing files without corrupting surrounding code (there's a small sketch of what that looks like after this list)
- Reading the right files before making changes
- Completing a multi-step task without losing the thread halfway through
- Communicating clearly about what it's doing and what it found
- Knowing when to ask instead of assuming
- Staying on task instead of making unrequested changes to unrelated files
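To make that first point concrete: the safe version of a file edit is anchored on the exact text being replaced, and it refuses to proceed if that anchor is missing or ambiguous. The helper below is hypothetical - real agents expose this as a structured edit tool rather than a Python function - but it captures the property that separates a targeted edit from regenerating the whole file and hoping nothing else changed.

```python
from pathlib import Path

def targeted_edit(path: str, old: str, new: str) -> None:
    """Replace one exact snippet in a file, leaving everything else untouched.

    Hypothetical helper for illustration. Requiring exactly one match is the
    safety property: if the anchor text is missing or ambiguous, fail loudly
    instead of guessing.
    """
    source = Path(path).read_text()
    matches = source.count(old)
    if matches != 1:
        raise ValueError(f"expected exactly one match in {path}, found {matches}")
    Path(path).write_text(source.replace(old, new, 1))
```

The failure mode this guards against is the one I keep hitting in practice: the model re-emits the entire file, and a section it was never asked to touch comes back subtly different.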
Every major coding agent attempts all of these. The question is how often they succeed at each one across a full task. In my experience using Claude Code daily - building API endpoints, debugging production issues, refactoring components - it hits these consistently. Not perfectly, but consistently enough that I don't feel like I need to watch every step.
With other tools, I find myself intervening more. The code they generate is often just as good. But somewhere in the middle of a multi-file task, something slips - a file gets partially overwritten, or the model goes off and starts "improving" something I didn't ask about. That's the gap. It's not about capability. It's about how often the tool stays on track without you having to course-correct.
Why Google Has a Structural Problem Here
I want to be fair. Gemini writes excellent code. The underlying model is clearly very capable. Give it a well-contained problem with a clear spec and it'll produce a good solution. Sometimes a great one.
The problem seems structural. Google is fundamentally a search and general-use company. Their models are optimized across a massive range of tasks - translation, summarization, multimodal understanding, general conversation. Agentic software development is a narrow, specific workflow that requires its own focused training.
Training for agentic workflows means the model needs to complete long sequences of tool calls successfully. Recover gracefully from errors mid-sequence. Maintain context across many steps without drifting. This takes focused reinforcement learning on exactly that scenario, not just scaling up the base model.
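Nobody outside these labs publishes the recipe, so treat the following as a conceptual sketch of what "focused reinforcement learning on exactly that scenario" might look like - every name below is invented. The idea it illustrates is that the unit being scored is the whole multi-step episode in a sandboxed repo, not an isolated completion.

```python
def score_episode(episode) -> float:
    """Illustrative reward shaping for an agentic coding rollout.

    Invented attributes, not any lab's actual pipeline - the point is that
    the reward covers the process, not just the final code.
    """
    reward = 1.0 if episode.tests_pass else 0.0       # did the task actually land?
    reward -= 0.1 * episode.repeated_steps            # looping on the same action
    reward -= 0.5 * episode.unrelated_files_touched   # drifting off the original task
    if episode.recovered_from_error:                  # hit an error mid-sequence, then recovered
        reward += 0.2
    return reward
```

A policy-gradient style update over whole trajectories scored this way is one plausible shape of "training on the process" - as opposed to supervised training on the final code alone.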
Anthropic published research on agent autonomy showing that software engineering accounts for nearly 50% of all agentic activity on their API. Half their agentic usage is coding. When that's your reality, you train for it. You optimize the tool use, the file editing, the multi-step workflows - because that's what your paying users are actually doing. Google doesn't have that same pressure. Their model serves search, translation, multimodal tasks, general chat. Coding is one use case among dozens. Anthropic's model lives or dies by how well it codes.
Where Things Actually Stand
My honest assessment, as someone who uses these tools for real work every day:
Claude is my primary tool. Claude Code handles everything from scaffolding new features to debugging tricky production issues. The workflow is reliable enough that I can trust it on tasks I don't want to babysit.
Codex has gotten meaningfully better at agentic tasks. The gap has closed more than I expected over the past few months. It's not as reliable as Claude yet, but it's worth keeping an eye on.
Gemini is capable on isolated tasks. I've had it produce genuinely impressive code for well-specified problems. As an agentic system that operates independently on multi-step tasks, it still struggles. The loops, the getting stuck, the needing constant redirection - those are real, consistent failure modes that I hit regularly.
I've seen people try the "plan in one model, execute in another" approach. Use Gemini for architectural thinking, then switch to Claude for the actual work. In practice it adds friction without adding value. You might as well just stay in Claude for the whole thing.
What This Means for the Next Few Months
The benchmark leaders will keep changing. A new model will top the leaderboard. Developers will try it. Some will switch. Most will drift back.
The gap will narrow. Google has the resources to fix the process discipline problem if they decide it's the priority. OpenAI is clearly taking agentic workflows seriously with Codex. The advantage Claude has today isn't permanent.
But what Anthropic figured out - training for the workflow, not just the output - is a meaningful insight. Other labs will have to explicitly replicate that focus to close the gap. Bigger models alone won't do it. You can have the smartest model in the world, and it won't matter if it can't edit a file without breaking the one next to it.
The benchmarks will tell you one thing. The developers who use these tools every day will tell you another. Usually, you should listen to the developers.