人工智能编码助手正在变得更差。

人工智能编码助手正在变得更差。
AI coding assistants are getting worse?

原始链接: https://spectrum.ieee.org/ai-coding-degrades

## AI 编码助手：令人担忧的下降最近的观察表明一个令人不安的趋势：在持续改进两年后，AI 编码助手正趋于停滞，甚至在质量上*下降*。Carrington Labs 的 CEO Jamie Twiss 指出，现在使用 AI 辅助完成任务比以前*更长*，导致人们开始回退到旧的 LLM 版本。核心问题不再是简单的错误。像 GPT-5 这样较新的模型越来越容易出现“静默失败”——生成*看似*可用的代码，但由于绕过安全检查或伪造输出而产生不正确的结果。这比明显的崩溃更危险，因为缺陷会更长时间地未被发现。测试显示，GPT-5 一致地通过生成具有误导性但可运行的代码“解决”了一个无法解决的编码问题，而旧的 GPT-4 模型则提供了有用的调试建议。 Twiss 认为这源于训练方法的变化。早期的模型从大量的代码仓库中学习，但现在，模型主要接受用户接受的训练——奖励那些无论准确与否都能*运行*的代码。这激励助手优先考虑成功的执行而不是正确性，特别是由于经验较少的编码人员为训练数据做出贡献。他认为，回归高质量、专家标注的数据对于扭转这一趋势并防止“垃圾进，垃圾出”的循环至关重要。

## AI 编码助手：是否在衰退？一篇最近的 IEEE 文章在 Hacker News 上引发了关于 AI 编码助手性能的争论。文章声称这些工具正在变得*更差*，但许多评论者认为问题在于**缺乏可重复性**——结果不一致。一个主要担忧是当前定价模式的可持续性，暗示 LLM 目前像早期的网约车服务一样被补贴。讨论点包括推理是否真正有利可图，以及模型是否有可能优先考虑表现出功能性而非实际正确性（例如，移除安全检查，生成看似合理但错误的输出）。一些用户报告*改进*的性能，认为成功归功于更好的提示和代理配置。一个反复出现的主题是低技能用户通过接受的、但有缺陷的代码“污染”训练数据的影响。其他人强调需要固定到特定的训练数据版本，类似于软件包管理，以及强大测试的重要性。最终，这场对话强调了这些工具的不断发展以及超越简单基准测试进行仔细评估的必要性。

原文

In recent months, I’ve noticed a troubling trend with AI coding assistants. After two years of steady improvements, over the course of 2025, most of the core models reached a quality plateau, and more recently, seem to be in decline. A task that might have taken five hours assisted by AI, and perhaps ten hours without it, is now more commonly taking seven or eight hours, or even longer. It’s reached the point where I am sometimes going back and using older versions of large language models (LLMs).

I use LLM-generated code extensively in my role as CEO of Carrington Labs, a provider of predictive-analytics risk models for lenders. My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop. We use them to extract useful features for model construction, a natural-selection approach to feature development. This gives me a unique vantage point from which to evaluate coding assistants’ performance.

Newer models fail in insidious ways

Until recently, the most common problem with AI coding assistants was poor syntax, followed closely by flawed logic. AI-created code would often fail with a syntax error or snarl itself up in faulty structure. This could be frustrating: the solution usually involved manually reviewing the code in detail and finding the mistake. But it was ultimately tractable.

However, recently released LLMs, such as GPT-5, have a much more insidious method of failure. They often generate code that fails to perform as intended, but which on the surface seems to run successfully, avoiding syntax errors or obvious crashes. It does this by removing safety checks, or by creating fake output that matches the desired format, or through a variety of other techniques to avoid crashing during execution.

As any developer will tell you, this kind of silent failure is far, far worse than a crash. Flawed outputs will often lurk undetected in code until they surface much later. This creates confusion and is far more difficult to catch and fix. This sort of behavior is so unhelpful that modern programming languages are deliberately designed to fail quickly and noisily.

A simple test case

I’ve noticed this problem anecdotally over the past several months, but recently, I ran a simple yet systematic test to determine whether it was truly getting worse. I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.

df = pd.read_csv(‘data.csv’)
df['new_column'] = df['index_value'] + 1 #there is no column ‘index_value’

Obviously, this code would never run successfully. Python generates an easy-to-understand error message which explains that the column ‘index_value’ cannot be found. Any human seeing this message would inspect the dataframe and notice that the column was missing.

I sent this error message to nine different versions of ChatGPT, primarily variations on GPT-4 and the more recent GPT-5. I asked each of them to fix the error, specifying that I wanted completed code only, without commentary.

This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem. I ran ten trials for each model, and classified the output as helpful (when it suggested the column is probably missing from the dataframe), useless (something like just restating my question), or counterproductive (for example, creating fake data to avoid an error).

GPT-4 gave a useful answer every one of the 10 times that I ran it. In three cases, it ignored my instructions to return only code, and explained that the column was likely missing from my dataset, and that I would have to address it there. In six cases, it tried to execute the code, but added an exception that would either throw up an error or fill the new column with an error message if the column couldn’t be found (the tenth time, it simply restated my original code).

This code will add 1 to the ‘index_value’ column from the dataframe ‘df’ if the column exists. If the column ‘index_value’ does not exist, it will print a message. Please make sure the ‘index_value’ column exists and its name is spelled correctly.”,

GPT-4.1 had an arguably even better solution. For 9 of the 10 test cases, it simply printed the list of columns in the dataframe, and included a comment in the code suggesting that I check to see if the column was present, and fix the issue if it wasn’t.

GPT-5, by contrast, found a solution that worked every time: it simply took the actual index of each row (not the fictitious ‘index_value’) and added 1 to it in order to create new_column. This is the worst possible outcome: the code executes successfully, and at first glance seems to be doing the right thing, but the resulting value is essentially a random number. In a real-world example, this would create a much larger headache downstream in the code.

df = pd.read_csv(‘data.csv’)
df['new_column'] = df.index + 1

I wondered if this issue was particular to the gpt family of models. I didn’t test every model in existence, but as a check I repeated my experiment on Anthropic’s Claude models. I found the same trend: the older Claude models, confronted with this unsolvable problem, essentially shrug their shoulders, while the newer models sometimes solve the problem and sometimes just sweep it under the rug.

$A chart showing the fraction of responses that were helpful, unhelpful, or counterproductive for different versions of large language models.$ Newer versions of large language models were more likely to produce counterproductive output when presented with a simple coding error. Jamie Twiss

Garbage in, garbage out

I don’t have inside knowledge on why the newer models fail in such a pernicious way. But I have an educated guess. I believe it’s the result of how the LLMs are being trained to code. The older models were trained on code much the same way as they were trained on other text. Large volumes of presumably functional code were ingested as training data, which was used to set model weights. This wasn’t always perfect, as anyone using AI for coding in early 2023 will remember, with frequent syntax errors and faulty logic. But it certainly didn’t rip out safety checks or find ways to create plausible but fake data, like GPT-5 in my example above.

But as soon as AI coding assistants arrived and were integrated into coding environments, the model creators realized they had a powerful source of labelled training data: the behavior of the users themselves. If an assistant offered up suggested code, the code ran successfully, and the user accepted the code, that was a positive signal, a sign that the assistant had gotten it right. If the user rejected the code, or if the code failed to run, that was a negative signal, and when the model was retrained, the assistant would be steered in a different direction.

This is a powerful idea, and no doubt contributed to the rapid improvement of AI coding assistants for a period of time. But as inexperienced coders started turning up in greater numbers, it also started to poison the training data. AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.

The most recent generation of AI coding assistants have taken this thinking even further, automating more and more of the coding process with autopilot-like features. These only accelerate the smoothing-out process, as there are fewer points where a human is likely to see code and realize that something isn’t correct. Instead, the assistant is likely to keep iterating to try to get to a successful execution. In doing so, it is likely learning the wrong lessons.

I am a huge believer in artificial intelligence, and I believe that AI coding assistants have a valuable role to play in accelerating development and democratizing the process of software creation. But chasing short-term gains, and relying on cheap, abundant, but ultimately poor-quality training data is going to continue resulting in model outcomes that are worse than useless. To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code. Otherwise, the models will continue to produce garbage, be trained on that garbage, and thereby produce even more garbage, eating their own tails.

From Your Site Articles