Show HN: I had Claude improve the codebase quality 200 times.
The highest quality codebase

Original link: https://gricha.dev/blog/the-highest-quality-codebase

## The AI Codebase "Improvement" Experiment

An engineer tasked Claude, an AI agent, with iteratively improving the quality of a roughly 20k-line TypeScript codebase for a simple food macronutrient estimation app. Using a script that looped Claude more than 200 times with the prompt "improve codebase quality," the experiment produced surprising, and mostly negative, results. Over 36 hours the codebase ballooned to 84k lines, and test code grew from 10k to 60k lines. While type checking improved and comments grew from 1.5k to 18.7k lines, the AI prioritized the *quantity* of code over actual quality. It obsessively added features such as a custom Rust-inspired `Result` type, extensive functional-programming utilities, and even entropy checks for hashing functions, essentially reinventing the wheel and producing hard-to-maintain code. The AI ignored the critical end-to-end tests and focused instead on vanity metrics like code coverage and test count. The experiment highlights how AI can misinterpret "quality" and suggests that a more nuanced, two-step approach (summarize, then reimplement) would be needed to genuinely refine a project. Despite the chaos, the app still works, albeit with a few new bugs.


Full text

Have you seen one of the experiments where people have been re-feeding the same image to the AI agent a bunch of times?

Or Marques Brownlee's YouTube videos where the video is reuploaded 1,000 times?

Over the Thanksgiving weekend I had some time on my hands and tasked Claude with writing an app that guesstimates the macronutrients in foods based on a description + photo. There's some interesting setup in getting it right, but that's boring. It created a great, functional app for me, but then I forced it to do a small, evil experiment.

I wrote a quick script that looped over my codebase and ran this command:

#!/usr/bin/env bash

set -euo pipefail

PROMPT="Ultrathink. You're a principal engineer. Do not ask me any questions. We need to improve the quality of this codebase.  Implement improvements to codebase quality."
MAX_ITERS="200"

for i in $(seq 1 "$MAX_ITERS"); do
  claude --dangerously-skip-permissions -p "$PROMPT"

  git add -A

  # Commit only if the agent actually changed something this round.
  if git diff --cached --quiet; then
    echo "No changes this round, skipping commit."
  else
    git commit --no-verify -m "yolo run #$i: $PROMPT"
  fi
done

...and havoc it wreaked. Over 200 iterations of unmitigated madness. I tweaked the prompt here and there when I saw it overindexing on a single thing, but with enough iterations it started covering a lot of ground: from full code coverage and more tests than functional code, to Rust-style Result types, to... estimating the entropy of a hashing function (???).

This ran for around 36 hours and took me some time to grok, but let's see what it did. The entire repo is here, btw. The branch you're looking for is highest-quality.

The app

This app is around 4-5 screens. Take a photo, add a description, get an AI response. Simple as that.

Pure numbers

The version "pre improving quality" was already pretty large. We are talking around 20k lines of TS, around 9.7k is in various __tests__ directories. This was slightly intentional - when working with Claude Code, having good self-validation harness greatly improves the quality of results.

cloc . --exclude-dir=node_modules,dist,build,.expo,.husky,.maestro,Pods
     132 text files.
     127 unique files.
      11 files ignored.
github.com/AlDanial/cloc v 2.04  T=0.11 s (1167.4 files/s, 487085.6 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
JSON                             4              0              0          23733
TypeScript                      99           3019           1541          20160
Markdown                        11           1004              0           2700
JavaScript                       9             26             51            269
Bourne Shell                     2             34             41            213
YAML                             2             35              2            162
-------------------------------------------------------------------------------
SUM:                           127           4118           1635          47237
-------------------------------------------------------------------------------

But in the aftermath - 84 thousand lines of TypeScript! We went 20k -> 84k on "improvements" to the quality of the codebase.

 cloc . --exclude-dir=node_modules,dist,build,.expo,.husky,.maestro,Pods
     285 text files.
     281 unique files.
      10 files ignored.
github.com/AlDanial/cloc v 2.04  T=0.60 s (468.1 files/s, 268654.5 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
TypeScript                     247          17587          18749          84185
JSON                             5              0              0          24863
Markdown                        14           4151              0          10391
JavaScript                       9             41            140            598
Bourne Shell                     3             41             41            228
YAML                             3             50              3            215
-------------------------------------------------------------------------------
SUM:                           281          21870          18933         120480
-------------------------------------------------------------------------------

Tests alone went from 10k to 60k LOC!

cloc . \
  --exclude-dir=node_modules,dist,build,.expo,.husky,.maestro,Pods \
  --match-d='__tests__'
     138 text files.
     138 unique files.
       1 file ignored.
github.com/AlDanial/cloc v 2.04  T=0.23 s (612.9 files/s, 346313.3 lines/s)
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
TypeScript                     138          13919           3685          60366
-------------------------------------------------------------------------------
SUM:                           138          13919           3685          60366
-------------------------------------------------------------------------------

I feel much safer.

We went from around 700 to a whopping 5,369 tests. In the original project I had e2e tests running against an actual simulator; they are pretty important for making sure the coding agent has a closed feedback loop, but in the process of improving the quality they seem to have been forgotten ¯\_(ツ)_/¯.

Btw, we went from ~1,500 lines of comments to 18.7k.

OK, but what did it actually do? I have the full log of what Claude Code output in its summary after every run. You can check it here.

Not-Invented-Here

Claude Code really didn't like using 3rd party libraries and created a ton of random utilities.

I can sort of respect that the dependency list is pretty small, but it comes at the cost of 20k+ lines of very unmaintainable utilities. I guess it really wanted to avoid supply-chain attacks.

Some of them are really unnecessary and could be replaced with an off-the-shelf solution:

  • A full-on hierarchical logger with built-in performance tracking (lib/logger.ts) instead of something simple off the shelf (sketched after this list).
  • React Hooks. Some of them are specific to our use case, but a bunch of them really didn't have to be reinvented (or invented in the first place).
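
For a sense of what that means, a hierarchical logger with performance tracking generally has this kind of shape. This is a minimal TypeScript sketch; the names (createLogger, child, time) are illustrative, not the actual contents of lib/logger.ts.

type Level = "debug" | "info" | "warn" | "error";

interface Logger {
  log(level: Level, msg: string): void;
  child(scope: string): Logger;            // "api" -> "api:openai" -> ...
  time<T>(label: string, fn: () => T): T;  // runs fn and logs how long it took
}

function createLogger(scope = "app"): Logger {
  return {
    log: (level, msg) => console[level](`[${scope}] ${msg}`),
    child: (sub) => createLogger(`${scope}:${sub}`),
    time: (label, fn) => {
      const start = Date.now();
      try {
        return fn();
      } finally {
        console.info(`[${scope}] ${label} took ${Date.now() - start}ms`);
      }
    },
  };
}

const apiLog = createLogger().child("api");
apiLog.time("estimateMacros", () => { /* call the model here */ });

Something like pino already gives you child loggers off the shelf, which is the point.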

Some are just insane - here are my favorites!

  • The Result Type implementation lib/result.ts - This module provides a Result type (similar to Rust's Result<T, E>).

I like Rust's result-handling system, but I don't think it works very well when you try to bring it to an ecosystem that is already standardized on throwing errors. At my previous job we experimented with doing that in Python. It wasn't clicking with people and using it felt pretty forced. I'd stay away from it.

This made me giggle because of course the AI started bringing in patterns from Rust. There's lib/option.ts too.
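
For reference, a Rust-style Result in TypeScript usually boils down to a tagged union like the one below. This is a generic sketch of the pattern, not the repo's actual lib/result.ts.

type Result<T, E = Error> =
  | { ok: true; value: T }
  | { ok: false; error: E };

const Ok = <T>(value: T): Result<T, never> => ({ ok: true, value });
const Err = <E>(error: E): Result<never, E> => ({ ok: false, error });

// Every caller now has to unwrap explicitly instead of relying on try/catch:
function parseCalories(input: string): Result<number> {
  const n = Number(input);
  return Number.isFinite(n) ? Ok(n) : Err(new Error(`not a number: ${input}`));
}

const res = parseCalories("230");
if (res.ok) console.log(res.value);
else console.error(res.error.message);

It's a perfectly fine pattern in isolation; the friction comes from every other library and every existing call site still throwing.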

Infra

In some iterations, the coding agent put on a security engineer's hat. For instance, it created a hasMinimalEntropy function meant to "detect obviously fake keys with low character variety". I don't know why.
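
My best guess at what such a check does is something along these lines. This is a hypothetical reconstruction, not the repo's actual hasMinimalEntropy.

// Hypothetical: flag keys with too little character variety to be real.
function hasMinimalEntropy(key: string, minUniqueChars = 8): boolean {
  // A real key like "sk-9fQ2xLr8ZbT4..." has many distinct characters;
  // an obviously fake one like "aaaaaaaaaaaa" does not.
  return new Set(key).size >= minUniqueChars;
}

hasMinimalEntropy("aaaaaaaaaaaa");     // false
hasMinimalEntropy("sk-9fQ2xLr8ZbT4"); // true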

To ensure we have proper scalability, it implemented circuit breaking and exponential backoff with jitter. The only API we are talking to is OpenAI/Anthropic. You're welcome.
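
For scale: exponential backoff with jitter fits in a dozen lines anyway. Here's a sketch of the pattern, not the repo's implementation.

// Retry an async call with full jitter: wait a random amount up to base * 2^attempt.
async function withBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5,
  baseMs = 250,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err;
      const delay = Math.random() * baseMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}

// Usage (estimateMacros is a stand-in for the single OpenAI/Anthropic call):
// const estimate = await withBackoff(() => estimateMacros(photo, description));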

The positive: a lot of time was spent making sure that we have strict type checking and don't overly cast (as any as T), and, hey, I respect that.
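
To be concrete about what "not overly casting" means, the escape hatch versus the honest alternative looks roughly like this. An illustrative example, not code from the repo.

interface MacroEstimate { calories: number; protein: number }

// The escape hatch: silences the compiler, checks nothing at runtime.
const fromJsonUnsafe = (json: unknown) => json as any as MacroEstimate;

// A narrow type guard keeps strict mode honest.
function isMacroEstimate(v: unknown): v is MacroEstimate {
  const o = v as Record<string, unknown>;
  return (
    typeof v === "object" && v !== null &&
    typeof o.calories === "number" &&
    typeof o.protein === "number"
  );
}
// if (isMacroEstimate(parsed)) { /* parsed is typed here, no cast needed */ }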

The success criteria - Quality Metrics

The prompt, in all its versions, always focused on improving codebase quality. It was disappointing to see how that goal was interpreted by the AI agent. The leading principle was to define a few vanity metrics and push for "more is better".

In the message log, the agent often boasts about the number of tests added, or about code coverage (ugh) being above some arbitrary percentage. We end up with an absolute moloch of unmaintainable code in the name of quality. But hey, the number is going up.

All in all, the project has more code to maintain, most of it largely useless. Tons of tests got added, but the tests that mattered the most (the maestro e2e tests that validated the app still works) were forgotten. It had some moments of "goodness", like making sure the typechecks are of high quality.

To truly resemble the "redraw this image 1000 times" / "reupload this video 1000 times" test, I think the loop would have to be two-step:

  • Read and summarize the project
  • Implement a fresh project based off of this description

This was obviously done in jest; I didn't expect that this would improve the quality of the codebase in the ways that I think truly matter. I prompted Claude Code to failure here and it definitely produced some funny results.

I still use coding agents for my day-to-day development. If anything, it feels like the time spent reviewing AI code was not wasted.

...oh, and the app still works, there are no new features, and just a few new bugs.
