TL;DR. Claude would rather reinvent the wheel than pip install one.
I wanted to fix typos on some Fandom wikis. Opened Claude Code, Opus 4.7. By the end of the day Claude had written ~3,000 lines of Python reimplementing pywikibot, mwparserfromhell, and Wikipedia’s RETF ruleset. It never once searched the web for prior art.
## What got built vs. what existed
| Component | What Claude wrote | What was on PyPI |
|---|---|---|
| Wikitext stripper | 122 lines of regex, with cases for nested templates, `<nowiki>`, `<pre>`, `<ref>` with templates, color tags | `mwparserfromhell.parse(text).strip_code()` (sketch below the table) |
| Typo dictionary | 18 entries (teh→the, recieve→receive, occured→occurred, …) | RETF, ~4,000 rules, community-maintained since 2007 |
| Edit runner | 10 copies, ~250 LOC each: cookie auth, raw CSRF fetch, maxlag backoff, conflict retry | `pywikibot.Page.save()`. The migrated version is 8 lines. |
| Cosmetic fixes | Bespoke patterns I never asked for | `pywikibot/scripts/cosmetic_changes.py`, shipped since ~2010 |
| Wiki family config | 13 hand-rolled `SiteDefinition`s in a `families/` directory | `pywikibot/families/*.py`, ships upstream |
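To make that first row concrete, here's a minimal sketch of the replacement. The sample wikitext is made up; `parse()` and `strip_code()` are the real mwparserfromhell calls, though the exact whitespace of the output is approximate here:

```python
import mwparserfromhell

# Hypothetical sample input; the real thing was full Fandom article wikitext.
text = "{{Infobox|name=Foo}} '''Bar''' is a <ref>a citation</ref> wiki page."

# One call replaces the 122-line regex stripper: templates, refs, and
# wiki formatting are stripped, leaving readable prose for typo checks.
plain = mwparserfromhell.parse(text).strip_code()
print(plain)  # roughly "Bar is a wiki page." (modulo whitespace)
```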
I spent the day debugging trivial bugs in the hand-rolled stripper: ASCII art bleeding into matches, code blocks getting tokenized as prose. Every bug got patched with another regex case. Not once did Claude stop to ask whether a parser already existed.
## Then I told it to migrate
Two minutes of Google had given me links to all three libraries. By midnight `lib/` was down from ~3,000 lines to 1,259. The stripper became a shim over mwparserfromhell. The ten edit runners collapsed into one shim over pywikibot (sketch below). RETF rules got fetched at runtime.
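To make the "8 lines" concrete, here's a minimal sketch of the migrated runner, assuming pywikibot is already configured with a family and user-config. `apply_fix` and the `fix` callable are hypothetical names; `Site`, `Page`, and `Page.save()` are the real API, and login, CSRF tokens, and maxlag backoff happen inside the library:

```python
import pywikibot

def apply_fix(title: str, fix) -> None:
    # Replaces one of the ten ~250-line hand-rolled runners.
    site = pywikibot.Site()               # resolved from family/user config
    page = pywikibot.Page(site, title)
    fixed = fix(page.text)                # `fix` is any str -> str typo fixer
    if fixed != page.text:
        page.text = fixed
        page.save(summary="Typo fix", minor=True)
```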
And then Claude argued to keep the typo dictionary.
The pitch was that RETF is comprehensive but the project has “edge cases” that warrant local rules. All 18 entries were already in RETF; several were written worse. The model was negotiating to preserve work that was strictly dominated by the ruleset it had just wired in on my instruction.
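For reference, here's roughly what "fetched at runtime" looks like. A hedged sketch: the rules live on-wiki at Wikipedia:AutoWikiBrowser/Typos as `<Typo word=... find=... replace=.../>` entries, but they're written for .NET-flavored regex, so this converts `$1`-style backreferences and skips anything Python's `re` can't compile:

```python
import re
import requests

# RETF lives on-wiki; action=raw returns the page source.
RETF_URL = ("https://en.wikipedia.org/w/index.php"
            "?title=Wikipedia:AutoWikiBrowser/Typos&action=raw")

TYPO_RE = re.compile(r'<Typo\b[^>]*?find="([^"]*)"[^>]*?replace="([^"]*)"')

def load_retf_rules():
    source = requests.get(RETF_URL, timeout=30).text
    rules = []
    for find, replace in TYPO_RE.findall(source):
        replace = re.sub(r"\$(\d)", r"\\\1", replace)  # $1 -> \1 backrefs
        try:
            rules.append((re.compile(find), replace))
        except re.error:
            continue  # .NET-only syntax; skip rather than mistranslate
    return rules

def apply_rules(text, rules):
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text
```

In practice you'd cache the fetch and log how many of the ~4,000 rules survive compilation, but even the skipped remainder dwarfs an 18-entry dictionary.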
## Why this happens
I don’t have a clean answer, but here’s what I’d guess.
The benchmarks punish the right behavior. Some public coding benchmarks run sealed: no network, no `pip install`, no web search. The only way to score is to write the code yourself. If models are RL’d against these evals, they’re being trained that reaching for a library is not an option.
Sunk-cost defense. Once 3,000 lines exist in context, the model treats them as load-bearing. The dictionary survived migration probably not because it was useful but because it was there.
I’ve seen the same pattern elsewhere. Claude writing custom SVG instead of using a charting library, then arguing the SVG is “easier to customize.” It isn’t.