How I wrote JustHTML, a Python-based HTML5 parser, using coding agents

Original link: https://friendlybit.com/python/writing-justhtml-with-coding-agents/

## JustHTML: Building an HTML5 Parser with AI Assistance

JustHTML is a new, dependency-free Python HTML5 parser that achieves a 100% pass rate on the strict html5lib test suite and includes a CSS selector query API. Built with VS Code and GitHub Copilot, the project showcases what coding agents can do and highlights the challenge of parsing the broken HTML that is so common in the real world.

Development was iterative, starting from a basic parser and gradually raising the test pass rate. A key obstacle was implementing the complex "adoption agency algorithm" for handling malformed HTML, a task that even the author of Firefox's original HTML5 parser considered challenging. Initial performance was slow, which led to a Rust tokenizer rewrite (with limited gains) and, ultimately, to porting logic from html5ever, a fast Rust parser, back to Python.

Although the end result is slower than html5lib, extensive profiling, coverage-guided code deletion, and fuzz testing improved performance significantly. The author stresses that the agent *wrote* the code while they focused on high-level design, correcting mistakes, and steering the process. The project underscores the value of clear goals, code review, and letting agents learn from failure in AI-assisted development.

## JustHTML: A Python HTML5 Parser Built with Coding Agents

Emil Stenström recently created JustHTML, a Python library that implements a fully HTML5-compliant parser in roughly 3,000 lines of code, passing all 9,200 HTML5 conformance tests. Developed over a few months with a variety of coding-agent tools, the project demonstrates their potential to solve complex tasks by iterating against tests.

JustHTML was initially built from scratch to full test coverage. To speed up development, Stenström then had coding agents rewrite the parser with a code structure based on the Rust library `html5ever`, though Rust-specific optimizations mean it is not a direct port. The resulting version is reportedly 60% faster than `html5lib`.

The project highlights how effective coding agents are when there is a clear right or wrong answer, as in parsing. The library is MIT-licensed, which allows further development and adaptation, including potential use cases such as cleaning up RSS feeds inside a PostgreSQL database. The author has also updated the blog post about the project based on feedback, improving its readability.

Original article

I recently released JustHTML, a Python-based HTML5 parser. It passes 100% of the html5lib test suite, has zero dependencies, and includes a CSS selector query API. Writing it taught me a lot about how to work with coding agents effectively.

I thought I knew HTML going into this project, but it turns out I know nothing when it comes to parsing broken HTML5 code. That's the majority of the algorithm.

Henri Sivonen, who implemented the HTML5 parser for Firefox, called the "adoption agency algorithm" (which handles misnested formatting elements) "the most complicated part of the tree builder". It involves a "Noah's Ark" clause (limiting identical elements to 3) and complex stack manipulation that breaks the standard stack model.

I still don't know how to solve those problems. But I still have a parser that solves those problems better than the reference implementation html5lib. Power of AI! :)

## Why HTML5?

When picking a project to build with coding agents, choosing one that already has a lot of tests is a great idea. HTML5 is extremely well-specified, with a long specification and thousands of treebuilder and tokenizer tests available in the html5lib-tests repository.

When using coding agents autonomously, you need a way for them to understand their own progress. A complete test suite is perfect for that. The agent can run the tests, see what failed, and iterate until they pass.

## Building the parser (iterations, restarts, and performance work)

Writing a full HTML5 parser is not a short one-shot problem. I have been working on this project for a couple of months, in my off-hours.

Tooling: I used plain VS Code with GitHub Copilot in agent mode. I enabled automatic approval of all commands, and then added a blacklist of commands that I always wanted to approve manually. I wrote an agent instruction telling it to keep working and not stop to ask questions. Worked well!

Here is the process that got the project to this point:

## A one-shot HTML5 parser (as a baseline)

To begin, I asked the agent to write a super-basic one-shot HTML5 parser. It didn't work very well, but it was a start.

## Wiring up html5lib-tests (<1% pass rate)

Next, I wired up the html5lib-tests and saw that we had a <1% pass rate. Yes, those tests are hard. They are the gold standard for HTML5 parsing, containing thousands of edge cases like:
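The tree-construction tests in that repository are .dat files with a #data / #errors / #document structure, where the expected DOM is written as an indented tree. An illustrative case with misnested formatting tags (paraphrased, not copied verbatim from the suite) looks like this:

#data
<b><i></b></i>
#errors
(unexpected end tags)
#document
| <html>
|   <head>
|   <body>
|     <b>
|       <i>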

## Iterating to a ~30% pass rate (refactors and bugfixes)

After that, we started iterating, slowly climbing to about a 30% pass rate. This involved a lot of refactoring and fixing small bugs.

## Refactoring into per-tag handlers

Once I could see the shape of the problem, I decided I liked a handler-based structure, where each tag gets its own handler. Modular structure ftw! I asked the agent to refactor and it did.

class TagHandler:
    """Base class for all tag handlers."""
    def handle_start(self, context, token):
        pass

class UnifiedCommentHandler(TagHandler):
    """Handles comments in all states."""
    def handle_start(self, context, token):
        context.insert_comment(token.data)
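The dispatch around those handlers isn't shown in the post. A minimal sketch of how a parser might route start-tag tokens to per-tag handlers (the registry and names here are my assumptions, not JustHTML's actual internals):

class HandlerRegistry:
    """Map tag names to handler instances, with a fallback for unknown tags."""
    def __init__(self, default_handler):
        self.by_tag = {}
        self.default = default_handler

    def register(self, tag_name, handler):
        self.by_tag[tag_name] = handler

    def dispatch_start(self, context, token):
        # Look up the handler for this tag, falling back to the default.
        handler = self.by_tag.get(token.name, self.default)
        handler.handle_start(context, token)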

## Reaching a 100% pass rate (with better models)

From there, we continued iterating to a 100% pass rate. This took a long time, and the Claude 3.7 Sonnet release was the reason we got anywhere at all.

## Benchmarking and discovering we were 3x slower

With correctness handled, I set up a benchmark to test how fast my parser was. I saw that I was 3x slower than html5lib, which is already considered slow.
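The benchmark itself is not shown in the post; a minimal harness along these lines (the names and entry points are placeholders, not the author's actual script) is enough to get a relative number against html5lib:

import time

import html5lib

def bench(parse_fn, documents, repeats=3):
    # Parse every document `repeats` times and keep the best total time.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for doc in documents:
            parse_fn(doc)
        best = min(best, time.perf_counter() - start)
    return best

# baseline = bench(html5lib.parse, documents)
# mine = bench(my_parser.parse, documents)  # hypothetical entry point
# print(f"relative speed vs html5lib: {mine / baseline:.2f}x")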

## Rewriting the tokenizer in Rust (and barely matching html5lib)

So I tried the obvious next move: I let an agent rewrite the tokenizer in Rust to speed things up (note: I don't know Rust). It worked, and the result barely edged past html5lib in speed. It created a whole rust_tokenizer crate with 690 lines of Rust code in lib.rs that I couldn't read, but it passed the tests.

## Discovering html5ever (fast, correct, Rust)

While looking for alternatives, I found html5ever, Servo's parsing engine. It is highly spec-compliant and was written from scratch in Rust with speed in mind.

## Asking: why build this at all?

At that point I had the uncomfortable thought: why would the world need a slower version of html5ever in partial Python? What is the meaning of it all?! I almost just deleted the whole project.

## Pivoting to porting html5ever logic to Python

Instead of quitting, I considered writing a Python interface against html5ever, but decided I didn't like the hassle of a library that requires installing binaries. So I went pure Python again, but with a different angle: what if I ported the html5ever logic to Python? Shouldn't that be faster than the existing Python libraries? I decided to throw all previous work away.

## Restarting from scratch (again)

So I started over from a <1% pass rate and iterated with the same set of tests all the way up to 100%. This time I asked the agent to cross-reference the Rust codebase from the beginning. It was tedious work, doing the same thing over again.

## Still slower than html5lib

Unfortunately, I ran the benchmark on the new codebase and found that it was still slower than html5lib.

## Profiling, real-world benchmarks, and micro-optimizations

So I switched to performance work and wrote some new tools for the agents to use: a simple profiler, and a scraper that built a dataset of 100k popular webpages for real-world benchmarking. I managed to get the speed below the target with Python micro-optimizations, but only when using the just-released Gemini 3 Pro (which is incredible) to run the benchmark and profiler iteratively. No other model made any progress on the benchmarks.

def _append_text_chunk(self, chunk, *, ends_with_cr=False):
    if not chunk:
        self.ignore_lf = ends_with_cr
        return
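    # If the previous chunk ended with CR, a leading LF here is the second
    # half of a CRLF pair and must be dropped (HTML5 newline normalization).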
    if self.ignore_lf:
        if chunk[0] == "\n":
            chunk = chunk[1:]
            # ...
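The "simple profiler" tool is not included in the post. A rough stand-in built on the standard library's cProfile, which is all an agent really needs to find hotspots, might look like this (my sketch, not the project's actual tooling):

import cProfile
import pstats

def profile_parse(parse_fn, html_docs, top=25):
    # Parse a batch of documents under cProfile and print the functions
    # with the highest cumulative time.
    profiler = cProfile.Profile()
    profiler.enable()
    for doc in html_docs:
        parse_fn(doc)
    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(top)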

## Deleting untested code (coverage as a scalpel)

Later, on a whim, I ran coverage on the codebase and found that large parts of the code were "untested". But this was backwards, because I already knew that the tests were covering everything important. So lines with no test coverage could be removed! I told the agent to start removing code to reach 100% test coverage, which was an interesting reversal of roles. These removals actually sped up the code as much as the micro-optimizations.

# Before: 786 lines of treebuilder code
# After: 453 lines of treebuilder code
# Result: Faster and cleaner
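The coverage run here is just the standard coverage.py workflow (normally `coverage run -m pytest` followed by `coverage report -m` from the shell). A rough Python-driven equivalent of what the agent gets to see, with package and test paths assumed for illustration:

import coverage
import pytest

# Run the test suite under coverage and list the lines no test executes;
# in this project those lines are deletion candidates, not gaps to fill.
cov = coverage.Coverage(source=["justhtml"])
cov.start()
pytest.main(["tests/"])
cov.stop()
cov.report(show_missing=True)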

## Fuzzing to find crashes and harden the parser

After removing code, I got worried that I had removed too much and missed corner cases. So I asked the agent to write an HTML5 fuzzer that tried really hard to generate HTML that broke the parser.

def generate_fuzzed_html():
    """Generate a complete fuzzed HTML document."""
    parts = []
    if random.random() < 0.5:
        parts.append(fuzz_doctype())
    # Generate random mix of elements
    num_elements = random.randint(1, 20)
    # ...

It did break the parser, and for each breaking case I asked the agent to fix it and write a new test for the test suite. The parser eventually got through 3 million generated webpages without any crashes, which hardened the codebase further.
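The harness around that generator is not shown; a minimal loop (my sketch, not the project's actual fuzzer) only has to check that parsing never raises, since an HTML5 parser must accept arbitrary input:

def fuzz(parse_fn, iterations=1_000_000):
    # Throw generated documents at the parser; any exception is a bug.
    failures = []
    for i in range(iterations):
        html = generate_fuzzed_html()
        try:
            parse_fn(html)
        except Exception as exc:
            failures.append((html, exc))
            print(f"crash #{len(failures)} at iteration {i}: {exc!r}")
    return failures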

## Comparing against other parsers (how rare 100% is)

To sanity-check where 100% landed, I ran the html5lib tests against the other parsers. I found that no other parser reaches a 90% pass rate, and that lxml, one of the most popular Python parsers, is at 1%. The reference implementation, html5lib itself, is at 88%. Maybe this is a hard problem after all?

## Shipping it as a library (CI, releases, selector API)

Finally, to make this a good library, I asked the agent to set up CI, GitHub releases, and a query API, write READMEs, and so on.

from justhtml import JustHTML, query

doc = JustHTML("<div><p>Hello</p></div>")
elements = query(doc, "div > p")

I decided to rename the library from turbohtml to justhtml, so as not to fool anyone into thinking it's the fastest library, and to focus instead on the feeling of everything just working.

## What the agent did vs. what I did

After writing the parser, I still don't know HTML5 properly. The agent wrote it for me. I guided it when it came to API design and corrected bad decisions at the high level, but it did ALL of the gruntwork and wrote all of the code.

I handled all git commits myself, reviewing code as it went in. I didn't understand all the algorithmic choices, but I understood when it didn't do the right thing.

As models have gotten better, I've seen steady increases in test coverage. Gemini is the smartest model from a one-shot perspective, while Claude Opus is best at iterating its way to a good solution.

## Practical tips for working with coding agents

  1. Start with a clear, measurable goal. "Make the tests pass" is better than "improve the code."
  2. Review the changes. The agent writes a lot of code. Read it. You'll catch issues and learn things.
  3. Push back. If something feels wrong, say so. "I don't like that" is a valid response.
  4. Use version control. If the agent goes in the wrong direction, you can always revert.
  5. Let it fail. Running a command that fails teaches the agent something. Don't try to prevent all errors upfront.

## Was it worth it (and what “quickly” meant)?

Yes. JustHTML is about 3,000 lines of Python with 8,500+ tests passing. I couldn't have written it this quickly without the agent.

But "quickly" doesn't mean "without thinking." I spent a lot of time reviewing code, making design decisions, and steering the agent in the right direction. The agent did the typing; I did the thinking.

That's probably the right division of labor.
