A new era for software testing

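AI-generated code does not reach the structural quality of the best hand-written software, but when well managed it surpasses decently developed hand-written code most of the time. The truly transformative potential of large language models (LLMs), however, lies in software quality assurance (QA) and testing, where they can automate complex tasks that previously had to be done by hand, without sacrificing quality. Traditional testing tends to miss state-dependent edge cases and is constrained by time and logistics. By using LLMs as autonomous QA agents, developers can run complex end-to-end integration checks, such as verifying that distributed inference works, looking for speed regressions in newly added commits, or simulating days of production usage. These agents can even evaluate software from the user's point of view, identifying features that look surprising, insufficiently documented, or sloppy, a kind of check that was mostly skipped when it had to be done manually. Handing these labor-intensive manual passes to AI may raise the quality bar for new releases. Automatic QA may thus partially compensate for the lower quality of code produced at high speed with automatic programming.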

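This Hacker News thread discusses the pros and cons of "scenario testing", an approach that tests software from the perspective of user behavior rather than internal code logic. Proponents argue that traditional unit tests are often brittle, need constant maintenance whenever the internal structure changes, and struggle to capture real business functionality. Scenario tests focus on user-visible outcomes and stay stable even when the underlying implementation is refactored, so they provide more confidence. Critics raise several concerns:

- **Familiarity:** Many see it as a repackaging of existing ideas such as behavior-driven development (BDD), Playwright, or Cucumber.
- **Reliability:** Skeptics worry about giving up deterministic unit tests, arguing that scenario-based tests are slow, expensive, and unpredictable.
- **Necessity:** Some argue that unit tests provide a necessary feedback signal for refactoring, and that developers are responsible for verifying code at the component level.

The discussion highlights a persistent tension in software engineering: the burden of maintaining fine-grained, deterministic tests has to be weighed against the practical, user-centered confidence that end-to-end scenarios provide.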

antirez 4 days ago. 35783 views.

Automatic programming dramatically speeds up writing software in certain use cases and in the right hands. In my experience the output does not reach the structural quality and economy of complexity of the best hand-written software. However, not all software is stellar, and my feeling is that automatic programming, if well managed, surpasses most of the time the quality of decently developed hand-written code.

Yet, when writing new software with AI, there is a tradeoff between quality and time. In certain projects I developed the time side of this tradeoff was brutal: projects that might have taken many months were completed in a few weeks. However, there are domains where LLMs simply open new, strictly more powerful ways to automate processes, without any compromise on quality. One of those domains is software QA and testing.

Traditionally software is tested using test suites composed of locally-scoped tests and integration tests (think of Redis: it is one thing to test that SET foo 10 is matched by GET foo => 10, another to test that replication works in this case). Then come QA passes, usually executed manually, that can catch holes in the runnable test suite. It is well known that covering all the lines of the code does not mean covering all the possible states. Moreover, integration testing is structurally hard: timing issues, setups, and certain quality outputs that can only be inspected visually and not checked automatically leave a lot of testing opportunities unexploited because of time or logistical constraints.
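The gap between line coverage and state coverage can be shown with a toy example. The function below is hypothetical (it has nothing to do with the Redis code base) and the two tests are only a sketch of a locally-scoped test suite.

```python
def normalize(value, clamp=False, shift=False):
    """Toy function with two independent flags (hypothetical example)."""
    divisor = 2
    if clamp:
        divisor -= 1
    if shift:
        divisor -= 1
    return value / divisor


def test_clamp_only():
    # Executes the `clamp` branch; divisor ends up being 1.
    assert normalize(10, clamp=True) == 10.0


def test_shift_only():
    # Executes the `shift` branch; divisor ends up being 1.
    assert normalize(10, shift=True) == 10.0


# Together the two tests execute every line of normalize(), yet the
# state where both flags are set is never visited:
# normalize(10, clamp=True, shift=True) raises ZeroDivisionError.
```

A coverage tool would report every line of `normalize` as covered here, while the only failing state is reached by a combination of inputs that no test exercises.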

LLMs offer a new way to do QA on top of the existing testing methodologies. The idea is to create a markdown file where an AI agent is asked to work as a QA engineer, performing a number of manual tests on the new release. For instance, in the case of DwarfStar (an inference engine for open-weights LLMs) I use the following approach. In the markdown file, the agent is asked to check which commits are new on top of the already released version of the project. Then the model is given a list of things to perform, like:

- Check that distributed inference works across MacBook A and MacBook B, making sure the output is coherent, the inference works with all the GGUF files we have on both machines, ...
- Make sure this release does not contain any speed regression.

And so forth. Notably, in the speed regression part, I don't have to tell the agent what the previous expected speed was, as this is a moving target that changes with new releases and new optimizations. Similarly, the integration test for distributed inference does not require many instructions: at the start of the file there are just the SSH endpoints and the key to use, the paths, and so forth.
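One way to read the speed regression check is that the baseline is measured, not stored: the previously released build and the new one are both run on the same machine, and only the relative difference matters. A minimal sketch of that comparison follows, assuming throughput samples in tokens per second and an arbitrary 5% tolerance; the function name and the threshold are illustrative, not taken from the actual QA file.

```python
from statistics import median


def is_speed_regression(baseline_tps, candidate_tps, tolerance=0.05):
    """Return True if the candidate build looks slower than the baseline.

    baseline_tps:  throughput samples (tokens/s) from the released build.
    candidate_tps: throughput samples (tokens/s) from the new build.
    tolerance:     fractional slowdown accepted as noise (assumed 5%).
    """
    # The median is used so that a single noisy run does not dominate.
    base = median(baseline_tps)
    cand = median(candidate_tps)
    return cand < base * (1.0 - tolerance)
```

Because both sets of samples are collected during the same QA pass, no expected speed has to be written down and kept up to date across releases.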

The agent is asked to check the long list of QA activities *especially* in light of the added commits, starting with an inspection of the changes and with the identification of what could be affected, so that the QA pass specializes in trying to find specific regressions.
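The workflow described above (list the new commits, then hand the agent a checklist to run in light of them) could be assembled along these lines. This is only a sketch: the function names, the wording of the brief, and the idea of generating it from a script are assumptions, since the original file is written by hand and its exact content is not shown here.

```python
import subprocess


def commits_since(tag, repo="."):
    """Return one-line summaries of the commits between `tag` and HEAD."""
    result = subprocess.run(
        ["git", "-C", repo, "log", "--oneline", f"{tag}..HEAD"],
        check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line.strip()]


def build_qa_brief(tag, commits, checklist):
    """Assemble the markdown text handed to the QA agent (hypothetical layout)."""
    lines = [
        "# QA pass",
        "",
        "You are working as a QA engineer on the upcoming release.",
        f"The last released version is `{tag}`. New commits since then:",
        "",
    ]
    if commits:
        lines += [f"- {c}" for c in commits]
    else:
        lines.append("- (no new commits)")
    lines += [
        "",
        "First inspect these changes and identify what they could affect.",
        "Then perform the checks below, focusing on those areas:",
        "",
    ]
    lines += [f"- {item}" for item in checklist]
    return "\n".join(lines) + "\n"


# Example use inside a git checkout, with a hypothetical tag name:
#   brief = build_qa_brief("v1.0", commits_since("v1.0"),
#                          ["Make sure this release does not contain "
#                           "any speed regression."])
```

Keeping `build_qa_brief` free of side effects makes it easy to review the generated text before handing it to the agent.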

In the case of Redis Arrays, I used a similar methodology, asking the agent to build a large array-based Redis application, to set up a production environment with replication and persistence, and to simulate the usage of the application for days and with many users, checking whether anything was odd.

Testing that uses these approaches may also move toward the more psychological side of software quality, asking the agent to identify all the new features that may look surprising, insufficiently documented, or generally sloppy from the POV of the user. These are all things that had to be done manually before, and that most of the time were skipped.

I have the feeling that the introduction of automatic QA may raise the bar of quality for new releases of software, and maybe partially compensate for the lower quality of the code produced at high speed with the use of automatic programming.
