I ****Ing Hate Science (2021)

原始链接: https://buttondown.com/hillelwayne/archive/i-ing-hate-science/

## The Frustrating Reality of Software Engineering Research

The author, an advocate of both empirical software engineering (ESE) and formal methods (FM), set out on a frustrating research journey to answer a simple question: are bugs found earlier in the development process cheaper to fix? He found the research landscape to be surprisingly unreliable. Frequently cited "facts" (for example, that a requirements bug costs 100x less to fix than one found later) often lack traceable sources, stemming from unsubstantiated claims or outdated studies. Even getting hold of the original research is difficult, hindered by paywalls and the complexity of academic databases and terminology. On top of that, many existing studies are flawed, relying on outdated methods, narrow scopes, or questionable statistical analysis. Even the "good" papers often lack clear definitions and rest on small, specific datasets. Despite these challenges, the author still sees value in ESE, noting that some findings, such as the benefits of code review and short iteration cycles, enjoy consistent support. Ultimately, while a definitive answer remains elusive, he tentatively concludes that late-stage bugs *tend* to be more expensive to fix, especially those rooted in design flaws. He stresses that this kind of research demands critical thinking and a tolerance for uncertainty, because the field is "a giant incoherent mess of sadness."

## The Predicament of Modern Science

This Hacker News discussion, prompted by the article "I ****Ing Hate Science", centers on the challenges of doing research and building knowledge. One key point is how easily different fields self-correct: fields grounded in direct empirical study, where self-correction is easy, are contrasted with fields where a foundational mistake can derail entire lines of research, such as complex biochemical pathways. Many commenters dwelled on the nuances of the word "science", contrasting the modern, method-centric definition with older notions covering broader kinds of knowledge, for example distinct terms in European languages such as "Wissenschaft". The conversation touched on academia's incentive problems, where citations often matter more than the pursuit of truth, producing a large "junk pile" of papers. Some argued this is inherent to a system in which research is a paid deliverable, while others suggested that tools like ConnectedPapers can help in navigating the literature. Ultimately, the discussion acknowledged science's strengths, namely its capacity for self-correction and its rigorous methodology, while also recognizing its flaws and the need for critical thinking and ethical principles within the scientific community. It also briefly touched on the emotional and social needs that science cannot meet, contrasting them with the comfort religion provides.

Original post
July 19, 2021

I'm a big advocate of Empirical Software Engineering. I wrote a talk on it. I wrote a 6000-word post covering one controversy. I spend a lot of time reading papers and talking to software researchers. ESE matters a lot to me.

I'm also a big advocate of formal methods (FM). I wrote a book on it, I'm helping run a conference on it, I professionally teach it for a living. There's almost no empirical evidence that FM helps us deliver software cheaper, because it's such a niche field and nobody's really studied it. But we can study a simpler claim: does catching software defects earlier in the project life cycle reduce the cost of fixing bugs? Someone asked me just that.

Which meant I'd have to actually dive into the research.

Hoo boy.

I've been dreading this. As much as I value empirical evidence, software research is also a train wreck where both trains were carrying napalm and tires.

Common Knowledge is Wrong

If you google "cost of a software bug" you will get tons of articles that say "bugs found in requirements are 100x cheaper than bugs found in implementations." They all use this chart from the "IBM Systems Sciences Institute":

There's one tiny problem with the IBM Systems Sciences Institute study: it doesn't exist. Laurent Bossavit did an exhaustive trawl and found that the ISSI, if it did exist, was an internal training program and not a research institute. As far as anybody knows, that chart is completely made up.

You also find a lot by Barry Boehm and COCOMO, which is based on research from the 1970s, and corruptions of Barry Boehm, which Bossavit tears down in his book on bad research. You also get lots of people who just make up hypothetical numbers, then other people citing those numbers as fact.

It's a standard problem with secondary sources: most of them aren't very good. They corrupt the actual primary information to advance their own agenda. If you want to get the lay of what the research actually says, you need to read the primary sources, or the papers that everybody's butchering.

But first you gotta find the primary sources.

Finding things is pain

The usual problem people raise with research is the cost: if you don't have an institutional subscription to a journal, reading a single paper can cost 40 bucks. If you're skimming dozens of papers, you're suddenly paying in the thousands just to learn "is planning good". Fortunately you can get around the paywalls with things like sci-hub. Alexandra Elbakyan has done more for society than the entire FSF. YEAH I WENT THERE

The bigger problem is finding the papers to read. General search engines have too much noise, academic search engines are terrible or siloed across a million different journals or both, and you don't know what to search. Like are you searching bug? Hah, newbie mistake! A good two-thirds of the papers are about defects. What's the difference between a "bug" and "defect"? Well, one's a bug and the other's a defect, duh!

I'm sure this is slightly easier if you're deeply embedded in academia. As an outsider, it feels like I'm trying to learn the intricacies of Byzantine fault tolerance without having ever touched a computer. Here's the only technique I've found that works, which I call scrobbling even though that means something totally different:

  1. Search seed terms you know, like "cost of bugs", in an appropriate journal (here's one).
  2. Find papers that look kinda relevant, skim their abstracts and conclusions.
  3. Make a list of all the papers that either cite or are cited by these papers and repeat.
  4. Find more useful terms and repeat.

Over time you slowly build out a list of good "node papers" (mostly literature reviews) and useful terms to speed up this process, but it's always gonna be super time consuming. Eventually you'll have a big messy mass of papers, most of which are entirely irrelevant, some of which are mostly irrelevant, and a precious few of which are actually topical. Unfortunately, the only way to know which is which is to grind through them.
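
To make the loop concrete, here's a minimal sketch of that citation-snowballing process in Python. Everything in it is hypothetical: `search`, `related`, and `looks_relevant` are stand-ins for whatever database you query, the citation links you follow, and the human skimming of abstracts, and the tiny hard-coded graph exists only so the loop runs end to end.

```python
from collections import deque

# Hypothetical helpers: wire these to whatever search engine / database you
# actually use. The hard-coded toy graph is only here so the loop runs.
SEARCH_RESULTS = {"cost of bugs": ["paper-A", "paper-B"]}
CITATION_LINKS = {
    "paper-A": ["paper-C"],
    "paper-B": ["paper-C", "paper-D"],
    "paper-C": [],
    "paper-D": ["paper-A"],
}

def search(term):
    """Step 1: search a seed term in an appropriate journal/database."""
    return SEARCH_RESULTS.get(term, [])

def related(paper):
    """Step 3: papers that cite, or are cited by, this paper."""
    return CITATION_LINKS.get(paper, [])

def looks_relevant(paper):
    """Step 2: skim the abstract and conclusion. In real life a human does this."""
    return True

def scrobble(seed_terms, max_papers=50):
    """Breadth-first walk of the citation graph, starting from seed searches."""
    queue = deque(p for term in seed_terms for p in search(term))
    seen, keep = set(), []
    while queue and len(keep) < max_papers:
        paper = queue.popleft()
        if paper in seen:
            continue
        seen.add(paper)
        if looks_relevant(paper):
            keep.append(paper)
            queue.extend(related(paper))  # step 4: repeat with the new papers
    return keep

print(scrobble(["cost of bugs"]))  # ['paper-A', 'paper-B', 'paper-C', 'paper-D']
```

The traversal itself is trivial; the bottleneck is `looks_relevant`, which no script does for you. That's the grinding-through-them part.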

Most Papers are Useless

A lot will be from before 2000, before we had things like "Agile" and "unit tests" and "widespread version control", so you can't extrapolate any of their conclusions to what we're doing. As a rule of thumb I try to keep to papers after 2010.

Not that more recent papers are necessarily good! I mentioned earlier that most secondary sources are garbage. So are most primary sources. Doing science is hard and we're not very good at it! There are lots of ways to make a paper useless for our purposes.

  1. Calculating bugs in changed files, but measuring program metrics across the whole project
  2. Basing a lot of their calculations on sources that are complete garbage
  3. Never clearly mentioning that all of their data is exclusively on complex cyber-physical systems that 99.9999% of developers will never see
  4. Accidentally including beginner test repos in their GitHub mining
  5. Screwing up the highly error-prone statistical analysis, and then seventeenuple-counting the same codebase for good measure

This doesn't even begin to cover the number of ways a paper can go wrong. Trust me, there are a lot. And the best part is that most of these errors are very subtle and the only way you can notice them is by carefully reading the paper. In some cases, you need a second team of researchers to find the errors the first team made. That's what happened in the last reference, where chronicling the whole affair took me several months. Just chronicling it, after the dust had settled. I can't imagine how much effort went into actually finding the errors.

Oh, and even if nobody can find any errors, the work might not replicate. Odds are you'll never find out, because the academic-industrial complex is set up to discourage replication studies. Long story short if you want to use a cite as evidence you need to carefully read it to make sure it's actually something you want.

Good papers are useless too

Well here's a paper that says inspection finds defects more easily in earlier phases! Except it doesn't distinguish between defect severities, so we have no idea if finding defects earlier is cheaper. But this paper measures cost-to-fix too, and finds there's no additional cost to fixing defects later! But all the projects in it used the heavyweight Team Software Process (TSP). And it contradicts this paper, which finds that design-level reviews find many more bugs than code-level reviews… in a classroom setting.

Did I mention that all three of those papers use different definitions of "defect"? Could mean "something that causes the program to diverge from the necessary behavior", could mean a "misspelled word" (Vitharana, 2015). So even the good papers are working with small datasets, over narrow scopes, with conflicting results, and they can't even agree on what the words mean.

I normally call this "nuance". That is a very optimistic term. The pessimistic term is "a giant incoherent mess of sadness."

Nobody believes research anyway

The average developer thinks empirical software engineering is a waste of time. How can you possibly study something as complex as software engineering?! You've got different languages and projects and teams and experience levels and problem domains and constraints and timelines and everything else. Why should they believe your giant incoherent mess of sadness over their personal experience or their favorite speaker's logical arguments?

You don't study the research to convince others. You study the research because you'd rather be technically correct than happy.

Why Bother?

Well, first of all, sometimes there is stuff that we can all agree on. Empirical research overwhelmingly shows that code review is a good way to find software bugs and spread software knowledge. It also shows that shorter iteration cycles and feedback loops lead to higher quality software than long lead times. Given how hedged and halting most empirical claims are, when everybody agrees on something, we should pay attention. Code review is good, fast feedback is good, sleep is good.

Second, there's a difference between ESE as a concept and ESE as practiced. I'm a big proponent of ESE, but I also believe that the academic incentive structures are not aligned in a way that would give industry actionable information. There's much more incentive to create new models and introduce new innovations than to do the necessary "gruntwork" that would be most useful: participant observation, manual compilation and classification of bugs, detailed case studies, etc. This is an example of the kind of research I think is more useful. A team of researchers followed a single software team for three years and sat in on all of their sprint retrospectives. Even if the numbers don't translate to another organization, the general ideas are worth reading and reflecting on.

(Of course academia doesn't exist just to serve industry, and having cross-purpose incentives isn't necessarily a bad thing. But academics should at least make a conscious choice to do work that won't help the industry, as opposed to thinking their work is critical and wondering why nobody pays attention.)

Finally, even if the research is a giant incoherent mess of sadness, it's still possible to glean insights, as long as you accept that they'll be 30% research and 70% opinion. You can make inferences based on lots of small bits of indirect information, none of which is meaningful by itself but which together paint a picture.

Are Late-Stage Bugs More Expensive?

Oh yeah, the original question I was trying to answer. Kinda forgot about it. While there's no smoking gun, I think the body of research so far tentatively points in that direction, depending on how you interpret "late-stage", "bugs", and "more expensive". This is a newsletter, not a research paper, so I'll keep it all handwavey. Here's the rough approach I took to reach that conclusion:

Some bugs are more expensive than others. You can sort of imagine it being a Gaussian, or maybe a power law: most bugs are relatively cheap, a few are relatively expensive. We'd mine existing projects to create bug classifications, or we'd interview software developers to learn about their experiences. Dewayne Perry did one of these analyses and found that the bugs that took longest to fix (6 or more days) were things like feature interaction bugs and unacceptable global performance, in general stuff that's easier to catch in requirements and software modeling than in implementation.
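
To make that "most bugs are cheap, a few are expensive" shape concrete, here's a toy simulation. Every number in it is invented, not taken from any of the papers above; the only point is how a heavy-tailed (power-law) distribution of fix times means a handful of bugs dominate the total cost.

```python
import random
import statistics

random.seed(0)

# Invented numbers: sample 10,000 bug fix times (in days) from a heavy-tailed
# Pareto (power-law) distribution. The shape, not the values, is the point.
fix_days = [random.paretovariate(1.5) for _ in range(10_000)]

median = statistics.median(fix_days)
mean = statistics.mean(fix_days)
worst_5_percent = sum(sorted(fix_days, reverse=True)[: len(fix_days) // 20])
share = 100 * worst_5_percent / sum(fix_days)

print(f"median fix: {median:.1f} days, mean fix: {mean:.1f} days")
print(f"the worst 5% of bugs take ~{share:.0f}% of the total fix time")
```

The distribution itself says nothing about *when* those expensive bugs are introduced or caught; that's what classification studies like Perry's try to pin down.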

I've checked a few other papers and think I'm tentatively confident in this line of reasoning: certain bugs take more time to fix (and cause more damage) than others, and said bugs tend to be issues in the design. I haven't vetted these papers to make sure they don't make major mistakes, though. It's more time-consuming to do that when you're trying to synthesize a stance based on lots of indirect pieces of evidence. You're using a lot more papers but doing a lot less with each one, so the cost gets higher.

Anyway, I'm now 2000 words in and need to do other things today. tl;dr science is hard and researching it is hard and I really value empiricism while also recognizing how little quality work we've actually done in it.


TLA+ Conf schedule posted

Here! Even if you're not attending in person, we're looking to stream it live, so def check it out! I'll be giving a talk on tips and tricks to make spec writing easier. And there might be just a bit of showboating ;)

Update because this is popular

This was sent as part of an email newsletter; you can subscribe here. Common topics are software history, formal methods, the theory of software engineering, and silly research dives. I also have a website where I put my more polished and heavily-edited writing; newsletter is more for off-the-cuff writing. Updates are at least 1x a week.

Update

I'm seeing a lot of people read this and conclude that "software engineering" is inherently nonsensical and has nothing to do with engineering at all. Earlier this year I finished a large journalism project where I interviewed 17 "crossovers" who worked professionally as both "real" engineers and software developers. You can read the series here!

If you're reading this on the web, you can subscribe here. Updates are once a week. My main website is here.

My new book, Logic for Programmers, is now in early access! Get it here.
