Half a million “words with spaces” are missing from dictionaries

Original link: https://www.linguabase.org/words-with-spaces.html

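## Missing vocabulary in English

Traditional dictionaries ignore a large number of common phrases, about half a million, simply because they contain spaces between words. This stems from a historical focus on single words rather than multi-word expressions (MWEs) such as “boiling water” or “red tape.” Resources like Wiktionary offer broader coverage, but they still miss a significant share of these common combinations.

Coverage drops sharply as phrases become less common: Merriam-Webster covers only 18% of the top 10,000 MWEs, falling to 7% at 100,000. Many seemingly obvious phrases, such as “hospital bills,” are left out, while others with less predictable meanings (“cold feet”) are included.

English has roughly 30 billion possible meaningful two-word combinations, many of which function as conceptual units beyond their component parts. The author argues that these phrases should be treated as vocabulary, highlighting the gap between dictionary definitions and how we actually *use* language. This affects applications such as word games and reveals a hidden reservoir of “vocabulary” that dictionaries have historically ignored.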

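A recent Hacker News thread discussed a project (linguabase.org) that identified nearly half a million English compound phrases, such as “boiling water” or “Saturday night,” that are currently missing from dictionaries simply because they contain spaces. Some commenters found the opening examples weak (suggesting that “shut up” or “hot dog” would be more illustrative), but the core issue is the difficulty of defining when a phrase becomes a “word” worth including in a dictionary. This problem may be *even worse* in languages such as German that readily form compound nouns. The discussion also touched on the trouble non-native English speakers have with these phrases, especially phrasal verbs, whose meaning is not evident from the individual words. The project was seen as valuable for expanding vocabulary, particularly in word games, by recognizing these “semi-transparent” compounds as legitimate linguistic units.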

Original article

There are nearly half a million compound phrases that aren’t in any dictionary—simply because they contain spaces. “Boiling water.” “Saturday night.” “Help me.” I got interested in this because I make word games. I wanted to understand when to include words with spaces—and the legacy effects of traditional dictionaries on our sense of “what is a word.”

[Interactive slider in the original: expressions shown at different familiarity levels, from familiar to obscure. Gold words are missing from traditional dictionaries.]

Crowd-sourced Wiktionary has 16 times more headwords than Merriam-Webster’s already hefty tabletop book. Yet even Wiktionary leaves gaps.

Focused on singletons

When dictionaries were planned and created, lexicographers focused on the building blocks of language—and overwhelmingly preferred individual words. Even the technical term is clinical: “multi-word expressions” (MWEs)—as if they’re a deviation from the norm.

Coverage drops as you go deeper

[Chart in the original. Series: MW (Merriam-Webster) and +Wikt (with Wiktionary added).]

Merriam-Webster (green) covers just 18% of the top 10,000 MWEs, dropping to 7% by 100K. Adding Wiktionary (yellow) brings coverage to 75%, but even that drops to 48%.
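A coverage figure like the ones above can be sketched as a set-membership count over a familiarity-ranked list. This is a minimal illustration, not the author's pipeline: the function name `coverage_at_cutoffs`, the toy ranked list, and the toy dictionary are all hypothetical, and it assumes the percentages are cumulative over the top N entries.

```python
from typing import Dict, Iterable, Sequence


def coverage_at_cutoffs(
    ranked_mwes: Sequence[str],
    headwords: Iterable[str],
    cutoffs: Iterable[int],
) -> Dict[int, float]:
    """Return {cutoff: share of the top `cutoff` MWEs present in `headwords`}."""
    known = {h.lower() for h in headwords}
    result: Dict[int, float] = {}
    for n in cutoffs:
        top = ranked_mwes[:n]  # most familiar first
        if not top:
            result[n] = 0.0
            continue
        hits = sum(1 for mwe in top if mwe.lower() in known)
        result[n] = hits / len(top)
    return result


# Toy data (hypothetical): ranked from most to least familiar.
ranked = [
    "hot dog", "high school", "boiling water", "red tape",
    "hospital bills", "smooth skin", "exit door", "angry person",
]
toy_dictionary = {"hot dog", "high school", "red tape"}

cov = coverage_at_cutoffs(ranked, toy_dictionary, [4, 8])
for cutoff, share in cov.items():
    print(f"top {cutoff}: {share:.1%} covered")
```

With real data, `ranked` would be the full MWE list sorted by familiarity and `headwords` a dictionary's headword list; the same loop then yields one coverage value per depth.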

A few selected, non-obvious expressions (called “opaque compounds”) would be included if they seemed interesting enough. And dictionary pages were not wasted on self-evident combinations. But what was lost?

  • Obvious meaning—often missing from dictionaries: hospital bills, smooth skin, angry person, exit door
  • Unpredictable meaning—usually in dictionaries: fat chance, melting pot, guilt trip, cold feet

Buried in the billions

Language is a vast combinatorial space. English has roughly 325K nouns, 60K verbs, and 85K adjectives. Multiply across grammatical patterns and you get about 250 billion possible two-word combinations.

Most are nonsense—“purple Wednesday,” “angry furniture.” But roughly 15% are plausible: “wooden chair,” “morning coffee.” That’s still 30 billion sensible pairs.

What interests me is that within those billions, some combinations have crystallized into something more—expressions that carry conceptual weight beyond their parts. “Hot dog” isn’t just a warm canine. “Red tape” isn’t about colored adhesive.

I asked Claude to brainstorm candidates (explained below). It returned 774K MWEs. These are some of the types, and how often they appeared in Merriam-Webster (19,641 of 89,128 headwords), Oxford American (15,488 of 72,495 headwords), and Wiktionary (232,454 of 1.44 million English entries). In each graph below, the left side is more familiar, the right more obscure:

[Graphs in the original, one per type of expression. Sample expressions shown for each:]

  • boiling water, paper towel, cold weather, front door
  • blood pressure, machine learning, black hole, climate change
  • high school, living room, best friend, fast food
  • New York, World War, Coca Cola, Michael Jackson
  • hot dog, piece of cake, red tape, couch potato
  • kick out, give up, look after, break down
  • null and void, terms and conditions, cease and desist, rules and regulations
  • black and white, back and forth, hide and seek, yes or no
  • take care, have fun, get married, tell the truth
  • I am, as well as, each other, such as

[Legend: totally missing; in Wiktionary; in traditional dictionaries (MW, Oxford, or both).]

Print dictionaries barely cover these expressions: Merriam-Webster has just 2.4%, Oxford has 2.1%. Even combined, they cover only 3.2%. Wiktionary does better at 30%.
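The three print-dictionary figures also imply how much the two dictionaries overlap. The sketch below applies inclusion-exclusion to the rounded percentages quoted above; it assumes all three share the same denominator, which the article implies but does not state, so the results are approximate. The helper name `overlap_share` is hypothetical.

```python
def overlap_share(a_pct: float, b_pct: float, union_pct: float) -> float:
    """Inclusion-exclusion: |A and B| = |A| + |B| - |A or B|, in percentage points."""
    return a_pct + b_pct - union_pct


# Rounded figures quoted in the article.
MW_PCT, OXFORD_PCT, COMBINED_PCT = 2.4, 2.1, 3.2

both_pct = round(overlap_share(MW_PCT, OXFORD_PCT, COMBINED_PCT), 1)
mw_only_pct = round(MW_PCT - both_pct, 1)
oxford_only_pct = round(OXFORD_PCT - both_pct, 1)

print(f"in both dictionaries: ~{both_pct}%")
print(f"Merriam-Webster only: ~{mw_only_pct}%")
print(f"Oxford only:          ~{oxford_only_pct}%")
```

Under that assumption, roughly 1.3% of the expressions appear in both print dictionaries, so each adds only about one percentage point of coverage the other lacks.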

The opaque compounds and phrasal verbs get the best coverage in Wiktionary. The named entities (proper nouns) are mostly encyclopedic topics—so they’re in Wikipedia. The technical terms drift into jargon as they get more obscure, but jargon isn’t useful for word games because too few players know it.

But see all the gold at the top of those little graphs for “transparent” and “semi-opaque”? That’s a deep reservoir of real MWEs—and we’ve been conditioned by dictionaries not to expect it. The old guard dictionaries, colored above in dark blue and dark green, overwhelmingly ignore them.

But even among the transparent expressions, some carry more conceptual weight than their words alone. “Yellow pear” is just describing. “Boiling water” names a thing (what linguists call the naming function): a state of matter, a hazard, a cooking stage. The difference? One could have been a word.

Estimate source: These numbers are illustrative—no absolute counts exist. From 500K single-word terms, we sampled and classified parts of speech, including inflected forms and rare terms: ~325K nouns, 60K verbs, 85K adjectives (roughly twice WordNet’s counts). We multiplied across grammatical patterns. To estimate how many “make sense,” we randomly paired words: noun+noun 12% sensible, adjective+noun 64%, verb+noun 40%. The 30 billion is a weighted estimate.
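The weighting idea can be sketched with the three patterns whose sensible rates are quoted above. This is an illustration of the method only: the article does not list every grammatical pattern or the weights it used, so the sketch is not expected to reproduce the 250 billion or 30 billion figures, and the variable names are hypothetical.

```python
# Part-of-speech counts quoted in the article.
NOUNS, VERBS, ADJECTIVES = 325_000, 60_000, 85_000

# pattern -> (number of possible pairs, sampled share judged sensible)
# Only the three patterns with quoted rates are included (an assumption).
patterns = {
    "noun+noun": (NOUNS * NOUNS, 0.12),
    "adjective+noun": (ADJECTIVES * NOUNS, 0.64),
    "verb+noun": (VERBS * NOUNS, 0.40),
}

total_pairs = sum(count for count, _ in patterns.values())
sensible_pairs = sum(count * rate for count, rate in patterns.values())
overall_rate = sensible_pairs / total_pairs

for name, (count, rate) in patterns.items():
    print(f"{name:15s} {count / 1e9:7.1f}B pairs, {count * rate / 1e9:5.1f}B sensible")
print(f"total: {total_pairs / 1e9:.1f}B pairs, "
      f"{sensible_pairs / 1e9:.1f}B sensible ({overall_rate:.0%})")
```

These three patterns alone give about 153 billion pairs, of which about 38 billion count as sensible, which is the same order of magnitude as the article's estimate. Noun+noun dominates the raw count but has the lowest sensible rate, so the weighting matters more than the raw multiplication.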

Here are examples of phrases that are cousins of words that did get lexicalized:

Got a word                        Didn’t
frozen water → ice                boiling water
eyes closing briefly → blink      closed eyes
dull persistent pain → ache       severe pain

Should English have created a word for severe pain, other than the more emotionally tinged “agony” or “anguish”? I don’t know. But “severe pain” is tight enough to be a useful token in a word game. It names a thing.

Named entities

This analysis is about timeless words, not trivia. So I excluded 133K named entities—proper nouns like “New York City,” “Albert Einstein,” and “World War II.” Named entities are well-covered by Wikipedia, as are 47% of “technical” MWEs (“Parkinson’s disease”).

Scrabble famously excludes proper nouns. Crossword puzzles embrace them. Choosing how many proper nouns to include in a word game is tangled with cultural assumptions. Easy trivia is pointless, and hard trivia is alienating. Should all U.S. presidents be fair game? What about K-pop stars? British monarchs? Nobel laureates? The “right” answer depends on your audience, and that’s a design question, not a linguistic one.

The words that aren’t words

The line between “word” and “phrase” is fuzzier than dictionaries suggest. Words tend to carry more conceptual weight—but not always. Some words probably didn’t need to be coined; some phrases deserve more credit than they get. English is full of two- and three-word phrases that function as single semantic units—effectively “words”—but because they’re not fused into one orthographic word, they’re invisible to dictionaries and underappreciated as vocabulary. “Kindergarteners” is in the dictionary, but not “middle schoolers.” Newer professions—“car salesman,” “vacuum salesman,” “balloon seller”—are two words, but ancient professions might have a one-worder: jeweler, florist, fishmonger, tobacconist, stationer, mercer.

Meanwhile, in Spanish

Other languages sometimes have tidy one-worders for things English can only describe. Spanish carves up time with precision English lacks: madrugada for the pre-dawn hours, atardecer for late afternoon waning into evening. The mid-day nap was so compelling we adopted the siesta into English.

But concise one-word equivalents in other languages appear only among the 2,000 most familiar MWEs, and even then rarely. It’s as if languages collectively decided these concepts weren’t worth crystallizing into single words.

Exceptions exist—Norwegian forelsket for the euphoria of falling in love, Danish hygge for candlelit contentment, Portuguese saudade for longing. But these are famous precisely because they’re rare. Beyond the top tier of familiarity, the well runs dry everywhere. No language has a single word for “parking lot frustration” or “Sunday evening dread.” Generally, if English describes it with a phrase, so does everyone else.
