There are nearly half a million compound phrases that aren’t in any dictionary—simply because they contain spaces. “Boiling water.” “Saturday night.” “Help me.” I got interested in this because I make word games. I wanted to understand when to include words with spaces—and the legacy effects of traditional dictionaries on our sense of “what is a word.”
[Interactive slider: expressions at different familiarity levels. Gold words are missing from traditional dictionaries.]
Crowd-sourced Wiktionary has 16 times more headwords than Merriam-Webster’s already hefty tabletop book. Yet even Wiktionary leaves gaps.
Focused on singletons
When dictionaries were planned and created, lexicographers focused on the building blocks of language—and overwhelmingly preferred individual words. Even the technical term is clinical: “multi-word expressions” (MWEs)—as if they’re a deviation from the norm.
Coverage drops as you go deeper
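[Chart: dictionary coverage of MWEs by familiarity rank, for Merriam-Webster alone (MW) and with Wiktionary added (+Wikt).]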
Merriam-Webster (green) covers just 18% of the top 10,000 MWEs, dropping to 7% by 100K. Adding Wiktionary (yellow) brings coverage to 75%, but even that drops to 48%.
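Each coverage figure here is a simple ratio: of the N most familiar MWEs, how many appear as headwords in a given dictionary. Here is a minimal sketch of that calculation, assuming a list already sorted from most to least familiar and sets of lowercase headwords. The function name `coverage_at` and the toy data are hypothetical; they are not the pipeline or the data behind the chart.

```python
def coverage_at(ranked_mwes, headwords, top_n):
    """Share (0.0 to 1.0) of the top_n most familiar MWEs found in headwords."""
    top = ranked_mwes[:top_n]
    if not top:
        return 0.0
    hits = sum(1 for phrase in top if phrase.lower() in headwords)
    return hits / len(top)


# Toy data for illustration only: made-up membership, not real dictionary contents.
ranked = ["high school", "hot dog", "boiling water",
          "red tape", "exit door", "angry person"]
mw = {"high school", "hot dog"}
wikt = {"boiling water", "red tape"}

for n in (2, 4, 6):
    mw_alone = coverage_at(ranked, mw, n)
    with_wikt = coverage_at(ranked, mw | wikt, n)  # set union = "adding Wiktionary"
    print(f"top {n}: MW {mw_alone:.0%}, MW + Wiktionary {with_wikt:.0%}")
```

Running the same function at larger cutoffs (10,000, then 100,000) is what would produce a declining curve like the one described.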
A few selected, non-obvious expressions (called “opaque compounds”) would be included if they seemed interesting enough. And dictionary pages were not wasted on self-evident combinations. But what was lost?
- Obvious meaning—often missing from dictionaries: hospital bills, smooth skin, angry person, exit door
- Unpredictable meaning—usually in dictionaries: fat chance, melting pot, guilt trip, cold feet
Buried in the billions
Language is a vast combinatorial space. English has roughly 325K nouns, 60K verbs, and 85K adjectives. Multiply across grammatical patterns and you get about 250 billion possible two-word combinations.
Most are nonsense—“purple Wednesday,” “angry furniture.” But roughly 15% are plausible: “wooden chair,” “morning coffee.” That’s still 30 billion sensible pairs.
What interests me is that within those billions, some combinations have crystallized into something more—expressions that carry conceptual weight beyond their parts. “Hot dog” isn’t just a warm canine. “Red tape” isn’t about colored adhesive.
I asked Claude to brainstorm candidates (explained below). It returned 774K MWEs. These are some of the types, and how often they appeared in Merriam-Webster (19,641 of 89,128 headwords), Oxford American (15,488 of 72,495 headwords), and Wiktionary (232,454 of 1.44 million English entries). In each graph below, the left side is more familiar, the right more obscure:
- boiling water, paper towel, cold weather, front door
- blood pressure, machine learning, black hole, climate change
- high school, living room, best friend, fast food
- New York, World War, Coca Cola, Michael Jackson
- hot dog, piece of cake, red tape, couch potato
- kick out, give up, look after, break down
- null and void, terms and conditions, cease and desist, rules and regulations
- black and white, back and forth, hide and seek, yes or no
- take care, have fun, get married, tell the truth
- I am, as well as, each other, such as

Legend: totally missing; in Wiktionary; in traditional dictionaries (MW, Oxford, or both).
Print dictionaries barely cover these expressions: Merriam-Webster has just 2.4%, Oxford has 2.1%. Even combined, they cover only 3.2%. Wiktionary does better at 30%.
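The combined figure is smaller than the sum of the two print dictionaries because some expressions appear in both. By inclusion-exclusion, 2.4% + 2.1% − 3.2% implies that roughly 1.3% of candidates are in both Merriam-Webster and Oxford. A minimal sketch of that bookkeeping with set operations follows. The function name `coverage_summary` and the toy headword sets are hypothetical and are not the real dictionary data.

```python
def coverage_summary(candidates, mw, oxford, wikt):
    """Return the share of candidate MWEs found in each source (0.0 to 1.0)."""
    cands = set(candidates)
    total = len(cands)
    if total == 0:
        raise ValueError("no candidates to score")
    in_mw = cands & mw
    in_ox = cands & oxford
    in_print = in_mw | in_ox          # in either print dictionary
    in_wikt = cands & wikt
    return {
        "mw": len(in_mw) / total,
        "oxford": len(in_ox) / total,
        "both_print": len(in_mw & in_ox) / total,
        "either_print": len(in_print) / total,
        "wiktionary": len(in_wikt) / total,
        "missing": len(cands - in_print - in_wikt) / total,
    }


# Toy data for illustration only: made-up membership, not real dictionary contents.
candidates = ["hot dog", "red tape", "high school", "fast food", "couch potato",
              "give up", "boiling water", "paper towel", "cold weather", "front door"]
mw = {"hot dog", "red tape", "high school"}
oxford = {"hot dog", "fast food"}
wikt = {"hot dog", "red tape", "couch potato", "give up"}

summary = coverage_summary(candidates, mw, oxford, wikt)
for source, share in summary.items():
    print(f"{source}: {share:.0%}")
```

In the toy data, as in the article's figures, `either_print` equals `mw + oxford - both_print`.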
The opaque compounds and phrasal verbs get the best coverage in Wiktionary. The named entities (proper nouns) are mostly encyclopedic topics—so they’re in Wikipedia. The technical terms drift into jargon as they get more obscure, but jargon isn’t useful for word games because too few players know it.
But see all the gold at the top of those little graphs for “transparent” and “semi-opaque”? That’s a deep reservoir of real MWEs—and we’ve been conditioned by dictionaries not to expect it. The old guard dictionaries, colored above in dark blue and dark green, overwhelmingly ignore them.
But even among the transparent expressions, some carry more conceptual weight than their words alone. “Yellow pear” is just describing. “Boiling water” names a thing (what linguists call the naming function): a state of matter, a hazard, a cooking stage. The difference? One could have been a word.
Estimate source: These numbers are illustrative—no absolute counts exist. From 500K single-word terms, we sampled and classified parts of speech, including inflected forms and rare terms: ~325K nouns, 60K verbs, 85K adjectives (roughly twice WordNet’s counts). We multiplied across grammatical patterns. To estimate how many “make sense,” we randomly paired words: noun+noun 12% sensible, adjective+noun 64%, verb+noun 40%. The 30 billion is a weighted estimate.
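The arithmetic behind an estimate like this can be sketched for the three patterns the note names. This is a minimal illustration, not the actual calculation: it uses only the quoted part-of-speech counts and sensible-pair rates, and the names `POS_COUNTS`, `SENSIBLE_RATE`, and `pattern_estimates` are hypothetical. The totals it prints will not match the 250 billion and 30 billion figures, which rest on a fuller set of grammatical patterns and a weighting the note does not spell out.

```python
# Approximate part-of-speech counts quoted in the estimate note.
POS_COUNTS = {"noun": 325_000, "verb": 60_000, "adjective": 85_000}

# Sensible-pair rates quoted for three patterns. Other patterns are not
# listed in the note, so they are left out here (an assumption).
SENSIBLE_RATE = {
    ("noun", "noun"): 0.12,
    ("adjective", "noun"): 0.64,
    ("verb", "noun"): 0.40,
}


def pattern_estimates(pos_counts, rates):
    """Map each pattern to (possible pairs, estimated sensible pairs)."""
    rows = {}
    for (first, second), rate in rates.items():
        possible = pos_counts[first] * pos_counts[second]
        rows[(first, second)] = (possible, possible * rate)
    return rows


rows = pattern_estimates(POS_COUNTS, SENSIBLE_RATE)
total_possible = sum(possible for possible, _ in rows.values())
total_sensible = sum(sensible for _, sensible in rows.values())

for (first, second), (possible, sensible) in rows.items():
    print(f"{first}+{second}: {possible / 1e9:.1f}B possible, "
          f"{sensible / 1e9:.1f}B sensible")
print(f"weighted sensible share: {total_sensible / total_possible:.0%}")
```

The weighted share is a size-weighted average. Noun+noun has by far the most possible pairs, so its low rate pulls the overall share well below the adjective+noun rate.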
Here are examples of phrases that are cousins of words that did get lexicalized:
| Got a word | Didn’t |
|---|---|
| frozen water → ice | boiling water |
| eyes closing briefly → blink | closed eyes |
| dull persistent pain → ache | severe pain |
Should English have created a word for severe pain, other than the more emotionally tinged “agony” or “anguish”? I don’t know. But “severe pain” is tight enough to be a useful token in a word game. It names a thing.
Named entities
This analysis is about timeless words, not trivia. So I excluded 133K named entities—proper nouns like “New York City,” “Albert Einstein,” and “World War II.” Named entities are well-covered by Wikipedia, as are 47% of “technical” MWEs (“Parkinson’s disease”).
Scrabble famously excludes proper nouns. Crossword puzzles embrace them. Choosing how many proper nouns to include in a word game is tangled with cultural assumptions. Easy trivia is pointless, and hard trivia is alienating. Should all U.S. presidents be fair game? What about K-pop stars? British monarchs? Nobel laureates? The “right” answer depends on your audience, and that’s a design question, not a linguistic one.
The words that aren’t words
The line between “word” and “phrase” is fuzzier than dictionaries suggest. Words tend to carry more conceptual weight—but not always. Some words probably didn’t need to be coined; some phrases deserve more credit than they get. English is full of two- and three-word phrases that function as single semantic units—effectively “words”—but because they’re not fused into one orthographic word, they’re invisible to dictionaries and underappreciated as vocabulary. “Kindergarteners” is in the dictionary, but not “middle schoolers.” Newer professions—“car salesman,” “vacuum salesman,” “balloon seller”—are two words, but ancient professions might have a one-worder: jeweler, florist, fishmonger, tobacconist, stationer, mercer.
Meanwhile, in Spanish…
Other languages sometimes have tidy one-worders for things English can only describe. Spanish carves up time with precision English lacks: madrugada for the pre-dawn hours, atardecer for late afternoon waning into evening. The mid-day nap was so compelling we adopted the siesta into English.
But concise one-word equivalents in other languages appear only among the top 2K most familiar MWEs, and even then rarely. It’s as if languages collectively decided these concepts weren’t worth crystallizing into single words.
Exceptions exist—Norwegian forelsket for the euphoria of falling in love, Danish hygge for candlelit contentment, Portuguese saudade for longing. But these are famous precisely because they’re rare. Beyond the top tier of familiarity, the well runs dry everywhere. No language has a single word for “parking lot frustration” or “Sunday evening dread.” Generally, if English describes it with a phrase, so does everyone else.