This story began two years ago, during the pre-launch of my book on the cognitive techniques of mathematicians. As authors do in those circumstances, I sent a handful of copies to selected high-profile people, in the hope of sparking a public conversation.
I had a fantastic success rate with mathematicians. I sent a copy to Pierre Deligne, who wrote back saying he loved the book, “especially the insistence that learning or doing math is about changing one’s brain.” I sent a copy to Terry Tao, who blurbed the book. It was harder to get my foot in Steve Strogatz’s door, as he initially declined to even receive the book. But I insisted and, after a few days, he sent back a warm thank you note. He then blurbed the book and posted a big shout-out on the launch day.
Outside of academia, the pattern was quite different. Most people never responded. One exception was Paul Graham, who kindly replied to my two-liner cold email.
I knew it was a long shot, but I couldn’t not try. First, because I admire him and enjoy his writing on software and startups—and even more so his writing on writing itself and its transformative powers, which is remarkably aligned with central ideas from my book (and this essay.) This made me feel a genuine intellectual kinship.
Second, because I noticed that he was regularly posting on mathematics. It all clicked into place when I found out that his father was a mathematician—the alignment was no coincidence.
This made me excited and confident that he would love the book. But, unfortunately, I never heard from him again.
Let it be clear that I view this as completely normal. He is a busy person with a million competing interests and, no doubt, a massive reading backlog. Put simply, he owes me nothing.
But this was still frustrating as I had a third, more important reason for sending him the book: I wanted to engage with his thinking on a topic where I fundamentally disagree with him—genetic determinism.
Hereditarian statements are quite common in Paul Graham’s essays, though most of them are casual remarks that are peripheral to the core message. A good example is this passage from his otherwise great post on determination:
A good deal of willfulness must be inborn, because it’s common to see families where one sibling has much more of it than another. Circumstances can alter it, but at the high end of the scale, nature seems to be more important than nurture.
There isn’t much to object to here, as he is using the appropriate amount of hedging (“must be”, “seems”). But the passage does reveal a strong hereditarian prior.
I brushed this one off as an “accident”, and filed it under the category of “unexamined hereditarianism”. The logical gap in the argument was so blatant that it was clear to me that he hadn’t put much thought into it. Indeed, the fact that a trait is striking, precocious, hard to nurture, and unexplained by the social context doesn’t make it innate.
This kind of hereditarianism was highly prevalent in the pre-genetics era and I’ve learnt not to take it at face value. A famous example is Henri Poincaré’s quip that “Mathematicians are born, not made”, which always struck me as about as rigorous as Brillat-Savarin’s earlier quip that “One becomes a cook, but one is born a roaster.”
My fundamental disagreement isn’t about that—it’s about his sharing on X of pseudoscientific content produced by hereditarian blogger Cremieux, offering him a massive platform.
This is where this story starts for good—two years ago, I stumbled upon this powerful tweet by Paul Graham and almost fell off my chair:
According to this data assembled by @cremieuxrecueil, the difference in IQ between identical twins reared separately is only half a point greater than the difference between two tests of the same person.
This wasn’t a casual remark. This was hardcore genetic determinism, coming from a respected public figure, and backed by a striking visual that cited five independent scientific studies.
This tweet was viewed 3.5 million times. I can’t speak for the other 3,499,999, but I found it disturbingly persuasive.
The takeaway, it seemed, was pretty simple: when you separate two identical twins at birth, raise them in two random families, and test their IQs in adulthood, the two results are barely more divergent than two different tests of the same person.
I was genuinely shocked. My “centrist” views on heredity and intelligence—that genes matter but are only one factor—were being challenged by clear-cut empirical evidence. So were my views on mathematics, the core ideas in my book, and my own understanding of my life journey.
To explain why this visual was so destructive to my worldview, I need to get into the technicalities and, more specifically, into the notion of heritability, a statistical measure of how much genetics influence a given trait.
While heritability is an imperfect notion—with nasty caveats that are beyond the scope of this post—it has become a de facto standard and, for better or worse, the complex debate on cognitive inequality is often reframed as a one-dimensional debate on the heritability of IQ.
A pure blank-slatist would put it at 0%. A pure hereditarian would put it at 100%. Any reasonable person would put it somewhere in the middle, leaving two questions unresolved: where exactly, and what does the figure even mean?
The three simulations below illustrate three potential values for the heritability of IQ: 30%, 50%, and 80%. In each case, the dots represent 1000 random people, each placed according to their genetic potential for IQ (horizontal axis) and their actual IQ (vertical axis). Heritability measures how close the dots come to fitting on a line. Mathematically, it is defined as the R^2 of the linear regression.
At 30%, one does observe a faint correlation between genetic potential and IQ. The correlation becomes clearer at 50%, while remaining quite noisy. This is an essential aspect to keep in mind: 50% may sound like a solid heritability figure, but the associated correlation is rather modest. It’s only at 80% that the picture starts to “feel like” a line.
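For readers who want to reproduce this kind of picture, here is a minimal sketch of such a simulation, under a toy additive model in which IQ is the sum of a genetic component and independent noise, with the variance split to match the target heritability (the function name and parameters below are mine, not taken from any study discussed here):

```python
import numpy as np

def simulate_population(heritability, n=1000, mean_iq=100, sd_iq=15, seed=0):
    """Toy additive model: IQ = genetic potential + independent noise.

    The variance is split so that the genetic component accounts for
    `heritability` of the total variance, which makes the R^2 of the
    regression of IQ on genetic potential equal to the heritability
    (in expectation).
    """
    rng = np.random.default_rng(seed)
    genetic = rng.normal(0, np.sqrt(heritability), n)      # genetic potential
    noise = rng.normal(0, np.sqrt(1 - heritability), n)    # everything else
    iq = mean_iq + sd_iq * (genetic + noise)
    # Rescale the genetic component into IQ units for plotting; a linear
    # rescaling does not change the correlation.
    potential = mean_iq + sd_iq * genetic / np.sqrt(heritability)
    return potential, iq

for h2 in (0.3, 0.5, 0.8):
    potential, iq = simulate_population(h2)
    r2 = np.corrcoef(potential, iq)[0, 1] ** 2
    print(f"target heritability {h2:.0%} -> observed R^2 {r2:.2f}")
```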
Heritability is, by construction, a population-level aggregate. Before it can inform policy-making (or even personal decision-making), it must be interpreted at the level of individuals. This is where things get interesting and counterintuitive.
Let’s say, for example, that you are a genetically average person. How much does that affect your prospects?
Surprisingly, at 30%, it’s as if your genes didn’t matter at all. With an average potential, you still have a decent chance of landing at the top or bottom of the IQ distribution. Actually, in this specific random sample, one of the three smartest people around (the top 0.3%) happens to have an almost exactly average genetic make-up, and the fourth dumbest person has a slightly above-average potential.
At 50%, being genetically average starts to limit your optionality, but the spread remains massive. Had you been marginally luckier—say, in the top third for genetic potential—you’d still have a shot at becoming one of the smartest people around.
At 80%, though, your optionality has mostly vanished. It’s still possible to move a notch upward or downward, but the game is mostly over. In this world, geniuses are born, not made.
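To put rough numbers on the individual-level picture, here is a quick back-of-the-envelope calculation under the same toy additive model as above: the probability that a genetically exactly average person still lands in the population’s top 2% (the 2% cutoff is an arbitrary illustration of mine, not a figure from the literature):

```python
from scipy.stats import norm

# Under the toy additive model, the IQ of a person with average genetic
# potential is normally distributed around the population mean with residual
# standard deviation sqrt(1 - h2), in population-sd units. The chance of
# clearing the population's top-2% threshold shrinks fast as h2 grows.
top_fraction = 0.02
threshold = norm.ppf(1 - top_fraction)   # ≈ 2.05 population sds

for h2 in (0.3, 0.5, 0.8):
    residual_sd = (1 - h2) ** 0.5
    p = 1 - norm.cdf(threshold / residual_sd)
    print(f"h2 = {h2:.0%}: P(top 2% | average genes) ≈ {p:.4%}")
```

Under this toy model, the odds at 30% heritability are reduced relative to the 2% baseline but far from negligible; at 80% they essentially vanish.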
This discussion is generally omitted by hereditarians, which is unfortunate, because it is the only way to clarify the stakes. There is a fundamental asymmetry in the debate. Heritability matters a lot when it is extremely high, because it then supports genetic determinism, but for the rest of the range the exact figure has little practical significance.
This is why “heritability centrists” rarely pay attention to the active research in behavioral genetics. Whether the heritability of IQ ends up being 15%, 30%, or 50%, this is “interesting” but not “Earth-shattering”. Of course, if the true figure is 80%, then it’s an entirely different story.
Speaking as a mathematician and author of a book on mathematical cognition, I should add two important remarks.
First, while social media thrives on memes about mathematical genius and the “IQ red pill”, mathematicians themselves aren’t that interested in IQ. This might be because they understand how the concept is defined. By design, IQ is a statistical construct that captures some common denominator, the “g factor”, among a variety of cognitive tasks. It’s as if someone had pooled a large number of biomarkers to extract a general health score, the “h factor”.
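To give a rough sense of the kind of construction involved, here is a toy sketch: simulate scores on a battery of tests that all tap a single shared latent factor, then extract that common denominator as a first principal component. Real psychometrics relies on proper factor-analytic models, and every number below is made up for illustration:

```python
import numpy as np

# Simulate a battery of cognitive tests that all load (imperfectly) on a
# single shared latent factor, then recover that factor as the first
# principal component of the standardized score matrix.
rng = np.random.default_rng(0)
n_people, n_tests = 2000, 8
latent_g = rng.normal(size=n_people)                  # the unobservable factor
loadings = rng.uniform(0.4, 0.8, size=n_tests)        # each test taps it imperfectly
scores = np.outer(latent_g, loadings) + rng.normal(scale=0.7, size=(n_people, n_tests))

z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
_, _, vt = np.linalg.svd(z, full_matrices=False)
g_estimate = z @ vt[0]                                # crude "g" score per person

print("correlation between estimated and true factor:",
      round(abs(np.corrcoef(g_estimate, latent_g)[0, 1]), 2))
```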
That doesn’t mean the construct is useless. The concept of good health is a meaningful one and simplistic proxies like “biological age” do make sense in particular contexts. IQ was designed as a cheap, scalable way of identifying students with learning disabilities and, for this specific use case, it is reasonably pragmatic.
Do note that the “g factor” is an abstraction that is impossible to directly measure. All you can get is a proxy that has some degree of correlation (say, 80%) with the actual thing you care about. Given that meaningful real-life cognitive abilities such as “mathematical talent” (whatever the expression means) are themselves only correlated with the “g factor”, you end up with a complex chain of more-or-less tight correlations that, as a whole, has no credible reason to be tight.
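Here is a minimal numerical sketch of that composition effect, under the simplest possible assumption that the test score and the real-life ability are related only through the “g factor” (the 0.8 and 0.6 link values are illustrative placeholders, not estimates from the literature):

```python
import numpy as np

# A toy linear chain: test score <- g -> "math talent", where the test and
# the talent are each noisy functions of g and share nothing else. Under this
# (strong) assumption, the end-to-end correlation is the product of the two
# link correlations, so two individually decent links already compose into a
# fairly loose overall association.
rng = np.random.default_rng(1)
n = 100_000
g = rng.normal(size=n)

def noisy_proxy(signal, target_corr):
    """Return a variable whose correlation with `signal` is ~target_corr."""
    noise_sd = np.sqrt(1 / target_corr**2 - 1)
    return signal + noise_sd * rng.normal(size=n)

test_score = noisy_proxy(g, 0.8)
math_talent = noisy_proxy(g, 0.6)

print("test <-> g:      ", round(np.corrcoef(test_score, g)[0, 1], 2))
print("talent <-> g:    ", round(np.corrcoef(math_talent, g)[0, 1], 2))
print("test <-> talent: ", round(np.corrcoef(test_score, math_talent)[0, 1], 2))  # ≈ 0.8 × 0.6
```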
That is, unless the “g factor” captures a fundamental biological reality, in the same sense a CPU has hardware-defined characteristics like clock speed or L1 cache.
This is where the asymmetry becomes abysmal:
If the heritability of IQ is 50%, then I’d expect the correlation chain to be loose, and I wouldn’t be surprised if the heritability of math talent ended up around 20% or 30%.
By contrast, if the heritability of IQ is 80%, then I would eat my hat and concede that mathematical talent is primarily innate and mostly immutable. This is because 80% heritability would indicate that the “g factor” is a core metric of the human body, hardcoded at the proteomic level. And, given that Raven’s Progressive Matrices, one of the most g-loaded IQ tests, has such a strong smell of undergraduate algebra (it’s all about 3-cycles and permutation matrices), I would have to admit that this hardcoded ability is a close cousin of mathematical talent.
The second essential remark concerns the non-genetic part of cognitive inequality. A blank-slatist would claim that this is all social determinism. I absolutely don’t believe that and this is precisely why I insist on telling this story.
For all their performative battles, hereditarians and blank-slatists operate within the same deterministic frame (genetic determinism vs social determinism), denying any meaningful role for human agency and the messy, noisy process we call life.
This is where I strongly align with Paul Graham’s hacker ethos, clearly recognizable in this passage from his piece on intelligence:
I predict the gap between intelligence and new ideas will turn out to be an interesting place. If we think of this gap merely as a measure of unrealized potential, it becomes a sort of wasteland that we try to hurry through with our eyes averted. But if we flip the question, and start inquiring into the other ingredients in new ideas that it implies must exist, we can mine this gap for discoveries about discovery.
My book is literally about this “interesting place”. I have spent two decades mining the metacognitive gap between my (seemingly limited) innate ability and actual mathematical talent. Along the way, I became capable of producing research in algebra and geometry at a level I had never dreamt of.
From a behavior genetics perspective, the wiggle room between genetic and social determinism is captured by Turkheimer’s Third Law: “A substantial portion of the variation in complex human behavioral traits is not accounted for by the effects of genes or families.”
Not all this “substantial portion” is amenable to mining, and it could be that extreme mathematical talent results from idiosyncratic cognitive development that is impossible to replicate. But, on the other hand, there is a long tradition of mathematical “geniuses” such as Descartes, Einstein, or Grothendieck, who insisted that they had no special talent and owed their success to particular habits they happen to have developed (Descartes goes as far as calling this a “method”.)
I completely get why most people would assume that mathematical talent is innate: the subject is confusing and hard to teach, and some random kids are fabulously brilliant for reasons no-one fully understands. Yet there is an unspoken, latent consensus within the mathematical research community that one can get better by adopting the right mindset and attitude, becoming fearless in the face of one’s limitations, and practicing the right “unseen actions” in one’s head with an inordinate amount of patience, curiosity, and persistence.
This mindset and attitude are the central topics of my book, and I believe that there are important takeaways for everyone—most notably, that anyone can make stellar progress. As Deligne puts it, “learning or doing math is about changing one’s brain”. This implies that cognitive transformation is possible, but it doesn’t imply that it is easy, nor that everyone has an equal chance.
But I don’t get to choose the heritability of IQ so that it fits my existing beliefs. A key lesson of mathematics is that if your beliefs are wrong, then there’s no point waiting to revise them.
If the true figure was proven to be 80%, then I would readily admit that I was delusional—and that Deligne was delusional with me, alongside Descartes, Einstein, and Grothendieck.
This is why Cremieux’s visual was so shocking to me. Taken at face value, it presented unobjectionable evidence that the heritability of IQ was extremely high, north of 80%.
But was this interpretation even correct?
Past the initial shock, I noticed the scale on the left and realized where the curious pixelated appearance was coming from—it was simply showing the actual granularity.
All five studies are very small, with a cumulative sample size well under 200 pairs. My analysis below focuses on the three largest ones (Burt, Minnesota, and Shields), all based on datasets of about 40-50 pairs. The remaining two studies are too tiny to justify a separate discussion, and they suffer from the same issues anyway.
The second thing that caught my attention was the brutal cliff, which felt unnatural. I would have expected a smoother, bell-curve-like drop.
Upon closer inspection, the cliff was primarily attributable to the bizarre distribution of the yellow dataset (Burt). This was the first study I searched for, and I was in for a big surprise.
As a non-specialist and a long-term heredity centrist with little interest in the detailed picture, I hadn’t paid much attention to the troubled history of the IQ wars. And, in particular, I was unaware that Cyril Burt had been at the center of possibly the biggest scientific fraud scandal of the 20th century, precisely because of this very study. Most of his “twins reared apart” never existed, in the literal sense, as his dataset was a fabrication (hence the abnormal cliff).
I was left with two non-fraudulent, non-microscopic studies of “twins reared apart” whose primary publications were easy to retrieve:
Shields, J. (1958), Twins brought up apart, Eugenics Review, 50 (2), 115-123.
Bouchard, T. J., Lykken, D. T., McGue, M., Segal, N. L., & Tellegen, A. (1990), Sources of human psychological differences: The Minnesota study of twins reared apart, Science, 250 (4978), 223-228.
I am including the actual papers because I found them both readable and eye-opening. As my conclusion below will be pretty damning, I want everyone to be able to check for themselves.
Here’s the bottom line: I have no idea what motivated Cremieux to include Burt’s fraudulent data, but even without it his visual is highly misleading, if not manipulative.
It is very unfortunate that 3.5 million people were exposed to it. I suspect that only a tiny fraction looked past the initial impression, and that the vast majority came away believing that these studies prove something massive about the heritability of IQ.
As I’ll argue, the raw data might be correct, but it is entirely inconclusive, for reasons that the authors of these studies were perfectly aware of and that clearly transpire from an honest reading of their publications. There is no reason to doubt that the twin pairs studied by Shields and Bouchard were actual “twins reared apart”, in a lax sense. But the mythical “twins reared apart” that enable a clean separation of nature and nurture simply do not exist.
Shields himself was extremely cautious in the concluding section of his 1958 article:
As I pointed out at the beginning, the study is not quite complete and the conclusions so far must be regarded as tentative.
… it must be agreed that in some respects the social differences between the families in which most of our twins were brought up were not very large. It would, therefore, be unwise when we come to evaluate the final results, to claim that genetical factors would show up as clearly if the environments of the twins were radically different.
Here he was alluding to a well-known shortcoming of his approach, the range-restriction of adoptive families. But this is only the last in a list of four major biases that affect his study and that of Bouchard.
There is a remarkable 1981 video interview of Thomas Bouchard (conducted by Niels Juel-Nielsen, the author of one of the tiny studies), where Bouchard explains how he fell in love with the approach:
I just thought that was such a beautiful experimental design. I love simple designs that are cutting and definitive.
I agree that there is something irresistible in the idea of studying identical twins reared apart. It feels like a “natural experiment” that would magically tell nature and nurture apart. In fact, this appearance of simplicity, of being “cutting and definitive”, is precisely what made Cremieux’s visual so compelling and viral.
But even under unrealistically perfect conditions, with twins truly separated at birth and placed in truly random families—as we’ll see, that was far from being the case—the experiment would still be confounded by a substantial bias: twins share a womb for nine months.
Bouchard was perfectly aware of this issue, as he noted in his publication:
MZA twins share prenatal and perinatal environments, but except for effects of actual trauma, such as fetal alcohol syndrome, there is little evidence that early shared environment significantly contributes to the variance of psychological traits.
But, disappointingly, his study contents itself with the “little evidence” argument and doesn’t even attempt to estimate this bias.
To a modern reader, it is surprising that the referees found that acceptable. One likely explanation is that the importance of prenatal and perinatal environments wasn’t fully understood at the time.
Since 1990, we have collected massive evidence that prenatal and perinatal environments are major contributors to the population-level IQ variance, and anyone who’s ever been around a pregnant woman is aware of that.
Indeed, pregnancy and birth are critical periods for cognitive development, with a multitude of well-documented risk factors:
fetal alcohol syndrome, which remains today one of the top contributors to population-wide IQ variance. The situation was likely much worse in the time of Bouchard, as his cohorts were born before the US Surgeon General advised women not to drink during pregnancy, and decades before FAS was recognized as the tip of a much broader spectrum. It was only recently agreed that even occasional drinking is damaging (these labels on wine bottles are there for a reason).
maternal smoking (which, again, was much more prevalent in Bouchard’s cohorts).
lead exposure (not anecdotal either, given the time period).
birth asphyxia (twins are at higher risk; it might affect one twin, or both).
congenital infections (rubella, cytomegalovirus, and many others).
nutrient deficits (folic acid, etc.)—just take a look at the recommended supplements for pregnant women.
These non-genetic factors are far from negligible and, in fact, they dominate the list of known direct causes of cognitive disabilities, alongside a few genetic and chromosomal disorders. Once you aggregate their contribution, plus a provision for unquantified prenatal factors that are still under the radar, it becomes hard to believe that heritability could be above 80%, as this would leave very little variance for everything else—that is, for the whole post-uterine experience (what we call life.)
Regardless of their prevalence in the general population, these risks are compounded for twins reared apart, as 1/ twin pregnancies have a much higher chance of complications, and 2/ pregnancies resulting in adoptions have a much higher chance of being associated with inadequate prenatal care.
Note that the existence of confounders isn’t necessarily a death blow to the approach. By estimating their impact, one might manage to correct for them.
In fact, it does seem that Bouchard and his team attempted to do just that, and this is where the story becomes truly bizarre. But, wait, we have three more biases to cover before we get there.
On average, the twins in the Bouchard study were separated at 5.1 months. In the Shields study, 25 pairs were separated during their first year, but 6 were separated during the second year, 4 between ages two and four, and individual pairs were separated as late as seven, eight, and nine years of age.
In other words, the phrase “reared apart” should be taken with a substantial grain of salt. It subliminally suggests “separated at birth”, while it effectively means “separated in early childhood or later.”
Whatever can be said about bias #1 applies verbatim here. In The Blank Slate, Steven Pinker does argue that our societies overestimate the importance of the first few months. He might be correct, but it is difficult to believe that these months do not matter at all. (I’m writing this as the father of two kids, who was genuinely surprised by the intensity of the emotional bond and communication abilities that develop from the very first days.)
Some kids are properly fed, loved, cared for, talked to, and some are strapped in bouncers in front of Cocomelon. And, in all likelihood, this isn’t a cosmetic nuance.
To make things worse, the kids in the Bouchard and Shields studies are unlikely to have experienced uneventful early childhoods—a telltale sign is that they were eventually put up for adoption. Which brings us to the next bias:
Shields gives a fascinating account of what triggered these adoptions:
The reasons for these twins being parted were various. Illegitimacy accounts for six pairs, the death of the mother for eight. The most frequent reason was that the mother was ill, pregnant again or considered unable to look after both twins [...] In two cases the children were removed from home by the Poor Law authorities. In one case the father sold one twin to settle his debts.
For all their diversity, these twin pairs all had one thing in common: they experienced something out of the ordinary and—except in the last example—they experienced it together.
Whether their mother was putting booze in the baby formula, whether she had paranoid schizophrenia, whether she died from a rare monogenic disease, or whether both parents died in the sinking of the family yacht, these extraordinary events constitute a massive selection bias for a multitude of confounders that will mess up the analysis if you fail to control for them.
The first three biases are easy to grasp, because it’s intuitively obvious that congenital lead exposure or dysfunctional postnatal care would harm both twins in ways that would then be misattributed to genetics.
This fourth bias is more subtle because it is fundamentally about population-wide statistics.
For the experimental design to be “cutting and definitive”, as Bouchard dreamt it to be, the twins must be selected at random (bias #3), incubated in 100%-consistent artificial wombs (bias #1), and placed right after birth (bias #2) in truly random families—which never happens in real life.
This is bias #4. In our societies, adoptive families are the opposite of random. They tend to have specific family structures, live in specific neighborhoods, and share a specific set of values and cultural habits. The simple act of wanting to adopt a kid—and being entrusted with one—filters out the most dysfunctional environments.
If you take two seeds from the same apple and plant them in separate pots, you might be able to derive meaningful insights on the relative importance of the environment. That is, unless you place them side-by-side on the same balcony.
In The Blank Slate, in the chapter where he argues that adoption studies (not just of twins) suggest that family environments matter much less than we used to think, Pinker adds this essential caveat:
An important proviso: Differences among homes don’t matter within the samples of homes netted by these studies, which tend to be more middle-class than the population as a whole. But differences between those samples and other kinds of homes could matter. The studies exclude cases of criminal neglect, physical and sexual abuse, and abandonment in a bleak orphanage, so they do not show that extreme cases fail to leave scars.
What if these extreme cases were the childhood equivalent of fetal alcohol syndrome, maternal smoking, cytomegalovirus, and extreme preterm births—rare occurrences that contribute an inordinate share of the population-wide IQ variance?
Keep in mind, here, that the variance is the average of the square of the distance to the mean: a single individual who is three standard deviations away from the mean contributes as much as 36 people who are half a standard deviation away from the mean.
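In case the arithmetic isn’t obvious: the first person contributes (3 standard deviations)^2 = 9 units of squared distance, while each of the others contributes (0.5)^2 = 0.25, and 9 / 0.25 = 36.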
These aren’t new insights brought by our modern sophistication. Shields was fully aware of bias #4 as early as 1958 and, in fact, this is precisely why he was so prudent in his conclusions. He had good reason to be prudent, for his dataset was catastrophically confounded by range-restriction. Just read his own account:
In twelve cases the twins were brought up in families that were unrelated to one another. In the remainder they were brought up by different branches of the same family, in twenty cases one twin staying with his own mother.
We should praise Shields for his insight and honesty. He tried something, it didn’t work, and that’s fine—this is what good science often looks like.
But, frankly, it doesn’t make any sense to repackage his results into shiny visuals, as if they held any significance. At this point, from an insights perspective, this is pretty much game over for the Shields dataset.
We are left with only the Bouchard study and this is where things are going to get… how to put it?… “embarrassing”.
As he often recounted (for example in the interview with Juel-Nielsen quoted above), Bouchard started the massive Minnesota Study of Twins Reared Apart after he got word of the “Jim Twins.”
James Arthur Springer and James Edward Lewis were born in 1940. They had been separated soon after birth and placed in separate adoptive families in Ohio. Their reunion, in early 1979, became an instant media sensation, owing to the crazy “coincidences” in their story. All this, alongside Bouchard’s early interest in their case, is documented in this December 1979 article in the New York Times:
Each married and then divorced a woman named Linda. Their second wives were both named Betty.
Springer named his first son James Allan. Lewis named his first son James Alan.
Each man grew up with an adopted brother named Larry.
During childhood, each owned a dog named Toy.
Both twins had law‐enforcement training and had worked part time as deputy sheriffs in their Ohio towns 70 miles apart.
They shared many common interests, such as mechanical drawing, block lettering and carpentry.
….
When the Jim Twins were reunited, Bouchard was a 41-year-old academic. His instinct told him that this was his moment—his unique opportunity to leave a durable mark on science—and he went on to bet the rest of his career on the project.
It’s a pity that Bouchard had such a poor statistical intuition, as it would have been the right time to pause for a serious reassessment of the Jim Twins miracle.
In statistics, when coincidences are too good to be true, there’s a good chance that something nasty is going on in the background. True coincidences do happen, acts of God or random “flukes” from which there is absolutely nothing to learn. But I would never use this argument to brush off the Jim Twins story, as I have never seen a fluke of this magnitude.
Alternatively, it could be that the data was fabricated. But I found no evidence that the Jim Twins story wasn’t authentic.
Are we all good? Can we move on to accepting the Jim Twins story as a compelling case of genetic determinism? Not quite, because we still haven’t ruled out the nastiest of all scenarios—that the Jim Twins story might have been authentic, but cursed, and Bouchard should have run away from it because it meant the exact opposite of what he thought it meant.
After I quit my permanent academic job in pure mathematics, I went on to launch a tech startup that specialized in mining the latent sociological information present in seemingly innocuous consumer data such as first names, zipcodes, transaction histories and minute behavioral proxies.
I never dated any Linda, nor any Betty, but I doubt this says much about my genes. I rather suspect that it says something about my year of birth and social background (and the fact that I grew up in Paris.)
Likewise, I’m inclined to think that, if both Jim Twins had an adopted brother named Larry, this is telling us something about the psychosocial proximity of their adoptive families, rather than the telepathic influence of genes.
Interestingly, the latent feature space of pet names alone already captures incredibly subtle information about the cultural backgrounds and values of their owners.
The Jim Twins story might be meaningless from a population genetics perspective, but still carry a mesmerizing lesson on the social conformism of the white family-oriented middle-class in suburban Ohio in the 1940s and 1950s.
One common weakness of the Shields and Bouchard studies is their reliance on “naturally occurring” pairs of “twins reared apart”. What if we could deliberately create such pairs?
This sounds like a Nazi experiment, yet a rogue adoption agency did try to pull it off. This incredible story is related in Three Identical Strangers, a 2018 documentary about identical triplets who were deliberately placed, respectively, in a “working-class” family, a “middle-class” family, and an “upper-class” family.
The documentary is both hilarious and tragic, and holds several fascinating lessons. First, of course, that facial features and overall appearance are strongly determined by genes—but we already knew that.
Second, that genetics alone can’t predict life outcomes. The documentary does mention that all three triplets suffered from mental health problems in their youth—which is hardly exceptional in the general population and, in fact, fairly common among adoptees. Yet David and Robert, two of the triplets, seem to have moved forward reasonably well into middle-age, while Edward sadly took his life at age 34.
Third, that true randomization is hard. Indeed, even this deliberate, unethical manipulation wasn’t immune to range-restriction. The documentary opens with an account of the event that triggered the reunion: a farcical story of mistaken identity in the dorms of Sullivan County Community College, where two of the triplets had been students.
Quite a striking coincidence, isn’t it? Either the triplets carried a yet-undiscovered Sullivan County Community College gene, or something went horribly wrong in the randomization process.
It’s time for the most troubling part of the analysis.
Having led the Minnesota study throughout the 1980s, Bouchard cannot plausibly have been unaware of bias #1 (mentioned in his paper), bias #2 (the paper provides the mean and standard deviation of the separation age), or bias #4 (already discussed by Shields in 1958.)
It isn’t plausible either that he was unaware that there was a simple method for estimating these biases and filtering them out, since he did exactly what one is supposed to do to apply the method.
Indeed, at the end of the above excerpt from his 1981 discussion with Juel-Nielsen, Bouchard reveals that he is simultaneously building a dataset of fraternal twins reared apart. This is the first step of the process, after which one can proceed to a differential analysis of the IQ correlations between MZA (short for monozygotic twins reared apart) and DZA (short for dizygotic twins reared apart) pairs.
The DZA sample serves as a control group. It is subject to the same exact biases as the MZA dataset, but DZA twins share only 50% of their genome while MZA twins share 100%. From there, one can recover a debiased estimate of the heritability. This makes use of Falconer’s formula, a statistical trick that rests on simplifying assumptions never satisfied in real life, but that provides a reasonable first-order approximation.
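For concreteness, here is a minimal sketch of what such a differential analysis could look like, combining Falconer’s formula h^2 = 2(r_MZ − r_DZ) with a standard Fisher z-test for comparing two correlations. The correlation values below are placeholders of my own, not the Minnesota data:

```python
import numpy as np
from scipy import stats

def falconer_h2(r_mza, r_dza):
    """Falconer's first-order estimate: h^2 = 2 * (r_MZ - r_DZ).

    MZ twins share ~100% of their genome and DZ twins ~50%, so under the
    formula's (strong) additive assumptions the excess MZ correlation
    reflects the extra 50% of shared genes.
    """
    return 2 * (r_mza - r_dza)

def correlations_differ_pvalue(r1, n1, r2, n2):
    """Two-sided test for r1 != r2 via Fisher's z-transform."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)
    se = np.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    return 2 * (1 - stats.norm.cdf(abs(z)))

# Placeholder correlations with the published sample sizes (56 MZA, 30 DZA):
r_mza, n_mza = 0.70, 56
r_dza, n_dza = 0.40, 30
print("Falconer h^2 estimate:", round(falconer_h2(r_mza, r_dza), 2))
print("p-value for r_MZA != r_DZA:",
      round(correlations_differ_pvalue(r_mza, n_mza, r_dza, n_dza), 3))
```

Note that even with the sizable gap between these placeholder correlations, samples of 56 and 30 pairs do not quite reach conventional significance (p ≈ 0.06)—a taste of the statistical-power issue discussed further down.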
The technique is standard in behavior genetics. It is the bread-and-butter technique of “twins reared together” studies, and everyone agrees that biased correlations cannot be taken at face value.
It is impossible that Bouchard, publishing in 1990, could have been unaware of any of this. The entire discussion—from the imperative need to correct for biases to the standard methodology for doing so—was fully laid out in Douglas Scott Falconer’s seminal book, Introduction to quantitative genetics, whose first edition was published in 1960.
Now imagine that you have spent the past ten years assembling the best available sample of MZA and DZA pairs. You are about to publish a landmark study aiming to provide a “cutting and definitive” estimate of the heritability of IQ, resolving a major scientific debate. Which of these options would you choose?
1/ Base your estimate on the standard methodology, a differential analysis of MZA and DZA pairs.
2/ Present the MZA data, then argue through ad hoc, complex reasoning that biases are under control—without ever showing any DZA data to back that claim.
Strangely, Bouchard and his team went for option 2. Moreover, they made the fabulously awkward decision to acknowledge the existence of a DZA control group while simultaneously withholding the data, citing a dog-ate-my-homework excuse:
Due to space limitations and the smaller size of the DZA sample (30 sets), in this article we focus on the MZA data (56 sets).
No, you’re not dreaming—Thomas Bouchard just pulled a Fermat right before your eyes!
“I have a marvelous proof but the margin is too narrow” is a notoriously weak argument in mathematics. In experimental science, it is beyond egregious, since readers can’t even try to reconstruct the missing data by themselves.
The “space limitations” argument is highly implausible. First, because it would have taken essentially zero space to provide the one missing metric that could have bolstered, or tarnished, the credibility of their heritability estimate. Second, because this shouldn’t have been about adding something to the publication—it should have been about writing it using an entirely different approach.
Indeed, if the IQ correlation within MZA pairs was statistically significantly higher than the correlation within DZA pairs, then this silver bullet argument should have been put at the center of the article. By contrast, if no meaningful differential signal was found, then this essential information (that would have put the entire approach into question) shouldn’t have been withheld.
As for the “smaller size” argument, it isn’t convincing either. Statistical noise scales as the inverse square root of the sample size, so the difference between the two series isn’t massive: √56/√30 ≈ 1.37, and a 37% precision gap hardly justifies trusting one sample while throwing away the other.
It could be that the article was just poorly written, that Bouchard and his team had a silver bullet argument and “forgot” to use it. But who can seriously believe that?
Piecing everything together, a more credible scenario would be that Bouchard set out to conduct a proper study, then found out that there was no meaningful signal. That, in itself, wouldn’t have been surprising, as the study was statistically underpowered and exposed to multiple confounders. With 56 MZA pairs and 30 DZA pairs, both h^2 estimates were highly vulnerable to noise, and a few outliers here or there could have messed up the entire analysis.
Jay Joseph (whose blanket rejection of any role for genes does not align with my views) claims to have located the missing DZA data in subsequent publications by the Minnesota team, and reports that the two h^2 estimates did not differ at a statistically significant level—meaning that the study had failed to identify any genetic contribution. That is a very serious objection, and Bouchard’s evasive response is even more concerning (he reiterates the sample size excuse without ever addressing the specific claim about h^2.)
As an outsider with no quarrel or vested interest, I find it incomprehensible that such a fundamental question could remain unanswered 35 years after the study was published: was there any meaningful differential signal between the MZA and DZA pairs, or not?
To be clear: the issue isn’t to second-guess who did what, when, and why, but to clarify the actual content of the raw data. Pre-replication-crisis psychology was operating under an entirely different set of norms, and practices that we now identify as fraud used to be considered “fair game”. For example, it wouldn’t be that surprising if a little bit of “cleaning” had taken place between the primary data collection and the tabulated results, as this used to be common practice. Extracting compelling results from microscopic datasets is a delicate art and there are many documented instances where a well-intentioned human hand was present to “help nature.” Or, to put it bluntly: there is no way a non-preregistered observational study of 56 MZA pairs would make it to Science in 2025.
Bouchard notes that “The study of IQ is paradigmatic of human behavior genetic research.” I can only imagine how it would have felt to come out empty-handed after a ten-year effort to deliver a clear-cut answer. Had this happened to me, I have no idea how I would have reacted. I assume it would have been extremely tempting to reframe the story in a way that wasn’t technically false—but wasn’t entirely true either.
What infuriates me is the repackaging of five zombie datasets into a shiny visual that fools 3.5 million people, including smart people like Paul Graham. Out of curiosity and integrity, I accepted the challenge to my worldview. This sent me down a three-month rabbit hole to learn everything I could on the subject, only to discover that the whole thing was a mirage.
As I eventually found out, all this was well known to those with a strong interest in the topic. The study of “twins reared apart” is widely regarded as a scientific dead-end and behavioral geneticists have entirely moved on to new strategies.
A telltale sign is that no-one since Bouchard has tried to replicate his approach. Charles Murray, the co-author of The Bell Curve—hardly a blank-slatist—effectively pronounced a funeral eulogy for Bouchard’s vision in his 2020 book:
For many people, “twin studies” brings to mind the famous Minnesota Study of Twins Reared Apart. It got so much publicity because it produced so many dramatic examples of similarities in adults who had never met each other. When the separated twin brothers who inspired the study were reunited in their 30s, it was discovered that as children both had had a dog named Toy. As adults, both had been married twice, first to wives named Linda and then to wives named Betty…
…but the method’s potential was limited. Separation of identical twins at birth happens so seldom that large sample sizes are impossible. The range of environments in which separated twins are raised is narrow—adoption agencies don’t knowingly place infants with impoverished or dysfunctional parents. In contrast, it is not difficult to assemble large samples of twins who have been raised together.
In a nutshell, the current debate revolves around this:
The classic “twins reared together” protocol leads to heritability estimates in the 50-70% range. But everyone knows that the assumptions behind Falconer’s formula do not hold in real life, and there is no general consensus on how to fix this.
Direct genomic studies (GWAS) lead to much lower estimates (around 20% or below). But they are subject to their own data and modelling limitations.
Most people agree that twin studies overestimate heritability while genomic studies underestimate it, and the truth must lie somewhere in-between. People are working to resolve the gap with new techniques and meta-arguments. As far as I understand, the frontline seems to be stabilizing around the 30-50% range. Sasha Gusev argues for the lower end of that band, but not everyone agrees.
This is an intricate scientific debate that I enjoy watching from the sidelines, although my personal stake is now pretty low. As discussed, the nuance between 30% and 50% heritability is scientifically interesting but, at the individual level, I don’t find it very informative (especially since no current technology is capable of measuring one’s true genetic potential.)
I am not a behavioral geneticist and, paradoxically, this is precisely why I felt the need to write this detailed account of my thought process. Despite my solid background in mathematics and data analysis, it cost me serious effort to build a watertight debunk, and I thought it was worth sharing with the other 3,499,999.
As I emerged from the rabbit hole, this is what struck me—this wasn’t really about Cremieux’s slide, this was about eradicating the “twins separated at birth” trope that had infected me as a teen and distorted my expectations of what was scientifically plausible.
It is fundamentally hard to believe that such a simple and beautiful design could be so profoundly flawed. And yet, people have tried it over and over again, and failed over and over again.
From a public conversation perspective, there are three important takeaways:
1/ There is no “cutting and definitive” method for estimating the heritability of IQ. This is the saddest and most annoying part of the story. Bouchard’s dream is dead and we have nothing to replace it with. All current methods and estimates, from “twins reared apart” to GWAS and RDR, rely on statistical models whose underlying assumptions are eminently debatable. I have my own reading of the current debate, based on my own mathematical intuition of how these models behave under mild and moderate violations of their core assumptions, but it is very hard to communicate and, of course, there is no one-stop reference for the layperson.
For all practical purposes, this is Brandolini’s law on steroids. Every time my book is debated on social media, I get insulted by people who are firmly convinced that I am a lunatic, an ideologue, or a grifter, for asserting that one can get better at math. My interview with Quanta Magazine resulted in a flame war on Hacker News, where my supporters were called “flatearthers” and “secular creationists.” My favorite insult to date is one that I received on X: “you actually are half commercial person half prophet selling his gene pool experience as a standard worldview."
My sad conclusion, at this point, is that there is no short and simple way to respond to these people.
2/ Statistical nuances and caveats are very ineffective arguments. This I already knew, but the reminder was quite brutal. In the early stages of my journey, I was effectively battling a steelman version of Cremieux’s visual, since I had been really impressed by the data and was genuinely thinking to myself: “wow, this looks big!” The first sources I found did mention some “biases” in Bouchard’s study—but, come on, who cares about biases when the data speaks for itself?
As we learned in the schoolyard, nitpicking is for losers, and no one is going to listen to technical jargon in the face of glaring evidence.
This explains why this piece is so long. I don’t think one can effectively argue against the Jim Twins narrative. To fully understand what was going on, I had to build my own counterstory, with secondary characters and subplots, that transformed the abstract concept of “biases” into concrete details that were memorable and real.
3/ People overestimate the importance of 30%-50% heritability. This is the one optimistic conclusion, because it is easy to address. As the debate narrows down to the 30-50% band, some hereditarians are tempted to reframe this as a victory for their side, as if they had always been fighting against blank-slatists. But even if the final figure lands at the higher end of this band, that would still constitute a decisive victory for heredity centrists.
The question never was about whether or not genetic differences contribute to the spread of intellectual talent—they obviously do.
The question always was about the “interesting place” Paul Graham talked about, the meaningful space between genetic potential and actual achievement, and whether or not it really existed. And, at 30% or 50%, this place surely exists.
It is large, wild, and messy. Some regions are accessible to nurture and some are not. Some are entirely governed by chance. But it is open for mining, and there’s plenty of room for human agency.