GPT Guesses Between 1 and 100

Original link: https://github.com/exmergo/research-chatgpt-guesses-between-1-and-100

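This experiment examines whether a large language model (LLM), when asked to "pick a random number between 1 and 100", behaves like a uniform random number generator or inherits human bias. An analysis of 10,000 independent samples generated by GPT-4.1 at temperature 1.0 shows that the model's output is far from random.

Instead of a uniform distribution, the model reproduces distinctly human patterns, such as a preference for "random-looking" numbers like 37, 73, and the meme-flavored 42. It also shows a pronounced aversion to round numbers, which are conspicuously avoided in the statistics. Interestingly, although the model mirrors many human tendencies, it diverges on "crude" meme numbers such as 69, most likely because of safety guardrails, which suggests its output is a moderated reflection of human behavior rather than a raw copy.

The results confirm that LLMs cannot serve as a fair "die"; the distributions they produce are "lumpy", shaped by training data and safety filtering. The project provides a fully reproducible pipeline that lets users analyze the raw data or collect new samples to explore this bias further.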

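This Hacker News discussion centers on a GitHub project that analyzes how an AI model behaves when asked to guess a number between 1 and 100. The study found that the model does not show true randomness; its choices follow clear patterns and biases. Commenters found the non-uniform distribution interesting and noted that the specific ways the model departs from randomness can offer insight into its training data, for example the high frequency of 42 owing to cultural references such as The Hitchhiker's Guide to the Galaxy. Some users suggested this behavior could even serve as a "fingerprint" for identifying the underlying model. Others pointed out, with some irony, that LLMs are trained on human-generated data and so naturally inherit human cognitive biases about particular numbers. The discussion also touched on whether Benford's Law applies to these outputs, and highlighted a general weariness with the growing prevalence of AI-generated content.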

Original text
Exmergo Viz - I asked GPT to pick a random number between 1 and 100 (sample 10k)

An interesting thing about humans is that they are not good random number generators.
If you ask a person to "pick a random number between 1 and 100", they are remarkably predictable. Answers cluster on 37 and 73, on "messy" numbers, and on memes like 42 and 69, while round numbers are quietly avoided. A true random generator would instead produce a flat, uniform distribution.

This project asks gpt-4.1 the same question 10,000 times and characterizes the distribution it produces, measured against a uniform baseline. Does an LLM, which is trained on human text, behave like a fair die, or does it inherit the lumpy human pattern?

Full design and methodology: docs/LLM Random Bias Experiment SDD.md.

This experiment is an LLM-focused follow-up to two well-known explorations of human number-picking bias.

Full experimental design is in the SDD; the essentials:

  • Model. gpt-4.1 (OpenAI), called via the Responses API. It is a non-reasoning model. It emits a direct answer rather than deliberating; what we're measuring is its raw output distribution, not a reasoning strategy. The exact model string is recorded in every raw-CSV row (Model column) and in data/raw/run_metadata.json, so the dataset is self-describing.
  • Sample size. N = 10,000 independent calls — enough for a chi-square goodness-of-fit test and per-number proportions stable to ~±0.5 pp.
  • Sampling. temperature = 1.0, so the model exercises its full sampling distribution. This is the experiment: at low temperature it would just repeat one number.
  • Prompt. A fixed system prompt instructs the model to output only one integer between 1 and 100; the user prompt requests the number and carries a unique uuid4. (The UUID is request-tracing hygiene, not cache-busting — at temperature 1.0 every call should sample independently regardless.)
  • Baseline. The result is compared against a uniform distribution — what a fair generator would produce — not against human data (see Assumptions).
  • Pipeline. Four stages — collect → clean → transform → stats, detailed below. Cleaning validates every answer is an integer in [1, 100] and reports the rejection rate.
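The cleaning rule in the last bullet can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's `llm_random_bias.clean` module: the function name `clean_answers` and the plain-string input format are hypothetical.

```python
def clean_answers(raw_answers):
    """Keep answers that parse as an integer in [1, 100]; report the rejection rate.

    Hypothetical helper for illustration; the real clean stage may differ.
    """
    valid = []
    for text in raw_answers:
        try:
            value = int(text.strip())
        except ValueError:
            continue  # not an integer, e.g. "seven" or "37."
        if 1 <= value <= 100:
            valid.append(value)
    rejected = len(raw_answers) - len(valid)
    rejection_rate = rejected / len(raw_answers) if raw_answers else 0.0
    return valid, rejection_rate


# Made-up inputs: three valid answers, two out of range, one non-numeric.
valid, rate = clean_answers(["37", " 42\n", "0", "101", "seven", "73"])
print(valid, rate)
```

Reporting the rejection rate alongside the cleaned values makes it visible when the model ignores the "only one integer" instruction.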

Assumptions & Limitations

This is an illustrative probe, not a definitive study. Key caveats — see the SDD's Limitations section for the formal treatment:

  • Single model. Results describe gpt-4.1 only and do not generalize to other models or providers.
  • "Randomness" is a sampling artifact. The model is not a random number generator; it samples a learned token distribution. We characterize that distribution — we do not claim the model is trying to be random.
  • Prompt- and temperature-dependent. A different prompt wording or sampling temperature could shift the distribution. Both are fixed and documented.
  • Not "ChatGPT the product." This tests a model through the API at a fixed temperature — not the consumer ChatGPT app, which adds routing, tools, and a system prompt outside our control.

gpt-4.1 is emphatically not a uniform random generator. A chi-square goodness-of-fit test against a uniform distribution (N = 10,000, df = 99) returns χ² = 15,604, p ≈ 0: the deviation is so large that the p-value falls below any conventional significance threshold. Asked for a random number, the model produces a lumpy, distinctly human-shaped distribution.
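For readers unfamiliar with the test, the statistic is the sum over all 100 numbers of (observed − expected)² / expected, where expected is N / 100 under uniformity. The sketch below computes it with the standard library only. The function name `chi_square_uniform` is hypothetical and the sample data is synthetic, not the experiment's.

```python
from collections import Counter


def chi_square_uniform(samples, low=1, high=100):
    """Return (statistic, degrees_of_freedom) for a uniform goodness-of-fit test."""
    k = high - low + 1
    expected = len(samples) / k  # same expected count for every number
    counts = Counter(samples)
    statistic = sum(
        (counts.get(value, 0) - expected) ** 2 / expected
        for value in range(low, high + 1)
    )
    return statistic, k - 1


# Synthetic data for illustration only, not the experiment's results.
flat = list(range(1, 101)) * 100    # every number picked exactly 100 times
lumpy = [37] * 5000 + [42] * 5000   # only two numbers ever picked

print(chi_square_uniform(flat))     # statistic is 0.0 for a perfectly flat sample
print(chi_square_uniform(lumpy))
```

Under the uniform null hypothesis a chi-square statistic with 99 degrees of freedom has mean 99, which is why a value of 15,604 leaves no room for doubt. If SciPy is available, `scipy.stats.chisquare` returns the statistic together with the p-value.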

It reproduced the classic human spikes

| Number | Picked vs. uniform chance | Human reputation |
|---|---|---|
| 37 | 4.0× | "the most random number" |
| 42 | 4.0× | Hitchhiker's Guide meme |
| 73 | 3.4× | the other well-known spike |

The five most-picked numbers overall — 47, 57, 72, 37, 42 — lean heavily on numbers ending in 7 (three of the five), the same "number that feels random" pull seen in humans.

It avoids round numbers even harder than humans

All multiples of 10, except for 10 itself, were picked exactly 0 times in 10,000 calls. 10 was picked exactly once. Humans avoid round numbers — gpt-4.1 essentially refuses them.

One number breaks the human pattern. 69 is a meme number humans over-pick. gpt-4.1 under-picks it (0.29× expected: ~29 occurrences against ~100). The model inherited the "smart" meme (42) and not the crude one. Our hypothesis is that this is a product of safety guardrails during pre-training and post-training. It is the most interesting aspect in the dataset: the model's bias is not a raw copy of human bias but a moderated version of it.
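The "× expected" figures in this section are observed counts divided by the count a uniform generator would produce, which is 10,000 / 100 = 100 per number. A minimal sketch, with a hypothetical function name:

```python
def ratio_vs_uniform(count, n_samples=10_000, n_outcomes=100):
    """How many times more (or less) often a number was picked than uniform chance predicts."""
    expected = n_samples / n_outcomes  # 100 picks per number under uniformity
    return count / expected


print(ratio_vs_uniform(29))   # the ~29 occurrences quoted for 69
# 400 is back-computed from the rounded 4.0x figure, not a count read from the dataset.
print(ratio_vs_uniform(400))
```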

The hypothesis that the model inherits the human pattern holds. An LLM trained on human text, asked to be random, reproduces human random-number bias: the pull toward 37 and 73, the meme spike at 42, and the aversion to round numbers, with one exception (69) that is likely due to guardrails. The interactive distribution chart shows the full 1–100 shape.

All figures from data/processed/stats_summary.csv.

collect → clean → transform → stats. Each stage reads the previous stage's committed CSV, so any stage can be re-run on its own.

| Stage | Module | Output |
|---|---|---|
| Collect | llm_random_bias.collect | data/raw/chatgpt_random_results.csv |
| Clean | llm_random_bias.clean | data/processed/chatgpt_random_clean.csv |
| Transform | llm_random_bias.transform | data/processed/distribution.csv |
| Stats | llm_random_bias.stats | data/processed/stats_summary.csv |
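As a rough idea of what a transform step like this might do, the sketch below turns cleaned answers into one row per number. The function name and the `number`, `count`, and `proportion` column names are assumptions for illustration; the actual layout of `distribution.csv` is not described here.

```python
from collections import Counter


def build_distribution(answers, low=1, high=100):
    """Return one row per number in [low, high], including numbers never picked."""
    counts = Counter(answers)
    total = len(answers)
    return [
        {
            "number": value,
            "count": counts.get(value, 0),
            "proportion": counts.get(value, 0) / total if total else 0.0,
        }
        for value in range(low, high + 1)
    ]


# Tiny made-up input, not real data.
rows = build_distribution([37, 37, 42, 73])
print(len(rows))   # one row per number from 1 to 100
print(rows[36])    # the row for 37
```

Emitting explicit zero-count rows matters for this dataset, since numbers that are never picked (such as most multiples of 10) would otherwise vanish from a chart or a chi-square calculation.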

This project uses uv for everything.

Path 1 — Analysis only (free, no API key)

The raw dataset is committed to this repo, so you can reproduce the entire analysis without spending a cent:

uv run python -m llm_random_bias.clean
uv run python -m llm_random_bias.transform
uv run python -m llm_random_bias.stats

Path 2 — Fresh data collection (needs an OpenAI API key)

cp .env.example .env          # then edit .env and add your OPENAI_API_KEY
uv run python -m llm_random_bias.collect
# then run clean / transform / stats as in Path 1

Cost & runtime: ~10,000 short calls to gpt-4.1 cost roughly US$2 and finish in a few minutes at the default concurrency. The collector refuses to overwrite an existing raw CSV — delete it first to re-collect.
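One common way to get this "refuse to overwrite" behavior in Python is exclusive-creation mode (`"x"`), which raises `FileExistsError` instead of truncating an existing file. The sketch below is an assumption about how such a guard could look, not the collector's actual code; the helper name, file name, and `Answer` column are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def open_raw_csv_for_collection(path):
    """Open a new CSV for writing; raise FileExistsError if it already exists."""
    # Mode "x" is exclusive creation: it never truncates an existing file.
    return open(path, "x", newline="", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "raw_results.csv"  # hypothetical file name
    with open_raw_csv_for_collection(target) as handle:
        csv.writer(handle).writerow(["Model", "Answer"])  # "Answer" is an assumed column

    try:
        open_raw_csv_for_collection(target)
        overwritten = True
    except FileExistsError:
        overwritten = False  # second attempt is refused, existing data is safe

print(overwritten)
```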

The distribution bar chart is built in Exmergo Viz (our AI dashboard agent) directly from data/processed/distribution.csv. The fully interactive data viz can be viewed here.

uv run ruff check .
uv run ruff format .
uv run mypy src
uv run pytest

See CONTRIBUTING.md.

MIT — see LICENSE.
