He asked AI to count carbs 27000 times. It couldn't give the same answer twice

原始链接: https://www.diabettech.com/i-asked-ai-to-count-my-carbs-27000-times-it-couldnt-give-me-the-same-answer-twice/

## AI Carb Counting: A Reliability Crisis for Diabetes Management

A recent study shows that leading AI models (GPT-5.4, Claude and Google Gemini) produce significantly and dangerously inconsistent carbohydrate estimates when analysing food photographs — a core feature of many diabetes management apps. Using a standard prompt, the researcher submitted 13 food images to each model, querying each image more than 500 times, and found large variation in carbohydrate counts even for the *same* photo and model.

These differences are not trivial: estimates differed by tens of grams, enough to cause dangerous insulin overdosing or underdosing — one model's estimates for a paella photo spanned a range of up to 429 grams. Claude was the most consistent, but all models showed inaccuracies, and their reported "confidence" scores had a worrying lack of correlation with actual accuracy.

The study highlights two main risks: **systematic bias** (consistent overestimation, which can lead to hypoglycaemia) and **stochastic variability** (unpredictable outliers that pose acute danger). The findings strongly suggest that current AI carb-counting tools are *not adequate* for unsupervised insulin dosing, and underline the need for multiple queries to gauge uncertainty and for careful verification of the identified food items. The study supports recent warnings against using general-purpose LLMs for autonomous insulin calculations.

## AI and Carb Counting: A False Promise

Recent experiments show that AI is markedly inconsistent at estimating carbohydrate content from food images. One author queried AI models 27,000 times with the same food photos and received widely varying carbohydrate counts, highlighting inherent limitations of current large language model technology.

The core problem is that a single image lacks the information needed for accurate nutritional analysis — hidden ingredients and different cooking methods can drastically change the carbohydrate count. Commenters stress that LLMs are not "magic oracles" and struggle with tasks that are difficult even for humans.

Despite these limitations, apps like Cal AI (recently acquired by MyFitnessPal) are gaining traction, helped by misleading marketing that paints AI as a solution for everyday tasks. There is concern about relying on these tools for health decisions, especially in critical applications such as automated insulin delivery, where inaccurate estimates can be dangerous. The discussion underscores the need for better AI education and realistic expectations of its capabilities.

Original article

Ask ChatGPT to estimate the carbs in your lunch. Now ask it again. And again. Five hundred times.

You’d expect the same answer each time. It’s the same photo, the same model, the same question. But you won’t get the same answer. Not even close — and the differences are large enough to cause a hypoglycaemic emergency.

That’s the central finding of a study I’ve just published as a preprint, and it has direct implications for anyone using AI-powered carb counting in a diabetes app.

The study

I submitted 13 food photographs — real meals, photographed on a phone, the way you’d actually use them — to four leading AI models: OpenAI GPT-5.4, Anthropic Claude Sonnet 4.6, Google Gemini 2.5 Pro and Google Gemini 3.1 Pro Preview. Each photo was sent over 500 times to each model. Same prompt every time. Same photo. Same settings.

26,904 queries in total. All at the lowest randomness setting these models offer.

The prompt was adapted from the one used in the iAPS open-source automated insulin delivery system — it’s a real production prompt, not a toy example.
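To make the setup concrete, here is a minimal sketch of a repeated-query harness of this kind, assuming the OpenAI Python client. The model name is a placeholder, the prompt would be the iAPS text from S1, and the study's actual code (available in the repository) will differ.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def query_carbs(image_path: str, prompt: str, n: int = 500) -> list[str]:
    """Send the same photo and prompt n times at minimum randomness."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    answers = []
    for _ in range(n):
        resp = client.chat.completions.create(
            model="gpt-4o",   # placeholder model name, not the one in the study
            temperature=0,    # lowest randomness setting the API offers
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
                ],
            }],
        )
        answers.append(resp.choices[0].message.content)
    return answers
```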

The models disagree with themselves

Every model returned different carbohydrate estimates for the same photo across repeated queries. But the degree of disagreement varies enormously.

How much does each model disagree with itself?

Each dot is one of the 13 test images. The violin shape shows the spread. Claude’s variation clusters below 5% for most images; the Gemini models regularly exceed 10-20%.

| Model | Median variation (CV) | Median insulin swing | Worst-case insulin swing |
|---|---|---|---|
| Claude Sonnet 4.6 | 2.4% | 0.9 U | 13.6 U |
| GPT-5.4 | 8.4% | 2.3 U | 16.6 U |
| Gemini 3.1 Pro | 10.3% | 2.9 U | 16.2 U |
| Gemini 2.5 Pro | 11.0% | 4.7 U | 42.9 U |
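As a rough guide to how the table's two headline numbers arise (a sketch; the preprint defines the exact statistics), the coefficient of variation and the insulin swing for one photo's repeated estimates come out of a few lines:

```python
import statistics

def variation_and_swing(estimates_g: list[float], icr: float = 10.0) -> tuple[float, float]:
    """CV (%) across repeated carb estimates, and the worst-case insulin
    swing in units, assuming a 1:10 insulin-to-carb ratio (icr=10 g/U)."""
    cv = 100 * statistics.stdev(estimates_g) / statistics.mean(estimates_g)
    swing_u = (max(estimates_g) - min(estimates_g)) / icr
    return cv, swing_u

# The paella case below: estimates spanning 55g to 484g give
# (484 - 55) / 10 = 42.9 units of insulin swing.
```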

The worst case? The paella photo. Here’s what happened when I sent it to each model 500+ times:

One photo of paella, 2000+ answers

Every dot is one query. Same photo. Same prompt. Same model. Gemini 2.5 Pro’s estimates span from 55g to 484g — a 429g range, equivalent to 42.9 units of insulin at a 1:10 ICR. Claude’s estimates cluster tightly by comparison.

42.9 units of insulin from a single photo. That’s not a rounding error. That’s a potential fatality.

This variation is invisible to you

When you take a photo in a diabetes app, you get one number back. You have absolutely no way to know whether you received a typical estimate or a tail-end outlier from a distribution you can’t see. For Claude, that single number is probably close to the model’s consensus. For Gemini 2.5 Pro, you could be anywhere on the map.

The cheese sandwich that defeats AI

A cheese sandwich on a plate

Here’s one that should be easy. Two slices of thick white bread (carbs on the packet: 20g per slice) plus cheddar cheese (negligible carbs). Reference value: 40g. Simple, unambiguous, packet-label accuracy.

The cheese sandwich problem

Left: Three models independently converge on ~28g — consistently wrong by 12g. Right: GPT-5.4 estimates ~74g with high variability — wrong in the other direction by 34g. The red dashed line is the actual value.

Three of four models — Claude, Gemini 2.5 Pro and Gemini 3.1 Pro — independently converge on approximately 28g for a 40g meal. 510 queries from Claude, CV of 0.3%, and every single one is 12g below the actual value. The bread is right there in the photo. The carb value is on the packet.

This is the “precisely wrong” problem: high consistency doesn’t guarantee accuracy. A diabetes app user getting 28g every time would consistently underdose by ~1.2 units.

GPT-5.4 goes the other way: mean estimate 74g, nearly double the reference, and highly variable on top of it.

The models don’t always know what they’re looking at

I found food identification errors in 8 of the 13 test images:

  • Bakewell tart: Claude called it a “Linzer torte” in 100% of 510 queries. GPT-5.4 called it a “jam tart” or “cake bar.” Only Gemini 3.1 Pro correctly named it (99.8%).
  • Crema catalana: Three of four models called it “creme brulee” 100% of the time. Only Gemini 3.1 Pro got “crema catalana” — in 3.4% of queries.
  • Cheese sandwich: Gemini 3.1 Pro added non-existent “deli meat” in 17.4% of queries — hallucinating an ingredient that isn’t there. This could directly inflate carbohydrate estimates.

Some of these misidentifications have modest nutritional impact. Others could change the carbohydrate estimate substantially.

Where does your insulin dose actually land?

On the five images where I had the strongest reference values (packet labels and weighed portions), here’s how often each model’s individual queries would have pushed insulin doses into clinically dangerous territory:

Insulin dosing risk by model

Green is safe (<1U error). Yellow is moderate (1-2U). Orange is clinically significant (2-5U). Red is severe hypo risk (>5U). Claude is the only model with no queries in the orange or red zones.

Claude: 100% of queries in the safe or moderate zone. No single query would have caused more than a 2-unit insulin error.

GPT-5.4: 37% of queries would cause a clinically significant insulin error (>2U). That’s more than one in three queries landing in the danger zone.

Gemini 3.1 Pro Preview: 12% of queries would cause a clinically significant insulin error (>2U). Better than GPT-5.4.

Gemini 2.5 Pro: 12% of queries would cause a >5U error — the threshold associated with severe hypoglycaemia requiring third-party assistance.
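The zones translate to a simple per-query classification of dose error. Here is a sketch, assuming the dose error is the absolute carb error divided by a 1:10 ICR:

```python
def dose_error_zone(estimate_g: float, reference_g: float, icr: float = 10.0) -> str:
    """Map one query's carb estimate to the risk zones used above."""
    error_u = abs(estimate_g - reference_g) / icr
    if error_u < 1:
        return "green: safe (<1U)"
    if error_u <= 2:
        return "yellow: moderate (1-2U)"
    if error_u <= 5:
        return "orange: clinically significant (2-5U)"
    return "red: severe hypo risk (>5U)"
```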

Two types of risk

The study identifies two distinct failure modes:

Systematic bias (chronic risk). All four models overestimate carbs on average, meaning the dominant direction of error is toward too much insulin and hypoglycaemia. GPT-5.4 averages +1.2 units overdose per meal on strong-reference foods. Three meals a day, that’s 3.6 units of extra insulin per day.

Stochastic variability (acute risk). The within-image variation means a single unlucky query could produce a catastrophic outlier. Gemini 2.5 Pro’s worst single query on strong-reference data would have caused an 11.3 unit insulin overdose for a 34g meal. That’s a potential severe hypo.
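In code terms, the two failure modes fall out of the same set of repeated estimates in different ways. A sketch, again assuming a 1:10 ICR:

```python
import statistics

def failure_modes(estimates_g: list[float], reference_g: float, icr: float = 10.0) -> tuple[float, float]:
    """Chronic risk: mean dose error (positive = overdose on average).
    Acute risk: the single worst query's dose error."""
    bias_u = (statistics.mean(estimates_g) - reference_g) / icr
    worst_u = max(abs(e - reference_g) for e in estimates_g) / icr
    return bias_u, worst_u
```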

“But the AI said it was confident”

The prompt I used asks each model to return a confidence score (0 to 1) for every food item it identifies. All four models dutifully returned confidence scores for 100% of items. Surely we can use those to filter out bad estimates?

No. We can’t.

| Model | Mean confidence | Correlation with actual accuracy |
|---|---|---|
| Claude Sonnet 4.6 | 0.80 | r = -0.01 (zero correlation) |
| GPT-5.4 | 0.78 | r = -0.17 (weak) |
| Gemini 3.1 Pro | 0.91 | r = -0.11 (weak) |
| Gemini 2.5 Pro | 0.91 | r = -0.14 (weak) |

Claude’s confidence has literally zero correlation with whether it’s right or wrong. It reports ~0.80 confidence whether it’s nailing the bakery cookie (MAE 1.9g) or getting the cheese sandwich wrong by 12g. Worse: when Claude reports high confidence (above 0.85), its estimates are actually less accurate (MAE 17.3g) than when it reports lower confidence (MAE 9.1g). The confidence score is worse than useless — it’s actively misleading.
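That comparison is easy to run yourself if you log per-item confidences and errors. A sketch of the threshold split, with the 0.85 cut-off taken from the paragraph above:

```python
import statistics

def mae_by_confidence(items: list[tuple[float, float]], cutoff: float = 0.85) -> tuple[float, float]:
    """items are (confidence, absolute_carb_error_g) pairs.
    A calibrated model should show lower MAE in the high-confidence group."""
    high = [err for conf, err in items if conf > cutoff]
    low = [err for conf, err in items if conf <= cutoff]
    return statistics.mean(high), statistics.mean(low)
```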

Gemini 2.5 Pro reports confidence above 0.9 for 86% of all food items, and Gemini 3.1 Pro for 76%, regardless of whether the estimate is anywhere near correct. That’s not calibrated uncertainty. That’s a model saying “I’m very sure” about everything.

There is one faint signal: Claude’s mean confidence does vary slightly by image — the churros (its most misidentified food) get 0.65, while the crema catalana gets 0.92. But the range is so narrow and the calibration so poor that no diabetes app could meaningfully use these scores to protect users.

The bottom line: the only reliable uncertainty signal comes from querying multiple times and observing the spread. The model’s own confidence score is not a safety mechanism.
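Operationally, that spread check is trivial to implement. A sketch, where the 10% CV threshold is an illustrative choice rather than a number from the study:

```python
import statistics

def is_uncertain(estimates_g: list[float], cv_threshold: float = 10.0) -> bool:
    """Query the same photo several times; flag the meal as uncertain
    when the coefficient of variation exceeds the (assumed) threshold."""
    cv = 100 * statistics.stdev(estimates_g) / statistics.mean(estimates_g)
    return cv > cv_threshold

# Five repeated queries with one outlier:
# is_uncertain([42.0, 45.0, 40.0, 88.0, 43.0])  -> True
```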

What this means for you

The DTN-UK stated earlier this year that generic LLMs must never be used as autonomous advisory calculators for insulin delivery. This data is the quantitative evidence base for that statement.

If you’re using AI carb counting in a diabetes app:

  1. Don’t trust it blindly. No model tested is safe for unsupervised insulin dosing. Not even Claude.

  2. Ask it more than once. Query 3-5 times and look at the spread. If the answers vary wildly, the model is uncertain — even if it doesn’t tell you that.

  3. Check what it thinks it sees. If the model identifies “chicken with stuffing” for your grilled fish, you want to know that BEFORE it calculates your carbs.

  4. Know your model. The four-fold difference in consistency is real. Claude Sonnet 4.6 is the safest single-query option today, but it can still be precisely wrong.

  5. An AI that gives you the same wrong answer 500 times is not more trustworthy than one that varies. Consistency is necessary but not sufficient.

The preprint

The full paper — Reproducibility and accuracy of large language model vision APIs for carbohydrate estimation from food photographs — is available as a preprint PDF here and is being submitted to Diabetologia for peer review. The complete dataset (26,904 query results), analysis code, and test images are available in the study repository (access on request).

Supplementary data:

  • S1: Full prompt text used for all queries
  • S2: Distributional statistics for all 52 model-image combinations
  • S3: Per-image accuracy breakdown
  • S4: Food identification analysis


