Scientific datasets are riddled with copy-paste errors

Original link: https://www.sciencedetective.org/scientific-datasets-are-riddled-with-copy-paste-errors/

Data Integrity Issues in Parkinson's and Other Research

Software designed to detect data anomalies, inspired by earlier cases of scientific misconduct, has uncovered troubling problems in multiple published datasets, including a landmark 2016 paper proposing that Parkinson's disease originates in the gut. This highly cited study (over 3,000 citations) contains duplicated data points across different experimental groups, which, given the small sample sizes, may affect its conclusions. The duplicated values account for a large share of the affected samples (up to 50%). The software also flagged problems in studies of animal toxin resistance and fish behavior. In one case, fish length data were scrambled so that one fish's measurements were assigned to other fish. The authors acknowledged a file-joining error, and a corrected analysis left the findings largely unchanged. Another study came under scrutiny over near-duplicate data, which the lead author attributed to instrument limitations and measurement variation, though that explanation remains disputed. So far, 18 concerning cases have been found in the first 600 datasets scanned, suggesting an error rate of about 3% - likely an underestimate. These findings highlight a critical gap in scientific oversight, since institutions and journals currently do not prioritize routine data validation. Now fully funded, the project aims to scan the remaining 24,000 datasets on Dryad, a public data repository.

A Hacker News discussion highlighted how widespread copy-paste errors are in scientific datasets (linking to sciencedetective.org). One commenter who works on improving scientific data quality explained that the problem is not a lack of diligence among scientists, but the complexity of scientific workflows, which are often bespoke, and the difficulty of implementing reliable quality assurance. He compared the problem to naive tech startup ideas: it *looks* easy to solve, but the reality is extremely complex. Scientists vary widely in technical skill, and maintaining data correctness across many contributors and datasets is "brutal." He described a biodiversity data validation tool that revealed widespread problems caused not by tool failures but by the complexity of the data specifications themselves, along with surprisingly lax validation in public repositories. While the errors are often small, they represent a significant and systemic problem in scientific data management.

Original article

The above data comes from a landmark paper in Parkinson's Disease research, which provided the first-ever evidence that the disease originates in the gut rather than the brain. The paper received media coverage from major outlets and has amassed over 3000 citations from other scientific papers. But the underlying data contains sequences of duplicated values that should belong to completely different individual mice. The dataset has been publicly available on Dryad - an open-access repository where scientists upload their raw data - for more than 8 years. Why didn't anyone notice the blatant copy-paste errors until now?

Before going into more detail about this case, let me give some background on how we detected the issue. It was flagged by a piece of software I started building last year, inspired by two cases of data fabrication that made the news in recent years: one from Nobel laureate Thomas Südhof's lab, and one from spider ecologist Jonathan Pruitt. Both cases had publicly available datasets with entire blocks of copy-pasted data that seemed quite trivial to detect. I was curious what I could dig up by creating a program that would correctly flag those cases and then unleashing it on all datasets available in open-access repositories.
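The core detection idea can be sketched in a few lines. This is a minimal illustration, not the actual Science Detective code; the window length, function name, and toy values are all invented:

```python
from collections import defaultdict

def find_duplicated_runs(values, run_len=3):
    """Find runs of `run_len` consecutive values that occur at more than
    one non-overlapping position in a column - a copy-paste signature."""
    positions = defaultdict(list)
    for i in range(len(values) - run_len + 1):
        window = tuple(values[i:i + run_len])
        positions[window].append(i)
    # Keep only windows that recur, with occurrences far enough apart
    # that they can't just be one run counted twice.
    return {w: idx for w, idx in positions.items()
            if len(idx) > 1 and idx[-1] - idx[0] >= run_len}

# Toy column: the first three measurements were pasted over rows 7-9.
col = [4.1, 3.7, 5.2, 2.0, 6.6, 9.9, 4.1, 3.7, 5.2, 1.1]
print(find_duplicated_runs(col))  # {(4.1, 3.7, 5.2): [0, 6]}
```

The real tool presumably needs more care around legitimately repeated values (integer counts, categorical codes), which is where the false-positive filtering lives.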

Together with a few volunteer contributors, we've finished reporting all cases from the first 600 datasets we've scanned. There were 18 cases we felt were serious enough to raise concerns. Here are 3 of the most exciting ones:

We took mice that were genetically predisposed to developing symptoms of Parkinson's and we just cleared out their microbiome - all their symptoms went away.

That's how the paper's senior author Sarkis Mazmanian summarized the findings of the study when he went on the Rich Roll Youtube channel:

These are the results he is referring to:

Source: Figure 4

Graphs B, C, and D contain measurements of the motor function of different groups of mice. They tell a clean story where the ASO mice (genetically predisposed to developing the mouse model Parkinson's disease) take longer to complete the tasks than regular wild-type (WT) mice only when they have retained a normal gut microbiome (SPF, Ex-GF) - not in mice stripped of their microbiome (GF, Abx).

Diagram showing which results are affected by each block of duplicate values.

There are two issues with the data:

  1. There are two sets of 5 identical sequential numbers in the "Adhesive removal times" column, shared between the SPF mice (brought up with a normal, healthy microbiome) and the ExGF mice (raised with a stripped microbiome, but then had their gut re-colonized).
  2. There's a pair of sequences of 3 identical numbers within "Pole-descent time" data for the germ-free wild-type mice.

Verdict

It could be either a fat-finger mistake when editing the Excel file or deliberate tampering to cover up real data that didn't tell the right story.

The study has low sample size, so the impact on the conclusions of the paper is serious, even if we trust the rest of the data. The duplicated rows make up 50% of the SPF samples and 42% of the ExGF samples. While the motor function values are not the only data the authors rely on for their conclusions, they're necessary for showing that the gut bacteria actually caused Parkinson's-like symptoms.

The issue was reported in January, and the authors have so far not responded.

In this paper, the authors investigated how different animals have evolved resistance to a family of toxins called cardiotonic steroids (CTS). There's an arms race in nature where prey species produce CTS to protect themselves against animals such as birds and snakes, who in turn have developed mutations to help protect themselves against the toxin.

The authors created a blend of Ouabain (a CTS) and Na,K-ATPase (the protein that the poison targets) to get a dose-response curve showing how well different versions of the protein survived the poison at various doses.
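For context, dose-response data like this is typically summarized with a four-parameter logistic (Hill) curve to extract the IC50. A minimal sketch with invented parameter values; the paper's actual fitting procedure may differ:

```python
# Standard four-parameter logistic (Hill) model commonly used for
# IC50 dose-response curves; all parameter values here are invented.
def response(dose, bottom=0.0, top=1.0, ic50=1e-6, hill=1.0):
    """Fraction of enzyme activity remaining at a given inhibitor dose."""
    return bottom + (top - bottom) / (1 + (dose / ic50) ** hill)

print(response(1e-6))  # 0.5: by definition, activity is halved at the IC50
print(response(0.0))   # 1.0: full activity with no inhibitor
```

A resistant protein variant shows up as a larger fitted `ic50`, i.e. the curve shifted toward higher doses.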

Issues

Source: NKA_Enzyme_Assays_Raw_Data.xlsx, sheet: "IC50"

The yellow cells are precise duplicates between the Ostrich/Sandgrouse data and the Xenodon data (a species of snake). More concerning are the orange cells, which are near-duplicates that differ by one or two digits, always ending with the same digit. For example, 0.538 on row 39 becomes 0.518 on row 71. There are six such pairs out of eight total non-duplicate pairs.
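A rough sketch of how such near-duplicates could be flagged (the thresholds and function name are illustrative, not the actual detection code):

```python
def near_duplicate(a, b, places=3):
    """Compare two values digit-by-digit at fixed precision; flag pairs
    that differ in at most 2 digits but share the final digit."""
    da = f"{a:.{places}f}".replace(".", "")
    db = f"{b:.{places}f}".replace(".", "")
    if len(da) != len(db):
        return False
    diffs = [i for i in range(len(da)) if da[i] != db[i]]
    return 1 <= len(diffs) <= 2 and da[-1] == db[-1]

print(near_duplicate(0.538, 0.518))  # True: one digit changed, last digit kept
print(near_duplicate(0.538, 0.537))  # False: the final digit differs
```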

In her response, the lead author Shabnam Mohammadi concedes that this probably is a copy-paste error. For the suspicious orange cells she gives the following explanation:

We suspect that they [the cells with one-digit tweaks marked in orange] stem from measurement variation among multiple reads of the same plate, and that the Ostrich and Xenodon results were copied from different readings of the same plate. However, since the original reader outputs are not available, we cannot confirm that this is the source of the variation.

Verdict

Could the one-digit tweaks be caused by multiple readings of the same plate as the author theorizes? My best guess is no, but there is enough uncertainty to give me pause.

When doing multiple readings of the same plate we would generally expect many of the values to jump around a tiny bit. Instead we see almost all of the values staying the exact same except 8 that change by a lot. In email correspondence, the author provides a potential explanation for this: before taking the readings they add a chemical to stop the reaction, which should stabilize the values. But some wells will stabilize faster than others, which could explain why those 8 values still varied by so much.

That still leaves the issue of the 6 pairs of values that each happen to end with the same digit. It would be supremely unlikely for this to happen purely by chance.
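To put a rough number on "supremely unlikely", under the simplifying assumption that the last digits of independent readings are uniform and independent:

```python
from math import comb

# Two independent readings share their final digit with probability 1/10.
# Probability that at least 6 of the 8 changed pairs match by chance
# (binomial tail):
p = sum(comb(8, k) * 0.1**k * 0.9**(8 - k) for k in range(6, 9))
print(f"{p:.2e}")  # 2.34e-05, roughly 1 in 43,000
```

Real instrument noise isn't perfectly uniform, so treat this as an order-of-magnitude estimate rather than a precise p-value.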

The BioRad model 680 is a basic machine that records raw light measurements as it sees them without any post-processing, but perhaps it's still possible that it introduces some bias that makes it more likely for readings to end on the same digit.

The alternative explanation is that the authors either accidentally or deliberately copy-pasted the Ostrich/Sandgrouse data on top of the Xenodon data. Then they might have manually tampered with the data to make it better suit their hypothesis or make it look less suspicious, not realizing that the copy-and-paste would leave behind this signature. However, the fraud theory suffers from there not being any obvious reason for why the authors would tamper with these specific values.

The authors state that they will try to replicate the process and report what they find on Pubpeer.

See author's response here.

Background

This is a paper from 2017 about fish personalities. They took a bunch of genetically identical fish, then checked how much their swimming behavior varied in terms of distance traveled. The idea was to see how much their behavior differed while keeping genetics and environment identical. Besides movement, they also took the size of each fish into account to check that any difference in movement was not just a result of bigger fish moving differently from smaller fish.

Issues

The SL column is the length of the individual fish measured in millimeters. Can you spot anything that looks out of place?

It looks quite strange that 3 different individual fish (ID 5, 8 and 10) all have the exact same sequential size measurements. But wait, can an individual fish really get >10% shorter over the course of a few days? Did they actually measure the length of every individual fish 4 times? No.

Verdict

The real explanation becomes obvious if we order the sheet by the SL column. It reveals that every unique SL value reoccurs exactly four times. 4 also happens to be the number of observations that were done per individual fish.

The same spreadsheet sorted by SL (fish length in mm)

The reason is that the authors measured the length of each fish only once. So all 4 rows of observations of the same fish are supposed to share the exact same fish length, but the values have been scrambled so that the length of one fish has been given to other fish.

In his response, the first author admits to the error and gives the following explanation:

[I]n short, we stored measured body size and behavior in two separate data files and when we joined these files, we accidentally misaligned the ID values resulting in a phase shift where all body size values were shifted and assigned to the wrong rows

And I don't see any reason to doubt him.
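The kind of misalignment he describes can be reproduced in a few lines. This is a hypothetical reconstruction, not the authors' actual files; every ID and value is invented:

```python
from collections import Counter

# Sizes and behavior lived in separate files; a positional join with the
# size rows in a shifted order puts every SL value on the wrong fish.
fish_ids = [1, 2, 3, 4]
sl       = [23.1, 25.4, 21.8, 24.0]    # one length per fish, in mm
shift = 1
shifted_sl = sl[shift:] + sl[:shift]   # the "phase shift" in the quote

# Each of a fish's 4 behavioral rows inherits the shifted length:
rows = [(fid, shifted_sl[i]) for i, fid in enumerate(fish_ids)
        for _ in range(4)]

# The diagnostic from the sorted sheet: every unique SL still occurs
# exactly 4 times (once per observation), which is what exposed the error.
print(all(c == 4 for c in Counter(s for _, s in rows).values()))  # True
print(rows[0])  # (1, 25.4): fish 1 now carries fish 2's length
```

Joining on the shared fish ID instead of row position would have avoided the problem entirely.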

In the paper they unsurprisingly concluded that body size didn't have any effect on distance traveled, while in their corrected analysis they do find a small effect:

Body size now is a significant predictor of behavior but its effects are quite small (i.e. a 1mm increase in body leads to a 0.8cm decrease in total distance swam over the whole trial) and in fact roughly 3x smaller than the effect of observation. More importantly though, body size still does not explain the among-individual variation in behavior.

It looks like the authors did a solid job of owning up to their mistake and correcting the dataset. Luckily, body size turned out to only explain a small part of the behavioral differences between the fish. "The conclusions remain unaffected" is one thing you should never take at face value coming from a researcher, but in this case it appears to be right.

The software detected 18 cases in the first 600 datasets that were serious enough to raise concerns. So far, we have posted 15 of these to PubPeer (see the full list here). Most of them are probably innocent mistakes; some may be deliberate fabrication. One of them had already been retracted. So based on that limited sample, around 3% of papers contain these types of errors.

However, the true error rate must surely be a lot higher than 3%. There are myriad other ways of accidentally screwing up your data that this software could never detect. And if you want to commit fraud, there are plenty of less lazy ways to do it than to copy-paste values from a different part of the same Excel sheet.

It might come as a surprise to some that nobody else has cared to check for these errors. Isn't there some peer review thing that is supposed to prevent stuff like this from making it into the scientific record? The conclusion I've come to is that there just isn't anybody whose job it is to actively look for it. Journals, universities and funding organizations won't hire anyone to do it because they generally care a lot more about rankings and metrics. And if you report a serious error to them you've just introduced an annoying inconvenience that might make those numbers go in the wrong direction.

The only bright spot has been Dryad - the research data repository. They have taken an active role in supporting the project by helping push journals and authors to correct the record.

I started working on the project last year, and after some good initial results, I was able to raise a $50,000 grant from Astral Codex Ten (a popular science blog), which allowed me to quit my job and start working on it full-time this year. Next up is to scan through the rest of the ~24,000 datasets with Excel files available on Dryad. It will be interesting to see what other treasures we can dig up there! If the 3% error rate holds, we'd expect to see ~700 more cases in that sample alone. Feel free to subscribe to the email newsletter to stay up to date on what we find.

Updated on March 8: Removed unwarranted confidence in the verdict of case 2 and added some additional detail.

Updated on March 23: Clarified that 15 of the 18 cases had been publicly posted on Pubpeer.

Added on April 11th.

Shabnam Mohammadi, the first author of the paper, responded in the comments section. I'm adding it here together with my response.

Mohammadi's response

As an author of the study featured as the second case and the one who performed the experiments in question, I'd like to provide some additional context.
As noted, the Model 680 plate reader is a 30-year-old legacy plate reader that requires manual translation or copy-pasting of results. Englund's claim that the Model 680 "records raw light measurements as it sees them without any post-processing" is incorrect. The instrument has a resolution of 0.001 OD and a specified reproducibility of 1.0% or 0.005 OD across the 0.000-2.000 OD range (and 1.5% from 2.000-3.000 OD). It converts analog intensity signals to absorbance values using Beer's law, rounding results to the nearest 0.001 OD. Consequently, values fluctuating within a narrow range may round in ways that produce apparent digit-level patterns with some digits varying systematically while others appearing unchanged across wells. For a developing colorimetric assay, a mix of unchanged, moderately changed, and markedly different well values between reads is therefore entirely plausible, reflecting both the underlying chemistry of the assay and the instrument's optical and processing characteristics.
Further, we have updated our PubPeer response to include a reanalysis that excludes the duplicated data. Our conclusion is that removing the problematic data actually strengthens the paper's main conclusions rather than undermining them.

My response

Englund's claim that the Model 680 "records raw light measurements as it sees them without any post-processing" is incorrect. [...] It converts analog intensity signals to absorbance values using Beer's law, rounding results to the nearest 0.001 OD.

Ok, that was incorrect on my part and shows my lack of knowledge about photometers. I was attempting to paraphrase an email from Bio-rad where they said: “The system [Bio-rad 680] was very basic and recorded OD as seen, it did not do any onboard manipulation of the data, it gave raw results for the user to interpret.”

Consequently, values fluctuating within a narrow range may round in ways that produce apparent digit-level patterns with some digits varying systematically while others appearing unchanged across wells.

I don’t see how that follows as a consequence at all. The formula used by plate readers is from what I understand just a log10-transformation of the raw analog signals. How could that possibly produce the pattern where the last digit (the thousandth place) remains the same across readings?
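To make the point concrete with a toy calculation (the intensity values are invented, not instrument data):

```python
import math

# Absorbance is OD = -log10(I / I0), rounded to 0.001 by the instrument.
# A small shift in measured intensity moves the whole value, and the
# thousandths digit moves with it; nothing in the log formula pins the
# last digit in place across readings.
I0 = 10000
for I in (2900, 2895, 2890):  # slightly different raw intensities
    print(I, round(-math.log10(I / I0), 3))
# 2900 0.538
# 2895 0.538
# 2890 0.539  <- the last digit simply tracks the value
```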

Here's an example of what I would expect multiple readings of the same plate to look like:

These images are of three consecutive readings of the same plate using the BioTek EL800, which like the Bio-rad 680 is an old filter-based microplate reader. It's not identical to the Bio-rad model 680, but from what I gather it's similar. Like all absorbance plate readers, it uses the same log10-transformation to get the final values. The test was done on a chemically stable dye called PNP at different concentrations. The experiment shows minor differences between readings of 0.001-0.002 for many of the wells which is caused by noise. Full data here.

The test was run by Dr Gert Folkers at Utrecht University (not affiliated with Science detective). He also did the same test on two more modern plate readers, which counter-intuitively have larger variance across readings than the old BioTek reader. Shown below is data from the same test done on the BMG clariostar:

The tests from all plate readers show many changes in OD values between readings, and none of them show a tendency to preserve the last digit. Have you - or anyone from the scientific community reading this - actually observed “digit-level patterns with some digits varying systematically while others appearing unchanged” when working with the Bio-rad 680 or other plate-readers? If so, please show some data that demonstrates this.

I'm not saying these data are proof of manipulation. The BioTek is not an identical machine to the BioRad. Also, the fact that the modern reader has much higher variance (possibly due to "plate background correction" according to Dr Folkers) is a good reminder that these machines can produce unexpected patterns. But I've yet to see evidence that the pattern where the last digit remains the same across readings could be caused by the log formula used by plate readers.
