Research team digitizes more than 100 years of Canadian infectious disease data

Original link: https://news.mcmaster.ca/mcmaster-research-team-digitizes-more-than-100-years-of-canadian-infectious-disease-data/

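After 25 years of work, researchers at McMaster University have created CANDID, the Canadian Notifiable Disease Incidence Dataset, a comprehensive database of more than one million infectious disease records dating back to 1903. Professor David Earn first found the source material, decades of handwritten reports, in a neglected storage area at the Ministry of Health, after overcoming initial resistance to his request for historical data. The dataset includes weekly, monthly, and quarterly case counts for diseases such as poliomyelitis, measles, and tuberculosis, covering all Canadian provinces and territories. This “beautiful dataset” lets researchers analyze past outbreaks, model disease spread, and understand long-term trends. At present, public access to Canadian infectious disease data is limited, with only annual national statistics published. Earn argues that, with patient privacy kept a priority, more data sharing is essential to improving outbreak preparedness and response. CANDID is now publicly available, giving epidemiologists a valuable resource for learning from the past and strengthening future public health strategy.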

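A research team digitized more than a century of Canadian infectious disease data, prompting a discussion on Hacker News. The project is associated with McMaster University and CanMod and makes a large body of historical data available for analysis. A key question was whether useful tools could be built from the dataset; one user cited medical pricing datasets as a possible example. Replies ranged from a joking warning about data security (“Do you want computer viruses?”) to serious discussion of the database itself (linked in the comments). One commenter expressed a need for similar data in China and mentioned global health concerns. The initiative highlights the value of open data for public health research and potential preventive measures.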

Original text

Twenty-five years ago, in a neglected storage area at the Ontario Ministry of Health, David Earn happened upon epidemiological gold: two boxes of hand-written documents accounting for 50 years of weekly infectious disease incidence reports, spanning 1939-1989.  

The buried treasure was exactly the sort of thing that the McMaster University professor hoped to unearth during his visit — historical public health data that could help contextualize current and future infectious disease outbreaks.  

“Initially, the Ministry said that they couldn’t provide the data — that they didn’t have the time to search through their archives for us,” recalls Earn, a professor in McMaster’s Department of Mathematics and Statistics. “So, I offered to come to Toronto and look through their files myself, if they would let me. I basically begged, insisting on the value of the historical records, and I wouldn’t let it go. Eventually, I guess I became too much of a nuisance and they relented.”   

David Earn works at a computer, with a stack of files beside him, one clearly labeled '1903 to 1939 monthly communicable disease incidence, Ontario.'

The documents uncovered that day catalyzed a massive retrospective research project that has culminated in a complete, province-by-province inventory of Canadian infectious disease records.    

The result, published today in PLOS Global Public Health, is what Earn describes as a “genuinely beautiful dataset” that strings together more than 100 years of historical epidemiological information.

Altogether, the new database — the Canadian Notifiable Disease Incidence Dataset, or “CANDID” — contains more than a million infectious disease incidence counts that date back as far as 1903.   

The dataset, which is now publicly accessible, captures weekly, monthly, and quarterly case numbers for diseases like poliomyelitis, hepatitis, tuberculosis, whooping cough, influenza, rubella, mumps, measles, and many others, and tracks their spread in each province and territory across time.   

A collage of historical disease records.

“Data like these reveal the speed and shape of outbreaks and recurrent epidemics of the past, and allow us to test models that predict patterns of spread,” Earn says. “This new dataset can be leveraged to understand the ecology and evolution of infectious disease across Canada’s history, and to help us prepare for emerging and re-emerging diseases in the future.”  

In fact, Earn’s team has already used the database to better understand the spatial and temporal incidence of polio and whooping cough across several decades of Canadian history.  

While the new study was 25 years in the making, Earn says it really accelerated in 2021, when a large pandemic-related NSERC network grant allowed him to recruit Steven Walker, a former McMaster postdoctoral fellow, to his team.  

Walker, who re-joined McMaster as a data scientist in Earn’s group, was tasked with curating, cleaning, and harmonizing the troves of data that Earn and his associates had previously unearthed from libraries, public health offices, and provincial and federal agencies based all across Canada.  

“We would start with scans of handwritten or typewritten documents and manually transcribe them into Microsoft Excel to ensure that we had functional replicas of every original document,” Walker explains. “But the replicas aren’t conducive to data analysis, due to inconsistent formatting, so we’ve also been developing flexible data structures that are more convenient for analysis and discovery.” 
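The reshaping step Walker describes can be illustrated with a minimal sketch. It assumes a transcribed sheet with one row per reporting period and one column per disease, and converts it into one record per province, period, and disease. The column names, the `wide_to_long` helper, and the sample counts are all hypothetical and made up for illustration; they are not taken from CANDID or from the team's actual tooling.

```python
from typing import Dict, Iterable, List, Sequence

# Hypothetical identifier columns; the real CANDID schema may differ.
ID_COLUMNS = ("province", "period_start", "period_end")


def wide_to_long(
    rows: Iterable[Dict[str, str]],
    id_columns: Sequence[str] = ID_COLUMNS,
) -> List[Dict[str, object]]:
    """Turn one-column-per-disease rows into one record per disease count."""
    records: List[Dict[str, object]] = []
    for row in rows:
        ids = {col: row[col] for col in id_columns}
        for column, value in row.items():
            if column in id_columns:
                continue
            text = str(value).strip()
            if text == "":
                # A blank cell means "not reported"; skip it rather than
                # guessing that the count was zero.
                continue
            records.append(
                {**ids, "disease": column.strip().lower(), "cases": int(text)}
            )
    return records


# Invented sample rows, for illustration only.
example = [
    {"province": "Ontario", "period_start": "1950-01-01",
     "period_end": "1950-01-07", "Measles": "12", "Poliomyelitis": ""},
    {"province": "Ontario", "period_start": "1950-01-08",
     "period_end": "1950-01-14", "Measles": "15", "Poliomyelitis": "1"},
]

for record in wide_to_long(example):
    print(record)
```

Storing explicit period start and end dates in each record is one way to let weekly, monthly, and quarterly counts share a single table, though the published dataset may handle this differently.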

Earn, a member of the Michael G. DeGroote Institute for Infectious Disease Research, hopes that the new dataset — and the herculean efforts to assemble it — will help spur important changes to Canada’s current infectious disease reporting standards, noting that the public release of infectious disease data is arguably worse now than it was at any point during the 20th century, including the pre-digital era.   

David Earn stands at a table in a living room, looking over multiple papers and files spread on the table.

In fact, today, the Public Health Agency of Canada issues only annual, nationally aggregated incidence counts — not weekly or regional information — which limits opportunity for important studies into epidemic patterns, seasonal effects, and geographic variation.  

Earn says that the reduced resolution in today’s data is due in large part to patient privacy protection — a critically important consideration, but one that Earn believes can be maintained even with increased sharing of useful data.  

“It is extremely important to protect patient privacy, and our federal, provincial, and territorial agencies have developed protocols for data release that aim to ensure privacy is protected,” he says. “But there is no individual-level information in aggregate counts of infectious disease cases, and no identifying information can be extracted from these data. I think that current data release protocols should be thoughtfully and carefully reconsidered, so that they still prioritize privacy, but also allow for the release of more useful information, which could help us to prepare for future outbreaks — to the benefit of all Canadians.” 

In the meantime, Earn’s group encourages epidemiologists in Canada and elsewhere to use CANDID to study the patterns of disease incidence, to learn from historical surveillance efforts, and to strengthen public health preparedness.
