UK Biobank health data keeps ending up on GitHub

Original link: https://biobank.rocher.lc

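This webpage analyzes the public DMCA takedown notices that UK Biobank has filed with GitHub. Drawing on the github/dmca repository, it identifies repositories that UK Biobank flagged as possibly containing participant data in breach of its data access agreements. The analysis focuses on notices whose filename contains "uk-biobank" or whose text mentions "UK Biobank". It extracts the filing date and the GitHub repository URLs, then tries to determine the geographic location of the developers associated with those repositories. This is done by parsing the location field of GitHub profiles and, where that is unavailable, by analyzing email domains. Of the 170 developers identified, only 75 had location data that allowed a country to be determined. The project is refreshed regularly and reports *only* what appears in the public DMCA notices; it makes no judgment about the content of the flagged repositories.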

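## UK Biobank data keeps leaking on GitHub

A researcher tracking Digital Millennium Copyright Act (DMCA) notices has found that UK Biobank health data repeatedly appears on GitHub. To date, 110 notices have targeted 197 repositories belonging to 170 developers worldwide. This persistent problem highlights the major data governance challenges the biobank faces; recent reports even indicate that members' information was offered for sale on Alibaba.

The takedown notices mostly target *specific* files (a quarter are genetic data, and a large share contain potentially sensitive tabular health records), which suggests a strategy of defining the copyright infringement narrowly. The researcher argues that DMCA notices are a last resort against unidentified users who may have accessed the data without authorization.

A related editorial published in *BMJ* elaborates on these concerns and calls for stronger data security measures.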

Original text

To build this webpage, I used data from the github/dmca repository, where GitHub publishes the full text of every DMCA takedown notice it receives. When a rights holder asks GitHub to remove content that infringes their copyright, the notice is posted publicly as a Markdown file in this repository. According to The Guardian, UK Biobank has used this process to request the removal of files or repositories that contain (or that it believes contain) participant data covered by its data access agreements.

To identify UK Biobank-related notices, I match filenames containing the slug "uk-biobank" (the convention GitHub uses when naming notice files). Just in case, I also search the full text of every other notice file for the phrases "UK Biobank" or "UKBiobank" (case-insensitive) to catch notices filed under different slugs, such as those submitted on behalf of UK Biobank. From each matching notice, I extract the filing date (parsed from the filename, which follows GitHub's YYYY-MM-DD-slug.md convention) and all GitHub repository URLs mentioned in the notice body. URLs pointing to GitHub's own infrastructure (e.g. github.com/contact or github.com/site) are excluded.
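The matching and extraction steps described above can be sketched roughly as follows. This is a minimal illustration, not the author's actual collection script: the function names (`is_ukb_notice`, `filing_date`, `repo_urls`), the regular expressions, and the choice to treat `contact` and `site` as excluded path prefixes are all assumptions made for the example.

```python
import re
from datetime import date

# Filename convention described in the text: YYYY-MM-DD-slug.md
FILENAME_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-(.+)\.md$")
# Matches "UK Biobank" and "UKBiobank", case-insensitively.
PHRASE_RE = re.compile(r"uk ?biobank", re.IGNORECASE)
# Assumed pattern for github.com/<owner>/<repo> URLs in a notice body.
REPO_URL_RE = re.compile(r"https?://github\.com/([A-Za-z0-9-]+)/([A-Za-z0-9._-]+)")
# GitHub's own pages (e.g. github.com/contact, github.com/site) are skipped.
EXCLUDED_OWNERS = {"contact", "site"}


def is_ukb_notice(filename: str, text: str) -> bool:
    """True if the filename carries the slug or the body mentions UK Biobank."""
    return "uk-biobank" in filename.lower() or PHRASE_RE.search(text) is not None


def filing_date(filename: str) -> date:
    """Parse the filing date from the leading YYYY-MM-DD of the filename."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected notice filename: {filename!r}")
    year, month, day = (int(match.group(i)) for i in (1, 2, 3))
    return date(year, month, day)


def repo_urls(text: str) -> list[str]:
    """Return unique repository URLs in order of first appearance."""
    found: list[str] = []
    for owner, repo in REPO_URL_RE.findall(text):
        repo = repo.rstrip(".")  # drop sentence-ending punctuation
        if not repo or owner.lower() in EXCLUDED_OWNERS:
            continue
        url = f"https://github.com/{owner}/{repo}"
        if url not in found:
            found.append(url)
    return found
```

In a full run, these helpers would be applied to every Markdown file in a local checkout of github/dmca; the unique owners taken from `repo_urls` would then form the list of usernames to look up.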

For each unique GitHub username found in the notices, I query the GitHub REST API (GET /users/{username}) to retrieve the user's public profile, specifically the self-reported location field. This is a free-text string that users enter voluntarily. It may be a city, a country, a university name, or left blank entirely. Deleted accounts return a 404 and are not included further.
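A profile lookup of this kind might look like the sketch below, using only the Python standard library. The function name `fetch_location`, the injectable `opener` parameter (there so the function can be exercised without network access), and the `User-Agent` string are assumptions for illustration; the source does not say how the author's script issues the request.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def fetch_location(username, opener=urllib.request.urlopen):
    """Return the self-reported location for a GitHub user.

    Returns None when the account no longer exists (HTTP 404) and an
    empty string when the profile exists but the location is blank.
    """
    request = urllib.request.Request(
        f"{API_ROOT}/users/{urllib.parse.quote(username)}",
        headers={
            "Accept": "application/vnd.github+json",
            "User-Agent": "dmca-notice-sketch",  # hypothetical client name
        },
    )
    try:
        with opener(request) as response:
            profile = json.load(response)
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None  # deleted account: excluded from further steps
        raise
    # The location field is free text and may be null.
    location = profile.get("location")
    return location.strip() if location else ""
```

Unauthenticated requests to the GitHub API are rate-limited, so a real run over all usernames in the dataset would likely need an access token passed in an `Authorization` header.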

I derive countries from the raw location strings by hand. When a user's GitHub profile does not include a location, I instead try to infer their country from the rest of their GitHub profile and from associated email address domains. This process is inherently imperfect: some locations are ambiguous (e.g. "Cambridge" could refer to the UK or the US), and many users do not provide any location at all. Of the 170 unique developers in the dataset, only 75 (about 44%) have a location that could be resolved to a country.

The data is regularly refreshed by re-running the collection script against the latest state of the github/dmca repository. This page does not make any claims about the content of the targeted repositories, including whether they contained actual participant data, derived datasets, analysis code, or just documentation. It reports only what is visible in the public DMCA notices filed by UK Biobank.
