The One Billion Row Challenge

Original link: https://www.morling.dev/blog/one-billion-row-challenge/

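Software engineer and blogger Gunnar Morling writes "Random Musings on All Things Software Engineering". In his latest post, "The One Billion Row Challenge", he challenges fellow coders to write the fastest possible Java program for retrieving temperature measurements from a text file with one billion rows. The task is to compute the minimum, mean, and maximum temperature recorded for each weather station in the file. Submissions are ranked by execution time; participants can expect recognition and inspiration, and the winner may even receive a unique t-shirt. To take part, clone the 1brc repository from GitHub, which contains a baseline implementation, and follow the README and the rules Gunnar has published. The challenge runs through January 2024 and is meant to explore how far modern Java can be pushed. Once the dust settles, we will see who comes out on top.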

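On memory mapping in 1BRC: whether memory-mapped files actually beat conventional input (reading into a buffer, then parsing as a separate step) is still debated. In theory, mapping a file avoids explicit read calls and the extra copy from the kernel's page cache into a user-space buffer, which should help both latency and throughput. It does not remove disk access itself, since pages still have to be loaded from storage when they are first touched. In practice the gains can be offset by page-fault and cache-miss costs, and by synchronous I/O stalls when the system reclaims pages under memory pressure, particularly for large datasets on commodity servers with limited physical memory and shared resources. Setting up the mappings and tuning virtual memory settings to manage that pressure adds further complexity, which makes the approach harder to implement well. Ultimately, the choice between memory mapping and conventional input depends on the requirements and constraints of the use case, including the nature and structure of the dataset and the characteristics of the target environment.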
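
To make the two input styles concrete, here is a minimal sketch that counts newline bytes in a file twice: once through a memory-mapped buffer and once through an explicit read buffer. The class and method names (`InputStyles`, `countLinesMapped`, `countLinesStreamed`) are made up for illustration and are not taken from any 1BRC entry. The mapped variant assumes a file under 2 GB, because a single `FileChannel.map` call cannot map more than `Integer.MAX_VALUE` bytes; a file of 1BRC size would need several mappings.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

final class InputStyles {

    // Counts '\n' bytes by mapping the file into memory and scanning the buffer.
    // Assumption: the file is smaller than 2 GB (limit of a single mapping).
    static long countLinesMapped(Path file) throws IOException {
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            MappedByteBuffer buffer =
                    channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            long lines = 0;
            while (buffer.hasRemaining()) {
                if (buffer.get() == '\n') {
                    lines++;
                }
            }
            return lines;
        }
    }

    // Counts '\n' bytes by copying the file through an explicit 64 KiB read buffer.
    static long countLinesStreamed(Path file) throws IOException {
        byte[] chunk = new byte[1 << 16];
        long lines = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(chunk)) != -1) {
                for (int i = 0; i < read; i++) {
                    if (chunk[i] == '\n') {
                        lines++;
                    }
                }
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Tiny sample file; real measurements would come from the 1brc repository.
        Path file = Files.createTempFile("measurements", ".txt");
        file.toFile().deleteOnExit();
        Files.writeString(file, "Hamburg;12.0\nBulawayo;8.9\nPalembang;38.8\n");
        System.out.println("mapped:   " + countLinesMapped(file));
        System.out.println("streamed: " + countLinesStreamed(file));
    }
}
```

Both methods return the same count; which one is faster depends on file size, available memory, and the state of the page cache, so it has to be measured on the target machine.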

Original text

Update Jan 4: Wow, this thing really took off! 1BRC is discussed at a couple of places on the internet, including Hacker News, lobste.rs, and Reddit.

Thanks a lot for all the submissions, this is going way beyond what I’d have expected! I am behind a bit with evaluations due to the sheer amount of entries, I will work through them bit by bit. I have also made a few clarifications to the rules of the challenge; please make sure to read them before submitting any entries.

Let’s kick off 2024 true coder style: I’m excited to announce the One Billion Row Challenge (1BRC), running from Jan 1 until Jan 31.

Your mission, should you decide to accept it, is deceptively simple: write a Java program for retrieving temperature measurement values from a text file and calculating the min, mean, and max temperature per weather station. There’s just one caveat: the file has 1,000,000,000 rows!

The text file has a simple structure with one measurement value per row:

Hamburg;12.0
Bulawayo;8.9
Palembang;38.8
St. John's;15.2
Cracow;12.6
...

The program should print out the min, mean, and max values per station, alphabetically ordered like so:

{Abha=5.0/18.0/27.4, Abidjan=15.7/26.0/34.1, Abéché=12.1/29.4/35.6, Accra=14.7/26.4/33.1, Addis Ababa=2.1/16.0/24.3, Adelaide=4.1/17.3/29.7, ...}
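
As a rough illustration of the task, the following single-threaded sketch parses `station;value` rows, keeps running statistics per station, and prints them in the format shown above. It is not the baseline from the 1brc repository; the names `BaselineSketch`, `Stats`, and `summarize` are hypothetical. Rounding to one decimal is left to `String.format`, which is an assumption and may differ from the rounding the challenge rules prescribe.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class BaselineSketch {

    // Running statistics for one weather station.
    static final class Stats {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        double sum;
        long count;

        void add(double value) {
            min = Math.min(min, value);
            max = Math.max(max, value);
            sum += value;
            count++;
        }

        // Renders min/mean/max with one decimal place.
        @Override
        public String toString() {
            return String.format(Locale.ROOT, "%.1f/%.1f/%.1f", min, sum / count, max);
        }
    }

    // Aggregates rows of the form "station;value" and returns "{A=min/mean/max, B=...}".
    // A TreeMap keeps the stations sorted by name.
    static String summarize(Iterable<String> rows) {
        Map<String, Stats> byStation = new TreeMap<>();
        for (String row : rows) {
            int separator = row.indexOf(';');
            String station = row.substring(0, separator);
            double value = Double.parseDouble(row.substring(separator + 1));
            byStation.computeIfAbsent(station, name -> new Stats()).add(value);
        }
        return byStation.toString();
    }

    public static void main(String[] args) {
        // The five sample rows from the post.
        List<String> rows = List.of(
                "Hamburg;12.0", "Bulawayo;8.9", "Palembang;38.8",
                "St. John's;15.2", "Cracow;12.6");
        System.out.println(summarize(rows));
    }
}
```

For a real file the rows would be streamed rather than held in a list, for example via `Files.lines`; the aggregation logic stays the same.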

The goal of the 1BRC challenge is to create the fastest implementation for this task, and while doing so, explore the benefits of modern Java and find out how far you can push this platform. So grab all your (virtual) threads, reach out to the Vector API and SIMD, optimize your GC, leverage AOT compilation, or pull any other trick you can think of.
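
One of the simplest ways to put several threads to work is a parallel stream whose per-thread partial results are merged by the collector. The sketch below shows that idea with JDK classes only; it is meant as a starting point, not as a description of what competitive entries do, and the names `ParallelSketch` and `aggregate` are made up for this example.

```java
import java.util.DoubleSummaryStatistics;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class ParallelSketch {

    // Groups "station;value" rows by station on multiple threads.
    // Each thread builds partial statistics; the collector merges them at the end.
    static Map<String, DoubleSummaryStatistics> aggregate(List<String> rows) {
        return rows.parallelStream()
                .collect(Collectors.groupingBy(
                        row -> row.substring(0, row.indexOf(';')),
                        TreeMap::new,
                        Collectors.summarizingDouble(
                                row -> Double.parseDouble(row.substring(row.indexOf(';') + 1)))));
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
                "Hamburg;12.0", "Bulawayo;8.9", "Hamburg;14.0", "Cracow;12.6");
        aggregate(rows).forEach((station, stats) ->
                System.out.println(station + " min=" + stats.getMin()
                        + " mean=" + stats.getAverage()
                        + " max=" + stats.getMax()));
    }
}
```

The same collector works on `Files.lines(path).parallel()`, although string-based parsing like this leaves plenty of room for the lower-level tricks the post mentions.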

There are a few simple rules of engagement for 1BRC (see here for more details):

  • Any submission must be written in Java

  • Any Java distribution available through SDKMan as well as early access builds from openjdk.net may be used, including EA builds for OpenJDK projects like Valhalla

  • No external dependencies may be used

To enter the challenge, clone the 1brc repository from GitHub and follow the instructions in the README file. There is a very basic implementation of the task which you can use as a baseline for comparisons and to make sure that your own implementation emits the correct result. Once you’re satisfied with your work, open a pull request against the upstream repo to submit your implementation to the challenge.

All submissions will be evaluated by running the program on a Hetzner Cloud CCX33 instance (8 dedicated vCPU, 32 GB RAM). The time program is used for measuring execution times, i.e. end-to-end times are measured. Each contender will be run five times in a row. The slowest and the fastest runs are discarded. The mean value of the remaining three runs is the result for that contender and will be added to the leaderboard. If you have any questions or would like to discuss any potential 1BRC optimization techniques, please join the discussion in the GitHub repo.
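
The scoring rule described above amounts to a trimmed mean over five runs. A small sketch of the arithmetic, with a hypothetical `ScoreSketch.score` helper and made-up timings rather than real leaderboard data:

```java
import java.util.Arrays;

final class ScoreSketch {

    // Mean of the middle three of five run times,
    // after discarding the fastest and the slowest run.
    static double score(double[] runSeconds) {
        if (runSeconds.length != 5) {
            throw new IllegalArgumentException("expected exactly five runs");
        }
        double[] sorted = runSeconds.clone();
        Arrays.sort(sorted);
        return (sorted[1] + sorted[2] + sorted[3]) / 3.0;
    }

    public static void main(String[] args) {
        // Hypothetical timings in seconds: 1.0 and 9.0 are discarded, leaving 2.0, 3.0, 4.0.
        System.out.println(score(new double[] {4.0, 1.0, 2.0, 3.0, 9.0}));
    }
}
```

Dropping the extremes keeps a single unusually slow or fast run, for instance one affected by a cold cache, from skewing the result.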

As for a prize, by entering this challenge, you may learn something new, get to inspire others, and take pride in seeing your name listed in the scoreboard above. Rumor has it that the winner may receive a unique 1️⃣🐝🏎️ t-shirt, too.

So don’t wait, join this challenge, and find out how fast Java can be. I’m really curious what the community will come up with for this one. Happy 2024, coder style!
