100M-Row Challenge with PHP

Original link: https://github.com/tempestphp/100-million-row-challenge

## Summary of the 100-Million-Row PHP Challenge

A PHP coding challenge is underway, asking participants to parse a dataset of 100 million page visits (in CSV format) into a structured JSON file. The challenge runs from February 24 until **March 15, 2026 (11:59 PM CET)**.

Participants fork the provided repository, implement their parsing solution in `app/Parser.php`, and submit their work via pull request. Solutions are validated locally with the provided tooling (`composer install`, `php tempest data:generate`, `php tempest data:validate`). The JSON output must be grouped by URL path and sorted by date.

Submissions are benchmarked on a dedicated server (an Intel Digital Ocean Droplet with 2 vCPUs and 1.5 GB RAM) with a specific set of PHP extensions enabled. The three fastest *original* solutions win prizes sponsored by PhpStorm and Tideways, including licenses for their products.

Results are tracked in `leaderboard.csv`. Manual verification and single-submission benchmark runs ensure fair comparison. Participants are encouraged to tag @brendt or @xHeaven for support or to check on benchmark status.

## The 100-Million-Row PHP Performance Challenge

Developer brentroose launched a performance challenge for the PHP community after successfully reducing a script's runtime from five days to under 30 seconds with the community's help. The challenge asks participants to parse 100 million rows of data as efficiently as possible using PHP.

The competition runs for two weeks and is intended as a fun, collaborative learning experience. Prizes will be awarded to the top performers, including the coveted PhpStorm Elephpant.

The challenge is hosted on GitHub (github.com/tempestphp), and all PHP developers interested in pushing performance limits are encouraged to take part.
Original text

Important

The 100-million-row challenge is now live. You have until March 15, 11:59PM CET to submit your entry!

Welcome to the 100-million-row challenge in PHP! Your goal is to parse a data set of page visits into a JSON file. This repository contains all you need to get started locally. Submitting an entry is as easy as sending a pull request to this repository. This competition will run for two weeks: from Feb 24 to March 15, 2026. When it's done, the top three fastest solutions will win a prize!

To submit a solution, you'll have to fork this repository, and clone it locally. Once done, install the project dependencies and generate a dataset for local development:

composer install
php tempest data:generate

By default, the data:generate command generates a dataset of 1,000,000 visits; the real benchmark will use 100,000,000 visits. You can adjust the number of visits by running, for example, php tempest data:generate 100_000_000.

Also, the generator uses a seeded randomizer so that, for local development, you work on the same dataset as everyone else. You can overwrite the seed with the data:generate --seed=123456 parameter, or pass data:generate --no-seed for an unseeded random dataset. The real dataset was generated without a seed and is secret.

Next, implement your solution in app/Parser.php:

final class Parser
{
    public function parse(string $inputPath, string $outputPath): void
    {
        throw new Exception('TODO');
    }
}
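As a starting point, here is a deliberately naive sketch of what `parse` could look like, assuming each CSV line has the shape `url,ISO-8601 timestamp` as in the example further down. This is a correctness-first illustration, not a competitive solution; an actual entry would optimize I/O, string handling, and memory use far more aggressively:

```php
<?php

// Naive reference sketch: one pass over the file, counting visits
// per URL path and per day. Assumes well-formed "url,timestamp" lines.
final class Parser
{
    public function parse(string $inputPath, string $outputPath): void
    {
        $visits = [];

        $handle = fopen($inputPath, 'r');

        while (($line = fgets($handle)) !== false) {
            $comma = strrpos($line, ',');

            if ($comma === false) {
                continue; // skip malformed lines
            }

            // Keep only the URL path, dropping scheme and host.
            $path = parse_url(substr($line, 0, $comma), PHP_URL_PATH);

            // The day is the first 10 characters of the ISO-8601 timestamp.
            $date = substr($line, $comma + 1, 10);

            $visits[$path][$date] = ($visits[$path][$date] ?? 0) + 1;
        }

        fclose($handle);

        // Sort each path's visits by date, ascending.
        foreach ($visits as &$days) {
            ksort($days);
        }
        unset($days);

        file_put_contents($outputPath, json_encode($visits, JSON_PRETTY_PRINT));
    }
}
```

Note that json_encode escapes forward slashes by default, which matches the escaped `\/` keys in the expected output shown below.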

You can always run your implementation to check your work.

Furthermore, you can validate whether your output file is formatted correctly by running the data:validate command. This command will run on a small dataset with a predetermined expected output. If validation succeeds, you can be sure you implemented a working solution:

php tempest data:validate

You'll be parsing millions of CSV lines into a JSON file, with the following rules in mind:

  • Each entry in the generated JSON file should be a key-value pair with the page's URL path as the key and an array with the number of visits per day as the value.
  • Visits should be sorted by date in ascending order.
  • The output should be encoded as a pretty JSON string.
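One observation worth making about the rules above: the timestamps are fixed-width ISO 8601 strings, so (at least for the sample data shown in this document) the day can be sliced out directly instead of going through strtotime() or DateTimeImmutable, both of which are considerably more expensive per line:

```php
<?php

// ISO-8601 timestamps put "YYYY-MM-DD" in the first 10 characters,
// so the day can be extracted with a single substr() call.
$timestamp = '2026-01-24T01:16:58+00:00';

$day = substr($timestamp, 0, 10); // '2026-01-24'

// A useful side effect: lexicographic order of these day strings
// equals chronological order, so a plain ksort() on the day keys
// satisfies the ascending-date requirement.
```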

As an example, take the following input:

https://stitcher.io/blog/11-million-rows-in-seconds,2026-01-24T01:16:58+00:00
https://stitcher.io/blog/php-enums,2024-01-24T01:16:58+00:00
https://stitcher.io/blog/11-million-rows-in-seconds,2026-01-24T01:12:11+00:00
https://stitcher.io/blog/11-million-rows-in-seconds,2025-01-24T01:15:20+00:00

Your parser should store the following output in $outputPath as a JSON file:

{
    "\/blog\/11-million-rows-in-seconds": {
        "2025-01-24": 1,
        "2026-01-24": 2
    },
    "\/blog\/php-enums": {
        "2024-01-24": 1
    }
}
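The escaped forward slashes (`\/`) in the expected output above suggest that plain json_encode with the JSON_PRETTY_PRINT flag produces the right format, since escaping slashes is json_encode's default behaviour:

```php
<?php

// json_encode escapes "/" as "\/" by default; JSON_PRETTY_PRINT adds
// the four-space indentation seen in the expected output.
$data = [
    '/blog/php-enums' => ['2024-01-24' => 1],
];

echo json_encode($data, JSON_PRETTY_PRINT);
// {
//     "\/blog\/php-enums": {
//         "2024-01-24": 1
//     }
// }
```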

Send a pull request to this repository with your solution. The title of your pull request should simply be your GitHub username. If your solution validates, we'll run it on the benchmark server and store your time in leaderboard.csv. You can continue to improve your solution, but keep in mind that benchmarks are manually triggered, and you might need to wait a while before your results are published.

A note on copying other branches

You might be tempted to look for inspiration from other competitors. While we have no means of preventing you from doing that, we will remove submissions that have clearly been copied from other submissions. We validate each submission by hand up front and ask you to come up with an original solution of your own.

Prizes are sponsored by PhpStorm and Tideways. The winners will be determined based on the fastest entries submitted; if two equally fast entries are registered, the time of submission will be taken into account.

All entries must be submitted before March 16, 2026 (so you have until March 15, 11:59PM CET to submit). Any entries submitted after the cutoff date won't be taken into account.

First place will get:

  • One PhpStorm Elephpant
  • One Tideways Elephpant
  • One-year JetBrains all-products pack license
  • Three-month JetBrains AI Ultimate license
  • One-year Tideways Team license

Second place will get:

  • One PhpStorm Elephpant
  • One Tideways Elephpant
  • One-year JetBrains all-products pack license
  • Three-month JetBrains AI Ultimate license

Third place will get:

  • One PhpStorm Elephpant
  • One Tideways Elephpant
  • One-year JetBrains all-products pack license

Where can I see the results?

The benchmark results of each run are stored in leaderboard.csv.

What kind of server is used for the benchmark?

The benchmark runs on a Premium Intel Digital Ocean Droplet with 2 vCPUs and 1.5 GB of available memory. We deliberately chose not to use a more powerful server because we like to test in a somewhat "standard" environment for PHP. These PHP extensions are available:

bcmath, calendar, Core, ctype, curl, date, dom, exif, fileinfo, filter, ftp, gd, gettext, gmp, hash, iconv, igbinary, imagick, imap, intl, json, lexbor, libxml, mbstring, memcached, msgpack, mysqli, mysqlnd, openssl, pcntl, pcre, PDO, pdo_mysql, pdo_pgsql, pdo_sqlite, pgsql, Phar, posix, random, readline, redis, Reflection, session, shmop, SimpleXML, soap, sockets, sodium, SPL, sqlite3, standard, sysvmsg, sysvsem, sysvshm, tokenizer, uri, xml, xmlreader, xmlwriter, xsl, Zend OPcache, zip, zlib

How to ensure fair results?

Each submission will be manually verified before its benchmark is run on the benchmark server. We'll also only ever run one single submission at a time to prevent any bias in the results. Additionally, we'll use a consistent, dedicated server to run benchmarks on to ensure that the results are comparable.

If needed, multiple runs will be performed for the top submissions, and their average will be compared.

Finally, everyone is asked to respect other participants' entries. You can look at others for inspiration (simply because there's no way we can prevent that from happening), but straight-up copying other entries is prohibited. We'll try our best to watch over this. If you run into any issues, feel free to tag @brendt or @xHeaven in the PR comments.

This challenge was inspired by the 1 billion row challenge in Java. The reason we're using only 100 million rows is that this version has a lot more complexity than the Java version (date parsing, JSON encoding, array sorting).

While testing this challenge, the JIT didn't seem to offer any significant performance boost, and on occasion it caused segfaults. This led to the decision to disable the JIT for this challenge.

The point of this challenge is to push PHP to its limits. That's why you're not allowed to use FFI.

How long should I wait for benchmark results to come in?

We manually verify each submission before running it on the benchmark server. Depending on our availability, this may mean some waiting time. If we haven't gotten to your submission within 24 hours, feel free to ping @brendt or @xHeaven in a comment to make sure we don't forget you.
