An overengineered solution to `sort | uniq -c` with 25x throughput (hist)

Original link: https://github.com/noamteyssier/hist-rs

## Hist: A Fast Unique Line Counter

`hist` is a command-line tool designed to efficiently count the unique lines in a file or on standard input. It replicates the functionality of `cat | sort | uniq -c | sort -n` with significantly better performance.

Key features include options to exclude or include lines matching a pattern (`-e`, `-i`), filter by line abundance (`-m`, `-M`), and control output ordering (`-n`, `-d`). It can also simply emit the unique lines without counting (`-u`).

A benchmark on a 100M-line FASTQ file shows the speed advantage: `hist` finishes in about 200 ms, ahead of `cuniq` (~434 ms), `huniq` (~2375 ms), `sortuniq` (~2593 ms), and a naive implementation (~5409 ms). This makes `hist` a useful tool for efficiently analyzing large datasets.
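
These options can be combined in a single invocation. The sketch below is illustrative only: `app.log` is a hypothetical file name, and combining the flags this way is an assumption based on the documented options rather than an example from the project.

```bash
# Keep only lines matching "ERROR", drop entries seen fewer than 5 times,
# and print the histogram in descending order of abundance.
hist app.log -i ERROR -m 5 -d
```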

A high-throughput CLI to count unique lines.

This is a standalone tool with equivalent functionality to `cat <file> | sort | uniq -c | sort -n`.
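
For instance, on a hypothetical log file (`access.log` is an assumed example name, not from the original), the two forms below should produce the same counts of unique lines:

```bash
# Classic pipeline: sort every line, collapse duplicates with counts,
# then sort numerically by count.
sort access.log | uniq -c | sort -n

# Equivalent single invocation with hist.
hist access.log
```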

```bash
# count unique lines in a file
hist <file>

# count unique lines from stdin
/bin/cat <file> | hist

# skip counting and just write unique lines
hist <file> -u

# exclude lines matching a pattern while counting
hist <file> -e <pattern>

# include lines matching a pattern while counting
hist <file> -i <pattern>

# only output lines with abundance greater than or equal to a threshold
hist <file> -m <threshold>

# only output lines with abundance less than or equal to a threshold
hist <file> -M <threshold>

# sort output by the key (default: by abundance)
hist <file> -n

# sort output in descending order (default: ascending)
hist <file> -d
```

I use `nucgen` to generate a random 100M-line FASTQ file and pipe it into different tools to compare their throughput with `hyperfine`.

I am measuring the performance of functionality equivalent to `cat <file> | sort | uniq -c | sort -n`.
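
A minimal sketch of how such a comparison can be run with `hyperfine` is shown below. This is an assumed invocation, not the author's exact benchmark script; `reads.fastq` stands in for the generated 100M-line file.

```bash
# hyperfine runs each command string through a shell, so pipes work as written.
hyperfine --warmup 1 \
  'hist < reads.fastq' \
  'sort reads.fastq | uniq -c | sort -n'
```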

Tools compared:

| Command    | Mean [ms]     | Min [ms] | Max [ms] | Relative     |
|------------|---------------|----------|----------|--------------|
| `hist`     | 200.3 ± 3.3   | 195.6    | 208.7    | 1.00         |
| `cuniq`    | 434.3 ± 6.6   | 424.7    | 442.9    | 2.17 ± 0.05  |
| `huniq`    | 2375.5 ± 43.8 | 2328.1   | 2450.3   | 11.86 ± 0.30 |
| `sortuniq` | 2593.2 ± 28.4 | 2535.7   | 2640.9   | 12.95 ± 0.26 |
| `naive`    | 5409.9 ± 23.3 | 5378.0   | 5453.3   | 27.01 ± 0.47 |