An overengineered solution to `sort | uniq -c` with 25x throughput (hist)

Original link: https://github.com/noamteyssier/hist-rs

## Hist: A Fast Unique Line Counter

`hist` is a command-line tool designed to efficiently count the unique lines in a file or on standard input. It replicates the functionality of `cat | sort | uniq -c | sort -n` with significantly better performance.

Key features include options to exclude or include lines matching a pattern (`-e`, `-i`), filter by line abundance (`-m`, `-M`), and control output ordering (`-n`, `-d`). It can also simply emit the unique lines without counting (`-u`).

A benchmark on a 100M-line FASTQ file shows the speed advantage: `hist` finishes in about 200 ms, ahead of `cuniq` (~434 ms), `huniq` (~2375 ms), `sortuniq` (~2593 ms), and a naive implementation (~5409 ms). This makes `hist` a useful tool for efficiently analyzing large datasets.
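
These options can be combined in a single invocation. The sketch below is illustrative only: `app.log` is a hypothetical file name, and combining the flags this way is an assumption based on the documented options rather than an example from the project.

```bash
# Keep only lines matching "ERROR", drop entries seen fewer than 5 times,
# and print the histogram in descending order of abundance.
hist app.log -i ERROR -m 5 -d
```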

A high-throughput CLI to count unique lines.

This is a standalone tool with equivalent functionality to `cat <file> | sort | uniq -c | sort -n`.
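
For instance, on a hypothetical log file (`access.log` is an assumed example name, not from the original), the two forms below should produce the same counts of unique lines:

```bash
# Classic pipeline: sort every line, collapse duplicates with counts,
# then sort numerically by count.
sort access.log | uniq -c | sort -n

# Equivalent single invocation with hist.
hist access.log
```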

```bash
# count unique lines in a file
hist <file>

# count unique lines from stdin
/bin/cat <file> | hist

# skip counting and just write unique lines
hist <file> -u

# exclude lines matching a pattern while counting
hist <file> -e <pattern>

# include lines matching a pattern while counting
hist <file> -i <pattern>

# only output lines with abundance greater than or equal to a threshold
hist <file> -m <threshold>

# only output lines with abundance less than or equal to a threshold
hist <file> -M <threshold>

# sort output by the key (default: by abundance)
hist <file> -n

# sort output in descending order (default: ascending)
hist <file> -d
```

I use `nucgen` to generate a random 100M-line FASTQ file and pipe it into different tools to compare their throughput with `hyperfine`.

I am measuring the performance of functionality equivalent to `cat <file> | sort | uniq -c | sort -n`.
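
A minimal sketch of how such a comparison can be run with `hyperfine` is shown below. This is an assumed invocation, not the author's exact benchmark script; `reads.fastq` stands in for the generated 100M-line file.

```bash
# hyperfine runs each command string through a shell, so pipes work as written.
hyperfine --warmup 1 \
  'hist < reads.fastq' \
  'sort reads.fastq | uniq -c | sort -n'
```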

Tools compared:

| Command    | Mean [ms]     | Min [ms] | Max [ms] | Relative     |
|------------|---------------|----------|----------|--------------|
| `hist`     | 200.3 ± 3.3   | 195.6    | 208.7    | 1.00         |
| `cuniq`    | 434.3 ± 6.6   | 424.7    | 442.9    | 2.17 ± 0.05  |
| `huniq`    | 2375.5 ± 43.8 | 2328.1   | 2450.3   | 11.86 ± 0.30 |
| `sortuniq` | 2593.2 ± 28.4 | 2535.7   | 2640.9   | 12.95 ± 0.26 |
| `naive`    | 5409.9 ± 23.3 | 5378.0   | 5453.3   | 27.01 ± 0.47 |