Show HN: TokenDagger – A tokenizer faster than OpenAI's Tiktoken

Original link: https://github.com/M4THYOU/TokenDagger

TokenDagger is a high-performance, drop-in replacement for OpenAI's TikToken, designed for large-scale text processing. Benchmarks on an AMD EPYC 4584PX show 2x higher throughput and 4x faster tokenization of code. Its key optimizations are a fast PCRE2-based regex engine for efficient token pattern matching and a simplified byte pair encoding (BPE) algorithm that minimizes the performance overhead of large special-token vocabularies. TokenDagger is fully compatible with OpenAI's TikToken and can be integrated seamlessly without code changes. Installation instructions and performance benchmarks (using the llama and mistral tokenizers) are provided and confirm its performance advantage, particularly on code tokenization tasks. Users can easily switch to TokenDagger for faster, more efficient text processing.
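
As a rough illustration of the drop-in claim, a swap could look like the minimal sketch below. The "tokendagger" import name and the tiktoken-style get_encoding/encode API are assumptions here, not confirmed details; check the repository for the actual import path.

# Hedged sketch of a drop-in swap; the "tokendagger" module name and the
# tiktoken-style API (get_encoding / encode) are assumptions for illustration.
# import tiktoken                  # before
import tokendagger as tiktoken     # after: the rest of the code stays unchanged

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("TokenDagger aims to tokenize this text faster.")
print(len(tokens), tokens[:8])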

TokenDagger, a C++17-based tokenizer, has been introduced as a drop-in replacement for OpenAI's Tiktoken. The author, matthewolfe, highlights significant performance improvements, achieving up to 4x faster tokenization on a single thread and 2-3x higher throughput on large text files. The speed gains are attributed to a faster JIT-compiled regex engine and a simplified algorithm that avoids regex matching for special tokens. The author emphasizes the importance of optimizing LLM performance, aligning with the principle of "Make it work, make it fast, make it pretty." The community reaction is positive, with discussions around the benefits of C++ for AI infrastructure and the potential for further optimization. While some argue that tokenization isn't typically the bottleneck in LLM performance, others recognize the value of a faster tokenizer, especially when dealing with large datasets. The author has already addressed feedback, ensuring TokenDagger is now a fully compatible drop-in replacement.
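
The "avoids regex matching for special tokens" idea can be sketched roughly as follows: locate special tokens with plain substring search and only hand the ordinary text in between to the regex/BPE path. This is a minimal illustration of the general technique, not TokenDagger's actual code; the function and variable names are invented.

def split_on_special_tokens(text, special_tokens):
    """Illustrative sketch: find special tokens with plain substring search
    instead of compiling them into one large regex alternation.
    Yields ("text", chunk) for ordinary spans (which would then go through
    the regex + BPE path) and ("special", token_id) for special tokens.
    All names here are invented for illustration."""
    pos = 0
    while pos < len(text):
        # Earliest occurrence of any special token at or after `pos`.
        best = None
        for tok, tok_id in special_tokens.items():
            idx = text.find(tok, pos)
            if idx != -1 and (best is None or idx < best[0]):
                best = (idx, tok, tok_id)
        if best is None:
            yield ("text", text[pos:])
            break
        idx, tok, tok_id = best
        if idx > pos:
            yield ("text", text[pos:idx])   # ordinary text: regex + BPE later
        yield ("special", tok_id)           # special token: direct ID, no regex
        pos = idx + len(tok)

specials = {"<|endoftext|>": 100257}
print(list(split_on_special_tokens("hello<|endoftext|>world", specials)))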

Original Article

License: MIT | Python 3.8+ | PyPI version

A fast implementation of OpenAI's TikToken, designed for large-scale text processing. 2x the throughput and 4x faster on code sample tokenization.

Performed on an AMD EPYC 4584PX - 16c/32t - 4.2 GHz.

Throughput Benchmark Results

  • Fast Regex Parsing: Optimized PCRE2 regex engine for efficient token pattern matching
  • Simplified BPE: Simplified algorithm that reduces the performance impact of large special token vocabularies (see the BPE sketch after this list)
  • OpenAI Compatible: Full compatibility with OpenAI's TikToken tokenizer
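
For context on the Simplified BPE bullet, tiktoken-style byte pair encoding repeatedly merges the adjacent pair of byte chunks with the lowest merge rank. Below is a compact, unoptimized sketch of that core loop, assuming a `ranks` dict that maps merged byte sequences to ranks; it is a generic illustration, not TokenDagger's implementation.

def bpe_encode(piece, ranks):
    """Generic, unoptimized byte-pair-encoding sketch (tiktoken-style).
    `piece` is a bytes object; `ranks` maps byte sequences to merge ranks
    and doubles as the vocabulary. Names and structure are assumptions."""
    parts = [piece[i:i + 1] for i in range(len(piece))]
    while len(parts) > 1:
        # Find the adjacent pair with the lowest (best) merge rank.
        best_rank, best_i = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break  # no mergeable pair remains
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    # In a real vocabulary every remaining chunk has a rank, so this lookup is total.
    return [ranks[p] for p in parts]

ranks = {b"l": 0, b"o": 1, b"w": 2, b"lo": 3, b"low": 4}
print(bpe_encode(b"low", ranks))  # -> [4]: merges l+o, then lo+w
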
make clean && make
pip3 install tiktoken
python3 tests/test_tokendagger_vs_tiktoken.py --tokenizer llama
python3 tests/test_tokendagger_vs_tiktoken.py --tokenizer mistral
python3 tests/performance_benchmark.py --tokenizer llama
python3 tests/performance_benchmark.py --tokenizer mistral
python3 tests/code_performance_benchmark.py --tokenizer llama
================================================================================
🎉 CONCLUSION: TokenDagger is 4.02x faster on code tokenization!
================================================================================
git clone git@github.com:M4THYOU/TokenDagger.git
sudo apt install libpcre2-dev
git submodule update --init --recursive
sudo apt update && sudo apt install -y python3-dev

  • PCRE2: Perl Compatible Regular Expressions - GitHub

And optionally, for running the tests, TikToken (pip3 install tiktoken, as shown above).