现代硬件的算法

现代硬件的算法
Algorithms for Modern Hardware

标题：“现代硬件算法” - 第一部分概述：性能工程这本即将出版的书《现代硬件算法》由 Sergey Slotin 撰写，旨在向包括从业者、研究人员和本科生在内的各种受众教授性能工程。该内容可在 GitHub 上找到，相关代码位于单独的存储库中。它专注于优化专为现代硬件设计的算法，同时提供超越渐近复杂性的简单改进的解决方案。涵盖的关键主题包括指令集架构、管道风险、无分支编程、编译阶段、分析方法以及浮点数、IEEE 754 浮点数和数论算法等算术。大量案例研究表明，与标准库函数相比，速度显着提高。第一部分大约有 450-600 页，预计发布日期为 2022 年第三季度，主要讨论性能工程基础知识。

我完全同意。通常，简单地退后一步并批判性地思考问题就可以带来显着的改进。事实上，这是我书中的一个关键主题——应用数学见解和严格的分析来确定性能问题的根本原因并设计有效的解决方案。至于计算资源的问题，是的，存在限制，但通常，限制因素不是原始计算能力，而是有效使用计算能力的能力。这就是并行化、矢量化和专用架构等技术的用武之地，使我们能够从可用资源中获取最大价值。但最终，你的回报会递减，而进一步的收益需要在硬件和软件进步方面付出巨大的努力和投资。近年来，机器学习和深度神经网络的进步凸显了高效计算的重要性，并推动了 HPC 和边缘计算等领域的巨大增长。这是计算机科学和工程领域激动人心的时刻！如果您喜欢阅读此讨论并发现其富有洞察力，请考虑分享并帮助传播！另外，请随时提出问题、提供建议或在未来的主题中参与进一步讨论。谢谢！顺便说一句，有一本关于这个主题的优秀书，名为《数学家的哀叹：学校如何欺骗我们最具想象力和最重要的艺术形式》，作者是保罗·洛克哈特。它讨论了数学教育，并认为学生被教授数学工作的程序方法，忽视概念和创造性解决问题的技能。作者相信，真正的数学理解可以为个人和整个社会释放有价值的见解和潜力。如果您喜欢数学、计算机科学和哲学的交叉点，强烈推荐。快乐探索！

原文

This is an upcoming high performance computing book titled “Algorithms for Modern Hardware” by Sergey Slotin.

Its intended audience is everyone from performance engineers and practical algorithm researchers to undergraduate computer science students who have just finished an advanced algorithms course and want to learn more practical ways to speed up a program than by going from $O(n \log n)$ to $O(n \log \log n)$.

All book materials are hosted on GitHub, with code in a separate repository. This isn’t a collaborative project, but any contributions and feedback are very much welcome.

FAQ

Bug/typo fixes. If you spot an error on any page, please do one of these — in the order of preference:

fix it right away by either clicking on the pencil icon on the top right on any page (opens the Prose editor) or, more traditionally, by modifying the page directly on GitHub (the link to the source is also on the top right);
create an issue on GitHub;
tell me about it directly;

or leave a comment on some other website where it is being discussed — I read most of HackerNews, CodeForces, and Twitter threads where I’m tagged.

Release date. The book is split into several parts that I plan to finish sequentially with long breaks in-between. Part I, Performance Engineering, is ~75% complete as of March 2022 and will hopefully be >95% complete by this summer.

A “release” for an open-source book like this essentially means:

finishing all essential sections and filling all the TODOs,
mostly freezing the table of contents (except for the case studies),
doing one final round of heavy copyediting (hopefully, with the help of a professional editor — I still haven’t figured out how commas work in English),
drawing illustrations (I stole a lot of those that are currently displayed),
making a print-optimized PDF and figuring out the best way to distribute it.

After that, I will mostly be fixing errors and only doing some minor edits reflecting the changes in technology or new algorithm advancements. The e-book/printed editions will most likely be sold on a “pay what you want” basis, and in any case, the web version will always be fully available online.

Pre-ordering / financially supporting the book. Due to my unfortunate citizenship and place of birth, you can’t — that is, until I find a way that at the same time complies with international sanctions, doesn’t sponsor the war, and won’t put me in prison for tax evasion.

So, don’t bother. If you want to support this book, just share it and help fix typos — that would be enough.

Translations. The website has a separate functionality for creating and managing translations — and I’ve already been contacted by some nice people willing to translate the book into Italian and Chinese (and I will personally translate at least some of it into my native Russian).

However, as the book is still evolving, it is probably not the best idea to start translating it at least until Part I is finished. That said, you are very much encouraged to make translations of any articles and publish them in your blogs — just send me the link so that we can merge it back when centralized translation starts.

“Translating” the Russian version. The articles hosted at ru.algorithmica.org/cs/ are not about advanced performance engineering but mostly about classical computer science algorithms — without discussing how to speed them up beyond asymptotic complexity. Most of the information there is not unique and already exists in English on some other places on the internet: for example, the similar-spirited cp-algorithms.com.

Teaching performance engineering in colleges. One of my goals for writing this book is to change the way computer science — algorithm design, to be more precise — is taught in colleges. Let me elaborate on that.

There are two highly impactful textbooks on which most computer science courses are built. Both are undoubtedly outstanding, but one of them is 50 years old, and the other is 30 years old, and computers have changed a lot since then. Asymptotic complexity is not the sole deciding factor anymore. In modern practical algorithm design, you choose the approach that makes better use of different types of parallelism available in the hardware over the one that theoretically does fewer raw operations on galaxy-scale inputs.

And yet, the computer science curricula in most colleges completely ignore this shift. Although there are some great courses that aim to correct that — such as “Performance Engineering of Software Systems” from MIT, “Programming Parallel Computers” from Aalto University, and some non-academic ones like Denis Bakhvalov’s “Performance Ninja” — most computer science graduates still treat modern hardware like something from the 1990s.

What I really want to achieve is that performance engineering becomes taught right after introduction to algorithms. Writing the first comprehensive textbook on the subject is a large part of it, and this is why I rush to finish it by the summer so that the colleges can pick it up in the next academic year. But creating a new course requires more than that: you need a balanced curriculum, course infrastructure, lecture slides, lab assignments… so for some time after finishing the main book, I will be working on course materials and tools for teaching performance engineering — and I’m looking forward to collaborating with other people who want to make it a reality as well.

Part I: Performance Engineering

The first part covers the basics of computer architecture and optimization of single-threaded algorithms.

It walks through the main CPU optimization topics such as caching, SIMD, and pipelining, and provides brief examples in C++, followed by large case studies where we usually achieve a significant speedup over some STL algorithm or data structure.

Planned table of contents:

0. Preface
1. Complexity Models
 1.1. Modern Hardware
 1.2. Programming Languages
 1.3. Models of Computation
 1.4. When to Optimize
2. Computer Architecture
 1.1. Instruction Set Architectures
 1.2. Assembly Language
 1.3. Loops and Conditionals
 1.4. Functions and Recursion
 1.5. Indirect Branching
 1.6. Machine Code Layout
 1.7. System Calls
 1.8. Virtualization
3. Instruction-Level Parallelism
 3.1. Pipeline Hazards
 3.2. The Cost of Branching
 3.3. Branchless Programming
 3.4. Instruction Tables
 3.5. Instruction Scheduling
 3.6. Throughput Computing
 3.7. Theoretical Performance Limits
4. Compilation
 4.1. Stages of Compilation
 4.2. Flags and Targets
 4.3. Situational Optimizations
 4.4. Contract Programming
 4.5. Non-Zero-Cost Abstractions
 4.6. Compile-Time Computation
 4.7. Arithmetic Optimizations
 4.8. What Compilers Can and Can't Do
5. Profiling
 5.1. Instrumentation
 5.2. Statistical Profiling
 5.3. Program Simulation
 5.4. Machine Code Analyzers
 5.5. Benchmarking
 5.6. Getting Accurate Results
6. Arithmetic
 6.1. Floating-Point Numbers
 6.2. Interval Arithmetic
 6.3. Newton's Method
 6.4. Fast Inverse Square Root
 6.5. Integers
 6.6. Integer Division
 6.7. Bit Manipulation
(6.8. Data Compression)
7. Number Theory
 7.1. Modular Inverse
 7.2. Montgomery Multiplication
(7.3. Finite Fields)
(7.4. Error Correction)
 7.5. Cryptography
 7.6. Hashing
 7.7. Random Number Generation
8. External Memory
 8.1. Memory Hierarchy
 8.2. Virtual Memory
 8.3. External Memory Model
 8.4. External Sorting
 8.5. List Ranking
 8.6. Eviction Policies
 8.7. Cache-Oblivious Algorithms
 8.8. Spacial and Temporal Locality
(8.9. B-Trees)
(8.10. Sublinear Algorithms)
(9.13. Memory Management)
9. RAM & CPU Caches
 9.1. Memory Bandwidth
 9.2. Memory Latency
 9.3. Cache Lines
 9.4. Memory Sharing
 9.5. Memory-Level Parallelism
 9.6. Prefetching
 9.7. Alignment and Packing
 9.8. Pointer Alternatives
 9.9. Cache Associativity
 9.10. Memory Paging
 9.11. AoS and SoA
10. SIMD Parallelism
 10.1. Intrinsics and Vector Types
 10.2. Moving Data
 10.3. Reductions
 10.4. Masking and Blending
 10.5. In-Register Shuffles
 10.6. Auto-Vectorization and SPMD
11. Algorithm Case Studies
 11.1. Binary GCD
(11.2. Prime Number Sieves)
 11.3. Integer Factorization
 11.4. Logistic Regression
 11.5. Big Integers & Karatsuba Algorithm
 11.6. Fast Fourier Transform
 11.7. Number-Theoretic Transform
 11.8. Argmin with SIMD
 11.9. Prefix Sum with SIMD
 11.10. Reading Decimal Integers
 11.11. Writing Decimal Integers
(11.12. Reading and Writing Floats)
(11.13. String Searching)
 11.14. Sorting
 11.15. Matrix Multiplication
12. Data Structure Case Studies
 12.1. Binary Search
 12.2. Static B-Trees
(12.3. Search Trees)
 12.4. Segment Trees
(12.5. Tries)
(12.6. Range Minimum Query)
 12.7. Hash Tables
(12.8. Bitmaps)
(12.9. Probabilistic Filters)

Among the cool things that we will speed up:

2x faster GCD (compared to std::gcd)
8-15x faster binary search (compared to std::lower_bound)
5-10x faster segment trees (compared to Fenwick trees)
5x faster hash tables (compared to std::unordered_map)
2x faster popcount (compared to repeatedly calling popcnt)
35x faster parsing series of integers (compared to scanf)
?x faster sorting (compared to std::sort)
2x faster sum (compared to std::accumulate)
2-3x faster prefix sum (compared to naive implementation)
10x faster argmin (compared to naive implementation)
10x faster array searching (compared to std::find)
15x faster search tree (compared to std::set)
100x faster matrix multiplication (compared to “for-for-for”)
optimal word-size integer factorization (~0.4ms per 60-bit integer)
optimal Karatsuba Algorithm
optimal FFT

Volume: 450-600 pages
Release date: Q3 2022

Part II: Parallel Algorithms

Concurrency, models of parallelism, context switching, green threads, concurrent runtimes, cache coherence, synchronization primitives, OpenMP, reductions, scans, list ranking, graph algorithms, lock-free data structures, heterogeneous computing, CUDA, kernels, warps, blocks, matrix multiplication, sorting.

Volume: 150-200 pages
Release date: 2023-2024?

Part III: Distributed Computing

Metworking, message passing, actor model, communication-constrained algorithms, distributed primitives, all-reduce, MapReduce, stream processing, query planning, storage, sharding, compression, distributed databases, consistency, reliability, scheduling, workflow engines, cloud computing.

Release date: ??? (more likely to be completed than not)

Part IV: Software & Hardware

LLVM IR, compiler optimizations & back-end, interpreters, JIT-compilation, Cython, JAX, Numba, Julia, OpenCL, DPC++, oneAPI, XLA, (basic) Verilog, FPGAs, ASICs, TPUs and other AI accelerators.

Release date: ??? (less likely to be completed than not)

Acknowledgements

The book is largely based on blog posts, research papers, conference talks, and other work authored by a lot of people:

Disclaimer: Technology Choices

The examples in this book use C++, GCC, x86-64, CUDA, and Spark, although the underlying principles conveyed are not specific to them.

To clear my conscience, I’m not happy with any of these choices: these technologies just happen to be the most widespread and stable at the moment and thus more helpful to the reader. I would have respectively picked C / Rust / Carbon?, LLVM, arm, OpenCL, and Dask; maybe there will be a 2nd edition in which some of the tech stack is changed.