I Made Zig Compute 33M Satellite Positions in 3 Seconds. No GPU Required

原始链接: https://atempleton.bearblog.dev/i-made-zig-compute-33-million-satellite-positions-in-3-seconds-no-gpu-required/

## Astroz: A Fast SGP4 Implementation

Astroz is a new, heavily optimized implementation of SGP4 (Simplified General Perturbations 4) that reaches **11-13 million propagations/second** in native Zig and **~7 million/second** from Python via `pip install astroz`. It significantly outperforms many existing implementations thanks to careful design choices and features specific to Zig. The optimization targets speed for workloads that need dense time resolution: generating ephemeris data, precise pass prediction, and conjunction screening. Key improvements include **branchless code** for predictable CPU execution and extensive **compile-time precomputation** in Zig, which reduces runtime overhead. Crucially, Astroz uses **SIMD (Single Instruction, Multiple Data)** processing to handle four data points at once. While `heyoka.py` can be faster when processing many satellites concurrently, Astroz excels at time-batched propagation (many time points for a single satellite), where it is twice as fast. To keep the SIMD path efficient, a new **polynomial approximation of `atan2`** was implemented. Astroz offers three propagation modes (time-batched, satellite-batched, and a cache-aware "constellation" mode) to suit different workloads. An online demo propagates the entire CelesTrak catalog (13,000 satellites) over a full day in just 3.3 seconds. The project is open source, available on PyPI and GitHub, with SDP4 support and multithreading planned.

A developer used the Zig programming language to compute 33 million satellite positions in 3 seconds, *without* a GPU. The key to the speed is exploiting the SIMD (Single Instruction, Multiple Data) capabilities of standard processors, in particular through a memory layout optimized for vectorization. Hacker News commenters praised Zig for making SIMD relatively easy to use. However, some pointed out that the throughput comparison chart in the original post is visually misleading: the y-axis does not start at zero, exaggerating the perceived gain (which is actually about 2x). The discussion highlights that clever code optimization on standard CPUs can deliver substantial speedups, often removing the need for dedicated graphics hardware for certain computations. The post also serves as a good introduction to SIMD for those unfamiliar with the concept.

Original Article

I've spent the past month optimizing SGP4 propagation and ended up with something interesting: astroz is now the fastest general purpose SGP4 implementation I'm aware of, hitting 11-13M propagations per second in native Zig and ~7M/s through Python with just pip install astroz. This post breaks down how I got there.

A note on "general purpose": heyoka.py can be faster for batch-processing many satellites simultaneously (16M/s vs 7.5M/s). But it's a general ODE integrator with SGP4 as a module, requiring LLVM for JIT compilation and a C++ dependency stack that conda-forge recommends over pip. For time batched propagation, many time points for one satellite, astroz is 2x faster (8.5M/s vs 3.8M/s). Full comparison below. I'm also skipping GPU accelerated SGP4 implementations. They can be faster for massive batch workloads, but require CUDA/OpenCL setup and aren't what I'd consider "general purpose."

Why Bother Optimizing SGP4?

SGP4 is the standard algorithm for predicting satellite positions from TLE data. It's been around since the 80s and most implementations are straightforward ports of the original reference code. They work fine. You can read the implementation that I followed from SpaceTrack Report No. 3.

But "fine" starts to feel slow when you need dense time resolution. Generating a month of ephemeris data at one-second intervals is 2.6 million propagations per satellite. Pass prediction over a ground station network might need sub-second precision across weeks. Trajectory analysis for conjunction screening wants fine-grained time steps to catch close approaches. At 2-3M propagations per second (typical for a good implementation), these workloads take seconds per satellite—that adds up fast when you're doing iterative analysis or building interactive tools.

I wanted to see how fast I could make it.

Starting Point: Already Faster Than Expected

Before I even started thinking about SIMD, the scalar implementation was already matching or beating the Rust sgp4 crate, the fastest open-source implementation I could find (general purpose). I hadn't done anything clever yet; the speed came from design choices that happened to play well with how Zig compiles.

Two things mattered most:

Branchless hot paths. The SGP4 algorithm has a lot of conditionals: deep space vs. near-earth, different perturbation models, and convergence checks in the Kepler solver. I wrote these as branchless expressions where possible, not for performance reasons initially, but because it made the code easier to reason about. It turned out to be a happy accident: modern CPUs love predictable instruction streams.
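To make the scalar idea concrete, here is a minimal sketch of one common way to write a branchless selection in Zig. It is illustrative only; the names and values are hypothetical, not astroz's internals:

const std = @import("std");

// Turn the comparison into a 0.0/1.0 weight and blend the two candidate
// values, so there is no conditional jump for the branch predictor to miss.
fn blendSelect(cond: bool, if_true: f64, if_false: f64) f64 {
    const w: f64 = @floatFromInt(@intFromBool(cond));
    return w * if_true + (1.0 - w) * if_false;
}

test "branchless select picks the right value" {
    try std.testing.expectEqual(@as(f64, 2.0), blendSelect(true, 2.0, 5.0));
    try std.testing.expectEqual(@as(f64, 5.0), blendSelect(false, 2.0, 5.0));
}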

Comptime precomputation. Zig's comptime lets you run arbitrary code at compile time. A lot of SGP4's setup work (gravity constants, polynomial coefficients, derived parameters) can be computed once and baked into the binary. No runtime initialization and no repeated calculations. I didn't have to do anything special: Zig is smart enough that if you mark a variable as const with a compile-time-known value, it treats it as comptime automatically.

const j2: comptime_float = 1.082616e-3; // explicit comptime (not needed)
const j2 = 1.082616e-3; // implied comptime because of `const` (what I used)
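The same folding applies to derived values. Continuing from the j2 constant above, here is a small illustrative sketch (j3 is the WGS-72 zonal harmonic; the ratio is just an example, not a claim about astroz's exact internals):

const j3 = -2.53881e-6;
// Both operands are comptime-known, so the division happens at compile time
// and only the result is baked into the binary.
const j3_over_j2 = j3 / j2;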

The result was a scalar implementation running at ~5.2M propagations per second, already slightly faster than Rust's ~5.0M, though within a margin of error. But I started to see some room to go faster. SGP4, by design, doesn't rely on state to calculate positions: each satellite and each time point is independent. This algorithm feels tailor-made for something I have always been too afraid to try: SIMD.

Discovering Zig's SIMD Superpowers

I had heard the nightmare stories about implementing SIMD: most of the time it's not worth it, it adds too much complexity, you have to build for different platforms, and the syntax itself is weird to write and think about.

I was pleasantly surprised to learn that Zig, whether on purpose or not, makes SIMD a first-class citizen. This is all enabled by a powerful and sane set of builtins that handle the weird stuff for me. Now I just had to wrap my head around the basic flow of things in SIMD.

I started with this foundation, a simple type declaration:

const Vec4 = @Vector(4, f64);

That's all you need to start. I now have a 4-wide vector of 64-bit floats. No intrinsics, no platform detection, no conditional compilation. The LLVM backend handles targeting the right instruction set for wherever the code runs.
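As an aside, you don't even have to hard-code the width of 4: in recent Zig versions the standard library can suggest a lane count for the build target. A sketch of that option (not what astroz does; astroz fixes the width at 4 as shown above):

const std = @import("std");

// Ask the standard library for a good vector width for the target CPU,
// falling back to 4 lanes if it has no suggestion.
const lanes = std.simd.suggestVectorLength(f64) orelse 4;
const VecN = @Vector(lanes, f64);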

The builtin vector operations are equally simple:

// Broadcast a scalar to all lanes
const twoPiVec: Vec4 = @splat(constants.twoPi);

// Auto-vectorized transcendentals through LLVM
pub fn sinSIMD(x: Vec4) Vec4 {
    return @sin(x);
}

pub fn cosSIMD(x: Vec4) Vec4 {
    return @cos(x);
}

The @sin and @cos builtins map directly to LLVM intrinsics, which use platform optimal implementations like libmvec on Linux x86_64. No manual work required.

Learning to Think in Lanes

The arithmetic was easy. What took me a while to internalize was branching.

In scalar code, you write if statements and the CPU takes one path or the other. In SIMD, all four lanes execute together. If lane 0 needs the "true" branch and lane 2 needs the "false" branch, you can't just branch, you have to compute both outcomes and then pick per lane.

Here's a concrete example. The scalar SGP4 code has a check like this:

// Scalar version
var result: f64 = undefined;
if (eccentricity < 1.0e-4) {
    result = simple_calculation(x);
} else {
    result = complex_calculation(x);
}

In SIMD, this becomes:

// SIMD version - compute both, select per-lane
const simple_result = simple_calculation(x);
const complex_result = complex_calculation(x);
const mask = eccentricity < @as(Vec4, @splat(1.0e-4));
const result = @select(f64, mask, simple_result, complex_result);

This felt wasteful at first. Why compute both paths? But modern CPUs are so fast at arithmetic that computing both and selecting is often faster than branch misprediction. Plus, for SGP4, most satellites take the same path anyway, so we're rarely doing truly "wasted" work.

The trickier case was convergence loops. SGP4's Kepler solver iterates until each result converges. In scalar code:

// Scalar Kepler solver
while (@abs(delta) > tolerance) {
    // iterate...
}

But in SIMD, different lanes converge at different rates. Lane 0 might converge in 3 iterations while lane 3 needs 5. You can't exit early for just one lane. The solution is to track convergence per lane with a mask and use @reduce to check if everyone's done:

// SIMD Kepler solver
var converged: @Vector(4, bool) = @splat(false);
while (!@reduce(.And, converged)) {
    // iterate...
    converged = @abs(delta) <= tolerance_vec;
}
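To make the loop pattern concrete, here's a minimal self-contained sketch of a per-lane-masked Newton iteration for the classical Kepler equation M = E - e*sin(E). It is illustrative only, not astroz's actual solveKeplerV4 (which also handles SGP4's long-period terms):

const Vec4 = @Vector(4, f64);

// Solve M = E - e*sin(E) for E on all four lanes at once. Lanes that have
// already converged are frozen via @select while the others keep iterating.
fn solveKeplerSketch(mean_anomaly: Vec4, ecc: Vec4) Vec4 {
    const one: Vec4 = @splat(1.0);
    const tol: Vec4 = @splat(1.0e-12);
    var e_anom = mean_anomaly; // initial guess E0 = M
    var converged: @Vector(4, bool) = @splat(false);
    var iter: usize = 0;
    while (!@reduce(.And, converged) and iter < 25) : (iter += 1) {
        const f = e_anom - ecc * @sin(e_anom) - mean_anomaly;
        const fp = one - ecc * @cos(e_anom);
        const delta = f / fp;
        // Converged lanes keep their value; the rest take the Newton step.
        e_anom = @select(f64, converged, e_anom, e_anom - delta);
        converged = @abs(delta) <= tol;
    }
    return e_anom;
}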

Once I understood this pattern (compute everything, mask the results, reduce to check completion), the rest of the conversion was methodical. I went through the scalar implementation line by line, keeping the original untouched so my test suite could compare outputs.

The Three Propagation Modes

With the core SIMD patterns figured out, I built three different propagation strategies for different use cases.

1. Time Batched: propagateV4

The first question I asked: what's the most common workload? For me, it was generating ephemeris data: propagating a single satellite across many time points. Pass prediction, trajectory analysis, conjunction screening: they all want dense time series for one object.

Time batched propagation processes 4 time points for one satellite simultaneously:

pub fn propagateV4(self: *const Sgp4, times: [4]f64) Error![4][2][3]f64 {
    const el = &self.elements;
    const timeV4 = Vec4{ times[0], times[1], times[2], times[3] };

    const secular = updateSecularV4(el, timeV4);
    const nm: Vec4 = @as(Vec4, @splat(el.grav.xke)) / simdMath.pow15V4(secular.a);
    const kepler = solveKeplerV4(el, secular);
    const corrected = applyShortPeriodCorrectionsV4(el, kepler, nm);
    return computePositionVelocityV4(el, corrected);
}

Usage is straightforward:

const sat = try Sgp4.init(tle);
const times = [4]f64{ 0.0, 1.0, 2.0, 3.0 }; // minutes since epoch
const results = try sat.propagateV4(times);
// results[0] = position/velocity at t=0, results[1] at t=1, etc.

This mode gave me the biggest initial speedup because it's the most cache friendly: the satellite's orbital elements stay in registers while we compute four outputs.

2. Satellite Batched: propagateSatellitesV4

The opposite workload: what if you have many satellites and need their positions at one specific time? Collision screening snapshots, catalog-wide visibility checks, that sort of thing.

Satellite batched propagation processes 4 different satellites at the same time point:

pub inline fn propagateSatellitesV4(el: *const ElementsV4, tsince: f64) Error![4][2][3]f64 {
    const tsinceVec: Vec4 = @splat(tsince);
    const secular = updateSecularSatV4(el, tsinceVec);
    const nm: Vec4 = el.xke / simdMath.pow15V4(secular.a);
    const kepler = solveKeplerSatV4(el, secular);
    const corrected = applyShortPeriodCorrectionsSatV4(el, kepler, nm);
    return computePositionVelocitySatV4(el, corrected);
}

This required a different data layout. Instead of one satellite with 4 time points, I needed 4 satellites packed together. That's where ElementsV4 comes in. It's a struct where each field is a Vec4 holding values for 4 different satellites. More on that layout later.

3. Constellation Mode: propagateConstellationV4

The third workload combines both: propagate many satellites across many time points. This is what the live demo does: 13,000 satellites across 1,440 time points.

The naive approach would be: for each satellite, compute all time points. But that thrashes the cache. By the time you finish satellite 1's 1,440 points and move to satellite 2, all the time related data has been evicted.

Constellation mode uses cache conscious tiling:

// Time tile size tuned for L1 cache (~32KB)
const TILE_SIZE: usize = 64;

// Process in tiles over time to keep data in L1/L2 cache
var timeStart: usize = 0;
while (timeStart < numTimes) {
    const timeEnd = @min(timeStart + TILE_SIZE, numTimes);

    for (batches, 0..) |*batch, batchIdx| {
        for (timeStart..timeEnd) |timeIdx| {
            const satResults = try propagateSatellitesV4(batch, times[timeIdx]);
            // Store results...
        }
    }
    timeStart = timeEnd;
}

The idea here is to process 64 time points for all satellites, then the next 64, and so on. The time values stay hot in L1 cache while we sweep through the satellite batches. The tile size (64) isn't magic; it's roughly L1_size / sizeof(working_data) rounded to a SIMD-friendly number.

In practice, constellation mode is about 15-20% faster than calling satellite batched propagation in a naive loop for large catalogs.

The atan2 Problem (and Solution)

Here's where things got interesting. SGP4's Kepler solver needs atan2, but LLVM doesn't provide a vectorized builtin for it. Calling the scalar function would break the SIMD implementation.

The solution I picked: a polynomial approximation. The key insight is that for SGP4's accuracy requirements (which are inherently limited by the model), we don't need perfect precision.

pub fn atan2SIMD(y: Vec4, x: Vec4) Vec4 {
    const abs_x = @abs(x);
    const abs_y = @abs(y);

    // Keep argument in [0, 1] for better polynomial accuracy
    const max_xy = @max(abs_x, abs_y);
    const min_xy = @min(abs_x, abs_y);
    const t = min_xy / @max(max_xy, @as(Vec4, @splat(1.0e-30)));
    const t2 = t * t;

    const c1: Vec4 = @splat(1.0);
    const c3: Vec4 = @splat(-0.3333314528);
    const c5: Vec4 = @splat(0.1999355085);
    // ... more coefficients

    var atan_t = c17;
    atan_t = atan_t * t2 + c15;
    atan_t = atan_t * t2 + c13;
    // ... Horner's method continues
    atan_t = atan_t * t;

    // Quadrant correction using branchless selects
    const swap_mask = abs_y > abs_x;
    atan_t = @select(f64, swap_mask, halfPiVec - atan_t, atan_t);

    // ... more quadrant handling with @select
    return result;
}

This polynomial approximation is accurate to ~1e-7 radians, which translates to about 10mm position error at LEO distances. That's well within SGP4's inherent accuracy limits. The algorithm itself has kilometers of uncertainty over multi-day propagations built into it.

To be honest, this math was tricky for me to wrap my head around. I had to ask AI to help me here because I was really struggling with it.

"Struct of Arrays" for Multi Satellite Processing

For processing multiple satellites, I use a "struct of arrays" layout:

pub const ElementsV4 = struct {
    grav: constants.Sgp4GravityModel,

    // Each field is a Vec4 holding values for 4 satellites
    ecco: Vec4,
    inclo: Vec4,
    nodeo: Vec4,
    argpo: Vec4,
    mo: Vec4,
    // ...

    // Pre-splatted constants (computed once at init)
    xke: Vec4,
    j2: Vec4,
    one: Vec4,
    half: Vec4,
    // ...
};

"Pre-splatting" constants eliminates repeated @splat calls in the hot path. It's a small optimization, but in code running millions of times per second, everything counts, and its an easy win.

The Benchmark Results

Native Implementation Comparison (Zig vs Rust)

First, let's compare apples to apples. Native compiled implementations:

| Scenario | astroz (Zig) | Rust sgp4 | Speedup |
| --- | --- | --- | --- |
| 1 day (minute) | 0.27 ms | 0.31 ms | 1.16x |
| 1 week (minute) | 1.99 ms | 2.04 ms | 1.03x |
| 2 weeks (minute) | 3.87 ms | 4.03 ms | 1.04x |
| 2 weeks (second) | 222 ms | 234 ms | 1.05x |
| 1 month (minute) | 8.37 ms | 8.94 ms | 1.07x |

Both implementations achieve around 5M propagations/sec for scalar (single-satellite) processing. The Zig implementation edges out Rust slightly, most likely thanks to the branchless hot paths and comptime being quite aggressive with its precomputation.

Native SIMD Throughput Comparison

[Figure: native SIMD throughput comparison]

The real gains come from SIMD. When processing multiple satellites or time points in parallel using @Vector(4, f64), throughput jumps to 11-13M propagations/sec, more than 2x faster than scalar implementations.

Python Bindings Performance

For Python users, here's how astroz compares to python-sgp4:

| Scenario | astroz | python-sgp4 | Speedup |
| --- | --- | --- | --- |
| 2 weeks (second) | 160 ms | 464 ms | 2.9x |
| 1 month (minute) | 5.9 ms | 16.1 ms | 2.7x |

[Figure: Python bindings throughput comparison]

A Note on heyoka.py

As mentioned in the intro, heyoka.py deserves attention. Here are the single-threaded benchmarks:

| Test | heyoka.py | astroz |
| --- | --- | --- |
| 8 sats × 1440 times | 16.2M/s | 7.5M/s |
| 1 sat × 1440 times | 3.8M/s | 8.5M/s |
| 100 sats × 100 times | 15.5M/s | 8.4M/s |

heyoka.py wins on multi satellite batches; astroz wins on time batched workloads. Which one you pick depends on your use case:

  • How many satellites at once? If you're propagating hundreds of satellites at the same time point (collision screening snapshots), heyoka.py wins this easily.
  • How many time points per satellite? If you're generating ephemerides, predicting passes, or doing trajectory analysis for individual satellites across many time steps, astroz is 2x faster.
  • Do you need easy deployment? astroz is pip install astroz with just NumPy. heyoka.py requires LLVM and a C++ stack that conda-forge recommends over pip.

Seeing It at Scale

Speed numbers are abstract until you see what they enable. When propagation is this fast, you stop thinking about batching and scheduling, you just compute what you need, when you need it.

The Live Demo

To show what this looks like, I built an interactive Cesium visualization that propagates the entire active catalog from CelesTrak (~13,000 satellites) across 1440 time points (a full day at minute resolution). That's ~19 million propagations completing in about 2.7 seconds. Add ~0.6 seconds for TEME→ECEF coordinate conversion, and you get a full day of orbital data for every tracked satellite in about 3.3 seconds. (This demo runs through Python bindings at ~7M/sec; native Zig hits 11-13M/sec.)
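The arithmetic checks out as a quick comptime sanity check; the constants below just restate the numbers quoted above:

// 13,000 satellites x 1,440 minutes in a day.
const sats = 13_000;
const steps = 1_440;
const total_props = sats * steps; // 18_720_000, i.e. ~19 million propagations
// At ~7M propagations/s through the Python bindings that is roughly 2.7 s,
// plus ~0.6 s of TEME->ECEF conversion: ~3.3 s for the full day.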

Looking Forward

Next up: SDP4 for deep space objects (the current implementation only handles near-earth satellites with periods under 225 minutes), and multithreading to scale across cores. The SIMD work here was single-threaded; there's another multiplier waiting.

Try It Yourself

astroz is available on PyPI:

pip install astroz

Or add it to your Zig project:

zig fetch --save git+https://github.com/ATTron/astroz/#HEAD

The code is open source on GitHub. Stars, issues, and contributions welcome.


Browse the examples to integrate astroz into your own projects, or try the live demo to see it in action.
