I've spent the past month optimizing SGP4 propagation and ended up with something interesting: astroz is now the fastest general purpose SGP4 implementation I'm aware of, hitting 11-13M propagations per second in native Zig and ~7M/s through Python with just pip install astroz. This post breaks down how I got there.
A note on "general purpose": heyoka.py can be faster for batch-processing many satellites simultaneously (16M/s vs 7.5M/s). But it's a general ODE integrator with SGP4 as a module, requiring LLVM for JIT compilation and a C++ dependency stack that conda-forge recommends over pip. For time batched propagation, many time points for one satellite, astroz is 2x faster (8.5M/s vs 3.8M/s). Full comparison below. I'm also skipping GPU accelerated SGP4 implementations. They can be faster for massive batch workloads, but require CUDA/OpenCL setup and aren't what I'd consider "general purpose."
Why Bother Optimizing SGP4?
SGP4 is the standard algorithm for predicting satellite positions from TLE data. It's been around since the 80s, and most implementations are straightforward ports of the original reference code. They work fine. You can read the reference implementation I followed in SpaceTrack Report No. 3.
But "fine" starts to feel slow when you need dense time resolution. Generating a month of ephemeris data at one-second intervals is 2.6 million propagations per satellite. Pass prediction over a ground station network might need sub-second precision across weeks. Trajectory analysis for conjunction screening wants fine-grained time steps to catch close approaches. At 2-3M propagations per second (typical for a good implementation), these workloads take seconds per satellite—that adds up fast when you're doing iterative analysis or building interactive tools.
I wanted to see how fast I could make it.
Starting Point: Already Faster Than Expected
Before I even started thinking about SIMD, the scalar implementation was already matching or beating the Rust sgp4 crate, the fastest general-purpose open-source implementation I could find. I hadn't done anything clever yet; the speed came from design choices that happened to play well with how Zig compiles.
Two things mattered most:
Branchless hot paths. The SGP4 algorithm has a lot of conditionals: deep space vs. near-Earth handling, different perturbation models, and convergence checks in the Kepler solver. I wrote these as branchless expressions where possible, not for performance reasons initially, but because it made the code easier to reason about. That turned out to be a happy accident: modern CPUs love predictable instruction streams.
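Here's the flavor of it. This angle wrap is an illustrative example of the style, not a line lifted from astroz:
// Branchy version: wrap a negative angle into [0, 2*pi)
if (angle < 0.0) angle += constants.twoPi;
// Branchless version: the comparison becomes a 0.0 or 1.0 multiplier
angle += @as(f64, @floatFromInt(@intFromBool(angle < 0.0))) * constants.twoPi;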
Comptime precomputation. Zig's comptime lets you run arbitrary code at compile time. A lot of SGP4's setup work (gravity constants, polynomial coefficients, derived parameters) can be computed once and baked into the binary. No runtime initialization, and no repeated calculations. I didn't have to do anything special: Zig is smart enough that if you mark a variable as const with a compile-time-known initializer, it treats it as comptime automatically.
const j2: comptime_float = 1.082616e-3; // explicit comptime (not needed)
const j2 = 1.082616e-3; // implied comptime because of `const` (what I used)
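The same folding applies to anything derived from those consts. The values below are illustrative, not the exact set astroz bakes in:
const xke = 7.43669161e-2; // gravity-model constant (illustrative)
const j2 = 1.082616e-3;
// Derived at compile time; the arithmetic never runs at runtime
const tumin = 1.0 / xke;
const k2 = 0.5 * j2;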
The result was a scalar implementation running at ~5.2M propagations per second, already slightly faster than Rust's ~5.0M, though within the margin of error. And I started to see room to go faster. SGP4, by design, doesn't rely on state to calculate positions: each satellite and each time point is independent. The algorithm feels tailor-made for something I have always been too afraid to try: SIMD.
Discovering Zig's SIMD Superpowers
I had heard the nightmares about implementing SIMD: most of the time it's not worth it, it adds too much complexity, you have to build for different platforms, and the syntax itself is weird to write and think about.
I was pleasantly surprised to learn that Zig, whether on purpose or not, makes SIMD a first-class citizen. The language ships sane builtins that handle the weird stuff for me. All I had to tackle was learning how to think through the basic flow of the algorithm in SIMD.
I started with this foundation, a simple type declaration:
const Vec4 = @Vector(4, f64);
That's all you need to start. I now have a 4-wide vector of 64-bit floats. No intrinsics, no platform detection, no conditional compilation. The LLVM backend handles targeting the right instruction set for wherever the code runs.
The builtins for vector operations are equally simple:
// Broadcast a scalar to all lanes
const twoPiVec: Vec4 = @splat(constants.twoPi);
// Auto-vectorized transcendentals through LLVM
pub fn sinSIMD(x: Vec4) Vec4 {
return @sin(x);
}
pub fn cosSIMD(x: Vec4) Vec4 {
return @cos(x);
}
The @sin and @cos builtins map directly to LLVM intrinsics, which use platform-optimal implementations like libmvec on Linux x86_64. No manual work required.
Learning to Think in Lanes
The arithmetic was easy. What took me a while to internalize was branching.
In scalar code, you write if statements and the CPU takes one path or the other. In SIMD, all four lanes execute together. If lane 0 needs the "true" branch and lane 2 needs the "false" branch, you can't just branch; you have to compute both outcomes and then pick per lane.
Here's a concrete example. The scalar SGP4 code has a check like this:
// Scalar version
if (eccentricity < 1.0e-4) {
result = simple_calculation(x);
} else {
result = complex_calculation(x);
}
In SIMD, this becomes:
// SIMD version - compute both, select per-lane
const simple_result = simple_calculation(x);
const complex_result = complex_calculation(x);
const mask = eccentricity < @as(Vec4, @splat(1.0e-4));
const result = @select(f64, mask, simple_result, complex_result);
This felt wasteful at first. Why compute both paths? But modern CPUs are so fast at arithmetic that computing both and selecting is often faster than branch misprediction. Plus, for SGP4, most satellites take the same path anyway, so we're rarely doing truly "wasted" work.
The trickier case was convergence loops. SGP4's Kepler solver iterates until each result converges. In scalar code:
// Scalar Kepler solver
while (@abs(delta) > tolerance) {
// iterate...
}
But in SIMD, different lanes converge at different rates. Lane 0 might converge in 3 iterations while lane 3 needs 5. You can't exit early for just one lane. The solution is to track convergence per lane with a mask and use @reduce to check if everyone's done:
// SIMD Kepler solver
var converged: @Vector(4, bool) = @splat(false);
while (!@reduce(.And, converged)) {
// iterate...
converged = @abs(delta) <= tolerance_vec;
}
Once I understood this pattern (compute everything, mask the results, reduce to check completion), the rest of the conversion was methodical. I went through the scalar implementation line by line, keeping the original untouched so my test suite could compare outputs.
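For the curious, here's a minimal self-contained sketch of that pattern applied to a plain Newton iteration on Kepler's equation. The function name, variable names, and 10-iteration cap are mine, not astroz's actual solver, but the masking logic is the same idea:
const Vec4 = @Vector(4, f64);
// Solve E - e*sin(E) = M per lane, freezing lanes that have already converged
fn solveKeplerSketch(mean_anomaly: Vec4, ecc: Vec4) Vec4 {
    const tolerance_vec: Vec4 = @splat(1.0e-12);
    var e_anom = mean_anomaly; // initial guess: E = M
    var converged: @Vector(4, bool) = @splat(false);
    var iter: usize = 0;
    while (!@reduce(.And, converged) and iter < 10) : (iter += 1) {
        // Newton step: dE = (E - e*sin(E) - M) / (1 - e*cos(E))
        const f = e_anom - ecc * @sin(e_anom) - mean_anomaly;
        const fp = @as(Vec4, @splat(1.0)) - ecc * @cos(e_anom);
        const delta = f / fp;
        // Only update lanes that are still iterating
        e_anom = @select(f64, converged, e_anom, e_anom - delta);
        // Once a lane converges, it stays converged
        converged = @select(bool, converged, converged, @abs(delta) <= tolerance_vec);
    }
    return e_anom;
}
Converged lanes are frozen by the first @select and stay marked done by the second, so the loop exits as soon as @reduce sees every lane converged.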
The Three Propagation Modes
With the core SIMD patterns figured out, I built three different propagation strategies for different use cases.
1. Time Batched: propagateV4
The first question I asked: what's the most common workload? For me, it was generating ephemeris data, propagating a single satellite across many time points. Pass prediction, trajectory analysis, conjunction screening: they all want dense time series for one object.
Time-batched propagation processes 4 time points for one satellite simultaneously:
pub fn propagateV4(self: *const Sgp4, times: [4]f64) Error![4][2][3]f64 {
const el = &self.elements;
const timeV4 = Vec4{ times[0], times[1], times[2], times[3] };
const secular = updateSecularV4(el, timeV4);
const nm: Vec4 = @as(Vec4, @splat(el.grav.xke)) / simdMath.pow15V4(secular.a);
const kepler = solveKeplerV4(el, secular);
const corrected = applyShortPeriodCorrectionsV4(el, kepler, nm);
return computePositionVelocityV4(el, corrected);
}
Usage is straightforward:
const sat = try Sgp4.init(tle);
const times = [4]f64{ 0.0, 1.0, 2.0, 3.0 }; // minutes since epoch
const results = try sat.propagateV4(times);
// results[0] = position/velocity at t=0, results[1] at t=1, etc.
This mode gave me the biggest initial speedup because it's the most cache-friendly: the satellite's orbital elements stay in registers while we compute four outputs.
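For a dense ephemeris you just walk the time grid four points at a time. The chunking loop below is my own sketch (it assumes the step count is a multiple of 4); only propagateV4 comes from astroz:
// One day at 1-minute resolution, 4 time points per call
const step_min: f64 = 1.0;
const num_times: usize = 1440;
var results: [1440][2][3]f64 = undefined;
var i: usize = 0;
while (i < num_times) : (i += 4) {
    const t0 = @as(f64, @floatFromInt(i)) * step_min;
    const batch = try sat.propagateV4(.{ t0, t0 + step_min, t0 + 2.0 * step_min, t0 + 3.0 * step_min });
    for (batch, 0..) |pv, k| results[i + k] = pv;
}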
2. Satellite Batched: propagateSatellitesV4
The opposite workload: what if you have many satellites and need their positions at one specific time? Collision screening snapshots, catalog-wide visibility checks, that sort of thing.
Satellite-batched propagation processes 4 different satellites at the same time point:
pub inline fn propagateSatellitesV4(el: *const ElementsV4, tsince: f64) Error![4][2][3]f64 {
const tsinceVec: Vec4 = @splat(tsince);
const secular = updateSecularSatV4(el, tsinceVec);
const nm: Vec4 = el.xke / simdMath.pow15V4(secular.a);
const kepler = solveKeplerSatV4(el, secular);
const corrected = applyShortPeriodCorrectionsSatV4(el, kepler, nm);
return computePositionVelocitySatV4(el, corrected);
}
This required a different data layout. Instead of one satellite with 4 time points, I needed 4 satellites packed together. That's where ElementsV4 comes in. It's a struct where each field is a Vec4 holding values for 4 different satellites. More on that layout later.
3. Constellation Mode: propagateConstellationV4
The third workload combines both: propagate many satellites across many time points. This is what the live demo does: 13,000 satellites across 1,440 time points.
The naive approach would be: for each satellite, compute all time points. But that thrashes the cache. By the time you finish satellite 1's 1,440 points and move to satellite 2, all the time-related data has been evicted.
Constellation mode uses cache-conscious tiling:
// Time tile size tuned for L1 cache (~32KB)
const TILE_SIZE: usize = 64;
// Process in tiles over time to keep data in L1/L2 cache
var timeStart: usize = 0;
while (timeStart < numTimes) {
const timeEnd = @min(timeStart + TILE_SIZE, numTimes);
for (batches, 0..) |*batch, batchIdx| {
for (timeStart..timeEnd) |timeIdx| {
const satResults = try propagateSatellitesV4(batch, times[timeIdx]);
// Store results...
}
}
timeStart = timeEnd;
}
The idea is to process 64 time points for all satellites, then the next 64, and so on. The time values stay hot in L1 cache while we sweep through the satellite batches. The tile size (64) isn't magic; it's roughly L1_size / sizeof(working_data), rounded to a SIMD-friendly number.
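As a back-of-envelope check (the byte counts here are my assumptions, not measurements from astroz):
// ~32 KB of L1 data cache, roughly half a KB of hot working data per
// time step (the time value plus per-batch intermediates and outputs)
const l1_bytes: usize = 32 * 1024;
const hot_bytes_per_step: usize = 512; // assumption
const tile_size: usize = l1_bytes / hot_bytes_per_step; // 64, already SIMD-friendly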
In practice, constellation mode is about 15-20% faster than calling satellite-batched propagation in a naive loop for large catalogs.
The atan2 Problem (and Solution)
Here's where things got interesting. SGP4's Kepler solver needs atan2, but LLVM doesn't provide a vectorized builtin for it. Falling back to the scalar function per lane would serialize the hot path and throw away most of the SIMD gains.
The solution I picked: a polynomial approximation. The key insight is that for SGP4's accuracy requirements (which are inherently limited by the model), we don't need perfect precision.
pub fn atan2SIMD(y: Vec4, x: Vec4) Vec4 {
const abs_x = @abs(x);
const abs_y = @abs(y);
// Keep argument in [0, 1] for better polynomial accuracy
const max_xy = @max(abs_x, abs_y);
const min_xy = @min(abs_x, abs_y);
const t = min_xy / @max(max_xy, @as(Vec4, @splat(1.0e-30)));
const t2 = t * t;
const c1: Vec4 = @splat(1.0);
const c3: Vec4 = @splat(-0.3333314528);
const c5: Vec4 = @splat(0.1999355085);
// ... more coefficients
var atan_t = c17;
atan_t = atan_t * t2 + c15;
atan_t = atan_t * t2 + c13;
// ... Horner's method continues
atan_t = atan_t * t;
// Quadrant correction using branchless selects
const swap_mask = abs_y > abs_x;
atan_t = @select(f64, swap_mask, halfPiVec - atan_t, atan_t);
// ... more quadrant handling with @select
return result;
}
This polynomial approximation is accurate to ~1e-7 radians, which translates to under a meter of position error at LEO distances. That's well within SGP4's inherent accuracy limits: the algorithm itself has kilometers of uncertainty built into multi-day propagations.
To be honest, this math was tricky for me to wrap my head around. I had to ask AI to help me here because I was really struggling with it.
"Struct of Arrays" for Multi Satellite Processing
For processing multiple satellites, I use a "struct of arrays" layout:
pub const ElementsV4 = struct {
grav: constants.Sgp4GravityModel,
// Each field is a Vec4 holding values for 4 satellites
ecco: Vec4,
inclo: Vec4,
nodeo: Vec4,
argpo: Vec4,
mo: Vec4,
// ...
// Pre-splatted constants (computed once at init)
xke: Vec4,
j2: Vec4,
one: Vec4,
half: Vec4,
// ...
};
"Pre-splatting" constants eliminates repeated @splat calls in the hot path. It's a small optimization, but in code running millions of times per second, everything counts, and its an easy win.
The Benchmark Results
Native Implementation Comparison (Zig vs Rust)
First, let's compare apples to apples. Native compiled implementations:
| Scenario | astroz (Zig) | Rust sgp4 | Speedup |
|---|---|---|---|
| 1 day (minute) | 0.27 ms | 0.31 ms | 1.16x |
| 1 week (minute) | 1.99 ms | 2.04 ms | 1.03x |
| 2 weeks (minute) | 3.87 ms | 4.03 ms | 1.04x |
| 2 weeks (second) | 222 ms | 234 ms | 1.05x |
| 1 month (minute) | 8.37 ms | 8.94 ms | 1.07x |
Both implementations achieve around 5M propagations/sec for scalar (single-satellite) processing. The Zig implementation edges out Rust slightly, most likely thanks to the branchless hot paths and comptime being aggressive about precomputation.
Native SIMD Throughput Comparison
The real gains come from SIMD. When processing multiple satellites or time points in parallel using @Vector(4, f64), throughput jumps to 11-13M propagations/sec, more than 2x faster than scalar implementations.
Python Bindings Performance
For Python users, here's how astroz compares to python-sgp4:
| Scenario | astroz | python-sgp4 | Speedup |
|---|---|---|---|
| 2 weeks (second) | 160 ms | 464 ms | 2.9x |
| 1 month (minute) | 5.9 ms | 16.1 ms | 2.7x |
Python Bindings Throughput
A Note on heyoka.py
As mentioned in the intro, heyoka.py deserves attention. Here are the single-threaded benchmarks:
| Test | heyoka.py | astroz |
|---|---|---|
| 8 sats × 1440 times | 16.2M/s | 7.5M/s |
| 1 sat × 1440 times | 3.8M/s | 8.5M/s |
| 100 sats × 100 times | 15.5M/s | 8.4M/s |
heyoka.py wins on multi-satellite batches; astroz wins on time-batched workloads. Which one you pick depends on your use case:
- How many satellites at once? If you're propagating hundreds of satellites at the same time point (collision screening snapshots), heyoka.py wins this easily.
- How many time points per satellite? If you're generating ephemerides, predicting passes, or doing trajectory analysis for individual satellites across many time steps, astroz is 2x faster.
- Do you need easy deployment? astroz is pip install astroz with just NumPy. heyoka.py requires LLVM and a C++ stack for which conda-forge is recommended over pip.
Seeing It at Scale
Speed numbers are abstract until you see what they enable. When propagation is this fast, you stop thinking about batching and scheduling; you just compute what you need, when you need it.
The Live Demo
To show what this looks like, I built an interactive Cesium visualization that propagates the entire active catalog from CelesTrak (~13,000 satellites) across 1440 time points (a full day at minute resolution). That's ~19 million propagations completing in about 2.7 seconds. Add ~0.6 seconds for TEME→ECEF coordinate conversion, and you get a full day of orbital data for every tracked satellite in about 3.3 seconds. (This demo runs through Python bindings at ~7M/sec; native Zig hits 11-13M/sec.)
Looking Forward
Next up: SDP4 for deep-space objects (the current implementation only handles near-earth satellites with periods under 225 minutes), and multithreading to scale across cores. The SIMD work here was single-threaded; there's another multiplier waiting.
Try It Yourself
astroz is available on PyPI:
pip install astroz
Or add it to your Zig project:
zig fetch --save git+https://github.com/ATTron/astroz/#HEAD
The code is open source on GitHub. Stars, issues, and contributions welcome.
Browse the examples to integrate astroz into your own projects, or try the live demo to see it in action.