oavif: Faster target quality image compression

Original link: https://giannirosato.com/blog/post/oavif/

## oavif: Fast target quality image encoding

oavif is a new tool designed for extremely fast target quality encoding, which is essential for consistently delivering images at a desired visual quality, something both website owners and CDNs depend on. Rather than relying on an encoder's internal quality setting, oavif focuses on hitting a specific perceptual quality score as judged by a metric that represents human vision.

oavif achieves its speed through improvements in three key areas: the **metric**, the **encoder**, and the **convergence algorithm**. It leverages `fssimu2`, a faster implementation of the SSIMULACRA2 metric, which significantly reduces scoring time. It uses the AVIF format and the efficient `libaom` encoder, benefiting from recent consistency improvements.

Its core innovation is its convergence algorithm, which goes beyond simple binary search. oavif employs **predictive modeling**, trained on image data, to intelligently guess the best encoder settings, greatly reducing the number of encodes required. A further optimization using **error bounds** refines the search space for even faster convergence.

Testing shows that oavif achieves a significant speedup over traditional approaches, with up to 63.1% fewer encodes. oavif is built in Zig for high performance and aims to make high-quality, consistent image encoding accessible and efficient. Future development will focus on improving the accuracy of the predictive modeling by incorporating image-specific data.


Original text

oavif is a new approach to target quality encoding in image compression, designed around smarter convergence strategies and quicker scoring to be as fast as possible.

Why?

Target quality encoding is one of the highest impact use cases for image compression. A target quality encoder framework aims to produce an image encoded at a particular quality set by the user according to some metric or visual quality index. This kind of encoder framework is useful for a variety of users, ranging from small website owners to content delivery networks pushing vast quantities of image data through the Web.

The value of target quality encoding is perceptual consistency. If I rely entirely on my encoder's internal quality index (often set with a "q" parameter), I may not get outputs of perfectly consistent quality when using the same "q" across different images. Relying on a metric that represents the viewer's experience is the solution to this; targeting a representative score within this metric will ensure you always receive an optimally encoded file that never undershoots and ruins image quality, and never overshoots and wastes data.

oavif is a tool to do target quality encoding extremely quickly. There are three core components to a target quality encoding framework: the metric, the encoder, and the convergence algorithm. oavif aims to leverage or improve the state of the art in all three categories.

I think this use case has been neglected because it sits in an awkward spot, stuck between encoder development and content deployment. Because image encoders and powerful metrics are fast, it is easy to take them for granted and build inefficient frameworks around them. Slow frameworks waste valuable resources; processing images is expensive. I built oavif with the same approach I've adopted when building encoders, where every CPU cycle counts.

Metric

There are good metrics and bad metrics in the context of what humans care about in images. PSNR is a bad metric; targeting a PSNR score doesn't mean anything to users, because images at the same PSNR could look completely different. However, PSNR is very fast, and faster metrics lend themselves more favorably to target quality encoding.

SSIMULACRA2 correlates highly with subjective human quality ratings, but it is slow compared to simpler metrics. I set out to remedy this with fssimu2, a faster implementation that also uses almost 40% less memory. This is what oavif uses, and it makes computing the in-loop metric much faster than the reference library.

Scoring a 4K test image against a distorted sample, averaged across 8 runs: Butteraugli (a perceptual metric from the libjxl project) took 2455 ms, the reference SSIMULACRA2 implementation took 1162 ms, and fssimu2 took 631.9 ms. Testing was done on my M2 MacBook Air using hyperfine.

fssimu2 performance graph

Encoder

AVIF is a capable Web image format. oavif uses libaom (via libavif) because it is the best open-source image encoder available relative to its speed. I worked on improving AVIF encoding in 2024 through SVT-AV1-PSY; Google (with help from Julio Barba) later adopted this work and advanced it further in libaom. It is now used by some websites you may know, such as The Guardian.

Aside from speed, encoder consistency is valuable in the context of target quality encoding (I'll explain more about why later). In fact, a perfectly consistent encoder would be able to eliminate the need for targeting entirely since the encoder's user-configurable Q would map perfectly to some perceptual index. libaom has had engineering effort go into encoder consistency, which is a valuable thing.

Convergence

A simple convergence loop looks like this:

  • Decode input, pass to encoder
  • Decode encoder output & compare to input with metric
  • If we hit the target metric score, finish; otherwise, re-encode with modified settings and repeat

The most important part here is how we decide to modify our settings. This is the convergence algorithm that lets us search for the best encoder Q. The easiest way to do this is with binary search, and some more recent implementations have utilized clever interpolation, using past data to inform the next guess based on the fact that encoder Q and metric score are known to be correlated.
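
To make the loop concrete, here is a minimal, self-contained sketch of a binary search convergence loop in Zig. This is not oavif's actual code: encodeAndScore is a hypothetical stand-in for the encode, decode, and metric steps, faked here with the inverse of the exponential Q/score model discussed later in this post.

const std = @import("std");

// Hypothetical stand-in for "encode at Q, decode, score against the source".
// A real loop would call libaom/libavif and fssimu2; here we fake the score
// with the inverse of the exponential model described later.
fn encodeAndScore(q: u32) f64 {
    const qf: f64 = @floatFromInt(@max(q, 1));
    return @log(qf / 6.83) / 0.0282;
}

pub fn main() void {
    const target: f64 = 80.0; // example target (oavif's default)
    const threshold: f64 = 2.0; // acceptable distance from the target
    var lo: u32 = 0;
    var hi: u32 = 100;
    var passes: u32 = 0;

    while (lo < hi) {
        const q = lo + (hi - lo) / 2; // halve the search space each pass
        const score = encodeAndScore(q);
        passes += 1;
        if (@abs(score - target) <= threshold) break; // within the window: done
        if (score > target) {
            hi = q; // overshot quality, search lower Q
        } else {
            lo = q + 1; // undershot quality, search higher Q
        }
    }
    std.debug.print("converged in {d} passes\n", .{passes});
}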

oavif takes inspiration from both of these, adding predictive modeling alongside error-informed search space correction to improve search times significantly. To understand why, we'll walk through each stage of the implementation.

This testing was done using the Daala subset2 image dataset. Importantly, I only used this dataset for validation; oavif was not designed around this specific dataset in any way. When testing, the oavif configuration was left at defaults; only the convergence implementation was modified. The threshold for meeting the target score is ±2.0 in oavif by default, and the default target score is 80.0 as measured by fssimu2 because it is a reasonable "high fidelity" target.

Everyone with some algorithms background will start with binary search. Set your bounds for encoder Q to 0..100, and divide the range in half each time you test. In oavif, a pure binary search implementation at default settings yields the following results:

Average encoding time: 467.95 ms ± 94.64
Average passes: 3.20 ± 0.45 (max: 4 min: 2)

Interpolation

An interpolation-based target quality search works by iteratively probing, measuring, and narrowing the search interval, just like binary search. The difference is that it tries to model the score-vs-quantizer curve with interpolation (linear, quadratic, etc.) as more data accumulates. This should theoretically reduce the number of necessary encodes, and it can start with standard binary search when there is not yet enough data to interpolate.

Metric score vs. encoder Q is generally (though not perfectly) a monotonic curve. Interpolation-based inverse estimation uses the measured points to approximate that curve and solve for the quantizer that would produce the target score. Higher-order methods use more shape information and are thus theoretically more accurate. Adding linear and quadratic interpolation support to oavif, we see a small reduction in the average number of passes on subset2:

Average encoding time: 468.98 ms ± 97.28
Average passes: 3.12 ± 0.39 (max: 4 min: 2)

This is a 2.5% improvement. We still need a minimum of two passes to accurately target.
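
As an illustration of the inverse-estimation step, the linear version can be sketched in a few lines; estimateQ is a hypothetical helper, not code from oavif, and quadratic interpolation follows the same pattern with three points:

// Hypothetical helper: given two measured (Q, score) probes, solve the
// straight line between them for the Q expected to produce the target
// score, clamped to the valid Q range.
fn estimateQ(q1: f64, s1: f64, q2: f64, s2: f64, target: f64) u32 {
    if (s1 == s2) return @intFromFloat(@round((q1 + q2) / 2.0)); // flat segment: fall back to the midpoint
    const q = q1 + (target - s1) * (q2 - q1) / (s2 - s1);
    return @intFromFloat(@round(@min(100.0, @max(0.0, q))));
}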

Predictive Modeling

This feature uses an exponential curve trained on the gb82 image dataset with libaom (at speed 9, 10-bit, 4:4:4 chroma). The curve looks like this:

libaom predicted Q from SSIMULACRA2 curve

Based on this, we can write some very simple code to predict a Q value from the target score:

fn predictQFromScore(tgt: f64) u32 {
    // Exponential fit from the gb82 dataset (libaom speed 9, 10-bit, 4:4:4)
    const q = 6.83 * @exp(0.0282 * tgt);
    return @intFromFloat(@min(100.0, @round(q)));
}
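
For the default target of 80.0, this evaluates to 6.83 · e^(0.0282 · 80) ≈ 65, which is the same Q = 65 prediction used in the worked example in the next section.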

This yields the biggest average improvement in this testing so far, decreasing average pass count by 56.4% versus interpolation search and 57.5% versus binary search.

Average encoding time: 218.33 ms ± 114.16
Average passes: 1.36 ± 0.78 (max: 3 min: 1)

The gb82 image set is fairly low-resolution mixed photographic content, while Daala subset2 is medium-resolution photographic content with less variation. The fact that the model generalizes so well is exciting.

It is at this stage that encoder consistency becomes important. A more consistent encoder will diverge from our model's predictions less frequently, and theoretically result in a faster target quality loop.

Error Bounds

Because our initial predictions tend to be so accurate, we can use them to aggressively narrow our search space without incurring too much risk of a search space collapse.

The basis of this is that pairing prediction with plain binary search is often counterproductive. Let's say we would like to target score = 80, and our model predicts we need Q = 65. We encode and score 82.38. Plain binary search is now forced to search (0..65), a larger range than if we had just avoided prediction in the first place (our search space would be 50..100 in that case). This is in spite of the fact that our prediction was very close to the target.

oavif uses the distance from the target to its advantage:

// Distance of the first (predicted) pass's score from the target
const abs_err = @abs(e.t.score - o.score_tgt);
if (pass == 0) {
    // Width of the narrowed search space, scaled by the prediction error
    const err_bound: u32 = @intFromFloat(@ceil(abs_err) * 4.0);
    if (e.t.score - o.score_tgt > 0) {
        // Overshot the target score: search below the predicted Q
        hi_bound = e.q;
        lo_bound = if (e.q > err_bound) e.q - err_bound else 0;
    } else {
        // Undershot the target score: search above the predicted Q
        lo_bound = e.q;
        hi_bound = @min(100, e.q + err_bound);
    }
}

In this case, the error was 2.38; @ceil() brings this to 3, and we multiply by 4 to get an error bound of 12, because the midpoint of the resulting range tends to land very close to the target value based on my testing. Here that narrows the next search to Q in 53..65, with a midpoint probe of Q = 59. The performance improves in kind:

Average encoding time: 194.50 ms ± 69.89
Average passes: 1.18 ± 0.39 (max: 2 min: 1)

This takes 13.2% fewer passes than interpolation + prediction, and 63.1% fewer passes than binary search.

oavif Convergence Passes

The minimum number of passes needed by a naive binary search or interpolation-informed search is now the maximum number of passes we need to converge on the target on Daala subset2. You'll also notice the standard deviation went down because the ceiling has been lowered.

Architecture

We've made it work and we've made it good, so now we can make it fast. oavif is written in Zig, and uses available high-performance C decoder libraries for handling inputs and decoding AVIF in the convergence loop. All image I/O during convergence is done in memory, and a buffer is kept of our latest encode to write to a file if we meet the target in the search space.
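
As a hedged sketch of that buffering scheme (the names and structure here are illustrative, not oavif's internals):

const std = @import("std");

// Illustrative shape of the in-loop state: the most recent encode stays in
// memory, and we only touch the filesystem once the target has been met.
const LoopState = struct {
    latest_encode: []const u8, // AVIF bytes from the most recent pass
    latest_q: u32,
    converged: bool,
};

fn flushIfConverged(state: LoopState, path: []const u8) !void {
    if (!state.converged) return; // nothing worth writing yet
    const file = try std.fs.cwd().createFile(path, .{});
    defer file.close();
    try file.writeAll(state.latest_encode);
}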

Efforts have gone into making oavif comparable to libavif's avifenc in terms of features as well. It supports high bit depth I/O, ICC profile handling for most formats, user-configurable encoder settings, and better defaults (until tune=iq becomes the libaom default in libavif).

Future Directions

Architecturally, it would be trivial to keep a history of buffers active and always pick from the history, even if our loop doesn't converge on the target. I opted to avoid this for now because it dramatically increases memory usage, but if I receive widespread feedback that memory is unimportant I'll consider an implementation. In its current state, we hit the in-loop buffer the vast majority of the time anyway.

I think the future of this kind of workflow is far more accurate predictive modeling. I believe it is possible to improve what I've done if we provide details about the source image as another term in the equation (like variance or entropy) and train our prediction mechanism on this additional data. I'm optimistically convinced this could result in a very high success rate for one-shot targeting.
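
One possible shape for such a model is sketched below. This is speculative: predictQFromScoreAndFeature and its feature term are hypothetical, and the coefficient vector would come from an offline fit on training data rather than anything that exists today.

// Hypothetical extension of the earlier model: the same exponential family,
// with one image-derived term (e.g. variance or entropy) added. The
// coefficients in coef would be learned offline from a training set.
fn predictQFromScoreAndFeature(tgt: f64, feature: f64, coef: [3]f64) u32 {
    const q = coef[0] * @exp(coef[1] * tgt + coef[2] * feature);
    return @intFromFloat(@min(100.0, @max(0.0, @round(q))));
}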

I'm looking forward to seeing more target quality workflows taking advantage of smarter targeting. If you've made it this far, thanks for reading, and enjoy oavif!
