Video Encoding and Decoding with Vulkan Compute Shaders in FFmpeg

Original link: https://www.khronos.org/blog/video-encoding-and-decoding-with-vulkan-compute-shaders-in-ffmpeg

## Unlocking GPU Performance for Professional Video Processing with FFmpeg and Vulkan

While video encoding and decoding is largely a solved problem for everyday users, thanks to dedicated hardware and royalty-free codecs, professional workflows such as 8K editing, VFX, and archiving still hit performance walls. Existing solutions are often expensive or demand extreme hardware. This article details how FFmpeg uses **Vulkan Compute** to accelerate video processing on consumer GPUs, extending acceleration beyond the fixed-function support of **Vulkan Video**.

The challenge is that the inherently serial dependencies of codecs conflict with the GPU's strength in parallel processing. Traditional "hybrid" approaches (CPU for serial tasks, GPU for parallel ones) introduce latency. FFmpeg's solution is **fully GPU-resident** compute shaders, exploiting the increased parallelism and cross-invocation communication of modern GPUs.

FFmpeg is implementing this for codecs including **FFv1, APV, ProRes, ProRes RAW, DPX, VC-2, and JPEG**, with significant speedups demonstrated. The approach involves GPU-specific optimizations, such as parallelizing the range coder lookups in FFv1 and exploiting tile-based processing in APV.

**Vulkan** is the key: a portable, well-supported API with an evolving feature set, including 64-bit addressing and shared memory. FFmpeg prioritizes self-sufficiency and avoids vendor lock-in, choosing purpose-built shaders over complex proprietary APIs. This brings powerful GPU acceleration to a broad audience through the widely used FFmpeg framework.

This Hacker News discussion centers on the complexity of video decoding, particularly FFmpeg's use of Vulkan compute shaders. A commenter named "sylware" notes that hardware decoding can be unreliable: even minor data corruption may cause crashes, sometimes requiring a full reboot to recover. They argue that for complex formats, software decoders are often more dependable, and that a conservative approach using simple compute shaders is preferable. A key problem is that hardware often fails silently, outputting garbage that software cannot detect. Moreover, some valid streams decode correctly on *some* decoders but not others, adding to the confusion. The commenter stresses that media players should offer an explicit option to fall back to software decoding, and hopes the project avoids SPIR-V shaders generated from complex C code. In short, reliable video decoding is a surprisingly hard problem.

Original Article
Video encoding and decoding on the internet is largely a solved problem for everyday users. Most consumer devices now ship with dedicated hardware accelerator chips, to which APIs like the Vulkan® Video extensions provide direct access. Meanwhile, newer codecs are increasingly royalty-free with open specifications — or simply age out of licensing restrictions — making the standards accessible to everyone. It's easy to forget how demanding 720p H.264 decoding was on CPUs just 18 years ago. That challenge drove intense competition and optimization among software implementations, pushing performance to the limit until hardware decoding finally became commonplace.

In professional workflows, however, performance walls still exist. Editors scrubbing through days of raw camera footage, colorists working with 8K 16-bit masters, VFX artists rendering 32-bit floating-point ACEScg video, and archivists handling extreme-resolution lossless film scans are still performance-bound. Where casual users once tolerated the occasional frame drop, today's professionals are often pushed toward expensive proprietary solutions or liquid-cooled, hundred-core workstations with hundreds of gigabytes of RAM.

This post explores how FFmpeg uses Vulkan Compute to seamlessly accelerate encoding and decoding of even professional-grade video on consumer GPUs — unlocking GPU compute parallelism at scale, without specialized hardware. This approach complements Vulkan Video's fixed-function codec support, extending acceleration to formats and workflows it doesn't cover.

Codecs

Codecs are algorithms that exploit redundancy and patterns in a signal to compress it for storage or transmission. How easy is it to parallelize codec processing on a GPU? Take JPEG, the C. elegans of compression codecs, as an illustrative example. Encoding an image requires a 2D frequency transform (partially parallelizable, processing rows then columns), DC value prediction (fully serial), quantization to discard perceptually irrelevant information (fully parallel), and finally Huffman coding (extremely serial). The mix of parallel and serial steps turns out to be the central challenge for GPU codec acceleration. Decoding reverses these steps — but the serial bottlenecks remain just as problematic. This is the fundamental tension: codec pipelines are riddled with serial dependencies, while GPUs are purpose-built to execute thousands of independent, uncorrelated operations simultaneously.
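To make the mix of serial and parallel steps concrete, here is a toy Python sketch of the JPEG-style encode stages just described. This is an illustration only (a naive DCT, not FFmpeg's implementation); the function names are ours.

```python
import math

def dct_1d(v):
    """Naive DCT-II of a 1-D sequence: the kind of separable pass a
    row- or column-parallel shader dispatch would run."""
    n = len(v)
    return [sum(v[x] * math.cos(math.pi * (x + 0.5) * k / n) for x in range(n))
            for k in range(n)]

def dct_2d(block):
    """Separable 2D transform: all rows can run in parallel, then all columns."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]  # transpose back to row-major

def quantize(coeffs, step):
    """Fully parallel: every coefficient is divided independently."""
    return [[round(c / step) for c in row] for row in coeffs]

def dc_predict(dc_values):
    """Fully serial: each DC is coded as a delta from the previous block's
    DC, so block i cannot be produced before block i-1."""
    deltas, prev = [], 0
    for dc in dc_values:
        deltas.append(dc - prev)
        prev = dc
    return deltas
```

The transform and quantizer map naturally onto one GPU invocation per row or per coefficient, while `dc_predict` is an inherently sequential chain — exactly the tension described above.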

Compromises

The historically obvious approach was hybrid decoding: handle the serial steps (like coefficient decoding) on the CPU, upload intermediate results to the GPU, then let the GPU run the parallel steps where it excels. In practice, this runs into a fundamental problem: GPUs are physically distant from system memory. Even with DMA and high-bandwidth transfers, the round-trip latency often makes hybrid decoding slower than just doing the parallel steps on the CPU — especially given how capable modern SIMD-enabled CPUs have become. Real-world results with hybrid codec implementations have confirmed this. The dav1d decoder attempted to offload its final filter pass — complex but highly parallelizable — to the GPU, but saw no gain over the CPU, even on mobile. x264 added basic OpenCL™ support, but frame upload latency killed any performance advantage, and the code eventually bitrotted. These failures have left hybrid implementations with a poor reputation in the multimedia community. The lesson is clear: to be consistently fast, maintainable, and widely adopted, compute-based codec implementations need to be fully GPU-resident — no CPU hand-offs.

Where there's a will...

Most codecs are designed with ASIC hardware in mind — the dedicated video engines found on modern GPUs and exposed through Vulkan Video. But even ASICs aren't infinitely fast: codecs typically compromise and define a minimum unit of parallelizable work, called a slice or block, representing the smallest chunk that can be processed independently. Most popular codecs were designed decades ago, when video resolutions were far smaller. As resolutions have exploded, those fixed-size minimum units now represent a much smaller fraction of a frame — which means far more of them can be processed in parallel. Modern GPUs have also gained features enabling cross-invocation communication, opening up further optimization opportunities. Together, these trends make it genuinely feasible today to implement certain codecs entirely in compute shaders — no CPU involvement required. Compute-based encoders also have an advantage over ASICs that's easy to overlook: they're unconstrained in memory usage and search time. With enough threads to exhaustively scan each block, matching or even surpassing the quality of software encoders is entirely achievable.

Accessibility

FFmpeg is a free and open source collection of libraries and tools for working with multimedia streams, regardless of format or codec. Whilst famous for its codec implementations with handwritten assembly optimizations across multiple platforms, FFmpeg also provides easy access to hardware accelerators. Crucially, hardware acceleration in FFmpeg is built on top of the software codecs: parsing of headers, threading, scheduling of frames and slices, and error correction/handling all happen in software, and only the decoding of the actual video data is offloaded. This combines robust, well-tested code with hardware acceleration. We can directly translate the independent-frame threading that software implementations do by dispatching multiple frames for parallel decoding to fully saturate a GPU. It also allows users to switch between software and hardware implementations dynamically via a toggle, with no differentiation whether hardware decoding is implemented using Vulkan Video or Vulkan Compute shaders. The widespread usage of FFmpeg in editing software, media players, and browsers, combined with the ability to add hardware accelerator support to any software implementation, makes it an ideal starting point for making compute-based codec implementations widely accessible, rather than confining them to dedicated libraries.

FFv1

FFv1, the FFmpeg Video Codec version 1, has become a staple of the archival community and of applications where lossless compression is required. It's open, royalty-free, and an official IETF standard. The work of implementing codecs in compute shaders in FFmpeg began here. The FFv1 encoder and decoder are very slow to run on a CPU, despite supporting up to 1024 slices. This is partly due to the huge bandwidth needed for high-resolution RGB video, and partly due to the somewhat bottlenecked entropy coding design. FFv1 version 3 was designed over 10 years ago, and it was thanks to the archival community, which adopted it, that it gained wide usage. However, the bottlenecks were making encoding and decoding of high-resolution archival film scans prohibitively time consuming, and it was this need that drove the compute-shader FFv1 encoder and decoder to be written. They started out as conversions of the software encoder and decoder, but were gradually optimized with more and more GPU-specific functions. The biggest challenge when encoding FFv1 is working with the range coder system, which lacks the optimizations that, for example, AV1's range coder has. Each bit of a symbol (a pixel difference value) has its own 8-bit adaptation value, so encoding or decoding a symbol requires randomly looking up 32 contiguous values from a set of thousands (per plane!). We speed this up by using a workgroup size of 32, with each local invocation performing the lookup and adaptation in parallel, while a single invocation performs the actual encoding or decoding. For RGB, a Reversible Color Transform (RCT) is performed to decorrelate pixel values further. Originally, a separate shader was used for this, which encoded to a separate image; however, the bandwidth required to do this for very high resolution images outweighed the advantages. Since only 2 lines are needed at a time to encode or decode, we instead allocate width*horizontal_slices*2 lines of storage, and perform the RCT ahead of encoding each line with the help of the 32 helper invocations.
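For illustration, here is a minimal Python sketch of a reversible color transform of this family — the JPEG 2000-style integer RCT, written in lifting form. FFv1's exact transform may differ in details, so treat this as an example of the technique rather than the codec's precise math.

```python
def rct_forward(r, g, b):
    """Lossless decorrelation of RGB: two chroma differences plus a
    luma-like value. Integer-only, so it round-trips exactly."""
    cb = b - g
    cr = r - g
    y = g + ((cb + cr) >> 2)  # arithmetic shift == floor division by 4
    return y, cb, cr

def rct_inverse(y, cb, cr):
    """Exact inverse: recover g first, then r and b from the differences."""
    g = y - ((cb + cr) >> 2)
    return cr + g, g, cb + g  # (r, g, b)
```

Because every step is an exact integer operation, the transform is lossless for any input range, which is what makes it usable in an archival codec.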

APV

APV is a new codec designed by Samsung to serve as a royalty-free, open alternative for mezzanine video compression. Recently, it too became an IETF standard. It's gaining traction in the VFX and professional media production communities, as well as a camera recording format in smartphones. Unlike most codecs mentioned in this article, APV was designed for parallelism from the ground up. Similar to JPEG, each frame is subdivided into components, and each component is subdivided into tiles, with each tile containing multiple blocks. Each block is simply transformed, quantized via a scalar quantizer (simple division), and encoded via variable-length codes. There is not even any DC prediction. To implement it as a compute shader, we handle the entropy decoding of each tile in one shader, then run a second shader that transforms a single block row per invocation.
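A scalar quantizer really is just per-coefficient division, which is why every value, block, and tile is independent. A minimal sketch (our own toy, not APV's exact quantizer, which also uses per-frequency scaling):

```python
def quantize_block(coeffs, qstep):
    """Scalar quantization: integer division per coefficient, truncating
    toward zero. Each value is independent, so a shader can assign one
    invocation per coefficient with no cross-block communication."""
    return [c // qstep if c >= 0 else -((-c) // qstep) for c in coeffs]

def dequantize_block(levels, qstep):
    """Reconstruction is a single multiply per value; the rounding error
    introduced by quantization is the (only) source of loss."""
    return [lvl * qstep for lvl in levels]
```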

ProRes

ProRes is the de-facto standard mezzanine codec, used for editing, camera footage, and mastering. It's a relatively simple codec, similar to JPEG and APV, which made it possible to implement a decoder and, due to popular demand, an encoder. For decoding, we follow essentially the same process as with APV. For encoding, however, we do proper rate control and estimation by running a shader to find which quantizer makes a block fit within the frame's bit budget. Unfortunately, unlike the other codecs on this list, the ProRes codecs are neither royalty-free nor openly specified; the implementations in FFmpeg are unofficial. But due to their sheer popularity, such implementations are necessary for interoperability with much of the professional world. Nevertheless, the developers dogfood the implementations, and their output is monitored to match that of the official implementations.
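The quantizer search described above can be sketched as a bisection: coded size shrinks monotonically as the quantizer grows, so the smallest quantizer that fits the budget can be found in O(log q) probes. The cost model below (a signed exp-Golomb-style size estimate) is an assumption for illustration, not ProRes's real entropy coder.

```python
def bits_for(level):
    """Toy cost model: assume a signed exp-Golomb-style code costing
    2*bit_length(|level|) + 1 bits. Illustrative only."""
    return 2 * abs(level).bit_length() + 1

def block_cost(coeffs, q):
    """Bits needed to code a block's coefficients at quantizer q."""
    return sum(bits_for(c // q if c >= 0 else -((-c) // q)) for c in coeffs)

def fit_quantizer(coeffs, budget, qmax=64):
    """Binary-search the smallest quantizer whose coded size fits the
    budget; cost is non-increasing in q, so bisection applies. This
    mirrors the idea of a shader sweeping quantizers per block."""
    lo, hi = 1, qmax
    while lo < hi:
        mid = (lo + hi) // 2
        if block_cost(coeffs, mid) <= budget:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

On a GPU the probes for many blocks run concurrently, so the whole frame's rate control amounts to a handful of dispatches.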

ProRes RAW

ProRes RAW features a bitstream that shares little in common with ProRes, because it was made for compressing RAW (not debayered) lossy sensor data. It uses a DCT performed on each component, and a coefficient coder that predicts DC values across components and efficiently encodes AC values from multiple components in a normal zigzag order. The entropy coding system is not exactly a traditional variable-length code, but closer to exponential coding. Slices contain multiple blocks, and each component can be decoded in parallel. Unlike FFv1, there is no limit on the number of tiles per image, which can mean decoding hundreds of thousands of independent blocks. This is great for parallelism, leading to efficient implementations. The decoder was implemented in a 2-pass approach: the first shader decodes each tile, and the second transforms all blocks within each tile with row/column parallelism (referred to as a shred configuration, as it can fully saturate a GPU's workgroup size limit).
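The "normal zigzag order" mentioned above is the classic anti-diagonal scan shared by JPEG-family codecs; a small generator makes the pattern explicit (our own sketch, with (row, column) indexing):

```python
def zigzag_order(n=8):
    """Generate the classic zigzag scan for an n x n coefficient block:
    anti-diagonals visited in alternating direction, so low-frequency
    coefficients come first and trailing zeros cluster at the end."""
    order = []
    for s in range(2 * n - 1):  # s = row + col indexes each anti-diagonal
        diag = [(s - j, j) for j in range(s + 1) if 0 <= s - j < n and j < n]
        order.extend(diag[::-1] if s % 2 else diag)
    return order
```

An encoder walks coefficients in this order so that run-length and exponential codes see long tails of near-zero AC values.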

DPX

DPX is not a codec, but rather a packed-pixel container with a header. It's an official SMPTE standard, and rather popular with film scanners. Rather than being optimally laid out with tightly packed pixels, it can pack pixels into 32-bit chunks, padding if needed. Or it can... not pack pixels at all, depending on a header switch. Being an uncompressed format with loose rules, designed decades ago, it is rife with vendors interpreting the specification rather creatively, in ways that completely break decoding. Thankfully, there's a text "producer" field in the header for such implementations to sign their artistry with, which can be used to figure out how to correctly unpack the data without seeing alien rainbows. All of this comes down to just writing heuristics in shaders. The overhead is never the arithmetic needed to locate a collection of pixels, but actually pulling the data from memory and writing it elsewhere.
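As a concrete example of the "32-bit chunks, padding if needed" layout: 10-bit DPX commonly stores three components in the high 30 bits of each 32-bit word, with 2 padding bits at the bottom. The sketch below assumes that one common ("filled") arrangement; real files vary in endianness, packing method, and vendor creativity, which is exactly what the heuristics handle.

```python
def unpack_dpx_10bit(word):
    """Unpack three 10-bit components from one 32-bit word, assuming a
    common 'filled' DPX layout: components in the top 30 bits, 2 padding
    bits at the bottom. An assumption for illustration, not all of DPX."""
    return ((word >> 22) & 0x3FF,   # component 0
            (word >> 12) & 0x3FF,   # component 1
            (word >>  2) & 0x3FF)   # component 2

def pack_dpx_10bit(c0, c1, c2):
    """Inverse: place three 10-bit values above the 2 padding bits."""
    return (c0 << 22) | (c1 << 12) | (c2 << 2)
```

In a shader this is one 32-bit load plus three shift/mask operations per pixel triple — confirming that memory traffic, not arithmetic, dominates.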

VC-2

VC-2 is another mezzanine codec. Authored by the BBC and based on its Dirac codec, it is royalty-free, with official SMPTE specifications. Its primary use case was real-time streaming, particularly fitting high-resolution video over a gigabit connection with sub-frame latency. Unlike APV or ProRes, it is based on wavelet transforms. Each frame is subdivided into power-of-two-sized slices. Wavelets are rather interesting as transforms: they subdivide a frame into a quarter-resolution image, plus 3 more quarter-resolution images as residuals. Unlike DCTs, they are highly localized, which means they can be performed individually on each slice, yet when assembled they behave as if the entire frame had been transformed. This eliminates the blocking artifacts that all DCT-based codecs suffer from. It also means they're less efficient to encode, as their frequency decomposition is compromised, and their distortion characteristics are substantially less visually appealing than the blurring of DCTs, which was one of the main reasons they failed to gain traction in post-2000s codecs. The resulting coefficients are encoded via simple interleaved exponential-Golomb codes, which, while not parallelizable, can be beautifully simplified in a decoder to remove all bit-parsing and instead operate on whole bytes.
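The simplest wavelet filter in VC-2's family is the integer Haar filter, shown here as a 1-D lifting step (our sketch, not VC-2's exact normalization). Applying it to all rows and then all columns of a slice yields the quarter-resolution low-pass band plus the three residual bands described above.

```python
def haar_fwd(x):
    """One lifting step of the integer Haar filter: each pair (a, b)
    becomes a low-pass average-like value s and a high-pass difference d.
    Integer-only, so reconstruction is exact (lossless)."""
    lo, hi = [], []
    for a, b in zip(x[0::2], x[1::2]):
        d = b - a
        lo.append(a + (d >> 1))
        hi.append(d)
    return lo, hi

def haar_inv(lo, hi):
    """Exact inverse of the lifting step above."""
    x = []
    for s, d in zip(lo, hi):
        a = s - (d >> 1)
        x += [a, d + a]
    return x
```

Because the filter only ever touches a local pair of samples, each power-of-two slice can be transformed by an independent workgroup.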

JPEG

JPEG, the codec given as an example at the start, turns out to admit a very interesting attack that opens the door not only to parallelization, but to parallelizing arbitrary data compression standards such as DEFLATE. The idea is that although VLC streams offer no built-in way to parallelize, VLC decoders (in fact, decoders for all codes satisfying the Kraft–McMillan inequality) spontaneously resynchronize: after a surprisingly short delay, a decoder started at the wrong bit offset tends to begin outputting valid data. All that's needed is to run 4 shaders to gradually synchronize the starting points within each JPEG stream. JPEG has multiple variants too, such as the progressive and lossless profiles, which can be parallelized to the same extent. DC prediction can be done via a parallel prefix sum, which is among the most common operations performed in compute shaders. DCTs can be done via a shred configuration, as with other codecs.
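The prefix-sum step is worth making concrete: once all DC deltas are decoded, turning them back into absolute DC values is an inclusive scan, which a GPU performs in log2(n) rounds instead of n serial additions. Here is a Python sketch of the Hillis–Steele scan, where each loop iteration models one data-parallel round across all invocations:

```python
def prefix_sum_parallel(deltas):
    """Hillis-Steele inclusive scan: turns serially coded DC deltas back
    into absolute DC values. Each round doubles the reach of every
    element; the list comprehension reads only the previous round's
    values, modeling the double-buffering a shader would use."""
    vals = list(deltas)
    step = 1
    while step < len(vals):
        vals = [v + (vals[i - step] if i >= step else 0)
                for i, v in enumerate(vals)]
        step *= 2
    return vals
```

For a frame with tens of thousands of blocks, ceil(log2(n)) rounds means a handful of dispatches replaces a fully serial pass.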

Future

With the release of FFmpeg 8.1, we've implemented FFv1 encoding and decoding, ProRes encoding and decoding, ProRes RAW decoding, and DPX unpacking. GPU-based processing is used automatically whenever Vulkan-accelerated decoding is enabled. The VC-2 encoder and decoder, along with the JPEG and APV decoders, are still in progress and need additional work before they can be merged. Looking further ahead, the only remaining codecs with meaningful GPU acceleration potential are JPEG2000 and PNG — the rest either have limited practical use cases or don't benefit from compute-based acceleration. Unfortunately, JPEG2000 — and by extension JPEG2000HT — is unlike most modern codecs, burdened with the worst features of several combined: a semi-serialized coding system that requires extensive domain knowledge and a bitstream complex enough to give most modern bureaucracies pause. Software decoding of JPEG2000 ranks among the slowest of all widely used codecs, owing to its ASIC-centric design and under-engineered arithmetic coder. Despite all this, it remains the primary codec used in digital cinema, medicine, and forensics. PNG acceleration is an open question: its viability as a GPU target will depend on how effectively DEFLATE can be parallelized.

Vulkan Compute

Vulkan is often pigeonholed as a graphics API with added compute, but that framing is outdated. Its compute capabilities have evolved to match, and in some cases exceed, dedicated compute APIs. Modern Vulkan offers pointers, extensive subgroup operations, shared memory aliasing, native bitwise operations, a well-defined memory model, shader specialization, 64-bit addressing, and direct access to GPU matrix units. Together, these features let programmers optimize at a lower level than more abstracted APIs allow. Even so, the Vulkan Compute API is not yet near its full potential, as it doesn't expose the full capabilities of SPIR-V™, which as an intermediate representation is remarkably expressive. Support for the broader SPIR-V feature set is actively expanding: untyped pointers and 64-bit addressing are already available, and support for bitwise operations on non-32-bit integer types is on the way. Competing compute APIs from GPU vendors often bundle hundreds of specialized and specifically optimized algorithm implementations, accessible through more comfortable programming languages — a tempting package. The catch, of course, is vendor lock-in, which is a serious concern for portable, long-lived software like FFmpeg. FFmpeg is no stranger to writing its own implementations of popular algorithms to avoid dependencies, such as hashing functions, sorting algorithms, CRCs, or frequency transforms. But then, are extensive, object-oriented APIs actually necessary? Often, formatting data for consumption by common implementations takes longer and produces less optimal code than simply writing a small implementation of an algorithm specialized for the use case at hand. Object orientation can in many cases be handled by simply templating via a preprocessor, and linking multiple pieces of code can be just an #include.
And fragile code that targets a single version of a vendor's API, which in turn depends on a specific old gcc version, can be replaced by a reliable, lasting, self-sufficient shader. Vulkan is ubiquitous, from tiny SoCs to tablets, embedded GPUs, discrete GPUs, and professional server GPUs, and its industry-led governance model creates strong incentives to support new extensions broadly. Constant automated testing is performed using a comprehensive conformance test suite. Lastly, Vulkan enjoys a broad ecosystem of debugging, optimization, and profiling tools, and a large global developer community means that almost any GPU quirk or optimization trick you discover has already been found, documented, and fed back into the specification. Whether through Vulkan Video or Vulkan compute shaders, Vulkan has become a compelling API for accessing GPU-accelerated video processing.

FFmpeg download: https://ffmpeg.org/download.html

Khronos® and Vulkan® are registered trademarks, and SPIR-V™ is a trademark of The Khronos Group Inc. OpenCL™ is a trademark of Apple Inc. used under license by Khronos. All other product names, trademarks, and/or company names are used solely for identification and belong to their respective owners.