Magenta RealTime 2：开放且本地化的实时音乐模型

Magenta RealTime 2：开放且本地化的实时音乐模型
Magenta RealTime 2: Open and Local Live Music Models

原始链接: https://magenta.withgoogle.com/magenta-realtime-2

**Magenta RealTime 2 (MRT2)** 是一套流式音频生成系统。它将前代系统的分块处理方式升级为帧级自回归，从而将系统响应延迟从两秒缩短至约 40 毫秒，实现了近乎即时的音乐控制。在架构上，MRT2 采用了解码器（decoder-only）模型，结合滑动窗口注意力机制与可学习的注意力汇聚点（attention sink）。这消除了双向编码器的顺序瓶颈，确保了系统随时间推移的性能稳定性。为实现灵活且灵敏的交互，该模型支持多信号条件控制，包括音频/文本风格提示、基于 MIDI 的音符控制以及鼓点切换。主要技术特点包括： * **精确控制：** 通过流式交叉注意力机制，在每一帧中集成条件控制信号。 * **动态引导：** 采用多信号无分类器引导（CFG）框架，允许用户平衡不同输入的影响力。 * **推理时掩码：** 训练阶段的随机掩码使模型在推理时能够动态切换受限（精确）模式与创意（生成）模式。 * **位置灵活性：** 通过以因果掩码和滑动窗口注意力取代标准位置编码，模型能有效泛化至不同长度的序列，且不会出现性能下降。总而言之，MRT2 为实时生成式音乐表演提供了一个响应迅速且高度可控的框架。

抱歉。

Low-latency streaming generation

Some background on Codec Language Modeling. A codec language model (LM) operates on discrete sequences of tokens from a neural audio codec. Here a codec refers to a pair of functions, an encoder and decoder, that convert audio to and from a discrete, compressed representation while minimizing distortion.

More formally, the encoder is a function mapping raw stereo audio waveforms \(\textbf{a} \in \mathbb{R}^{T f_s \times 2}\) into matrices of discrete tokens \(\mathbf{x} \in \mathbb{V}_c^{Tf_k \times d_c}\) where \(T\) is the duration in seconds, \(f_s\) the audio sampling rate, \(f_k\) the token frame rate, \(\mathbb{V}_c\) the codec vocabulary, and \(d_c\) is the number of tokens per frame. In this case, \(d_c\) refers to the “depth” of the residual vector quantization algorithm, referring to the iterative quantization of continuous embeddings of each audio frame.

The goal of the codec LM is to model these token matrices. For efficiency, an increasingly common approach is to adopt a hierarchical autoregressive framework using a pair of Transformers: one which compresses temporal history into fixed-length embedding vectors (\(\texttt{Temporal}_\theta\)), and another which iteratively decodes tokens depth-wise given the current frame embedding (\(\texttt{Depth}_\phi\)). Assuming \(\mathbf{x_i}\) refers to the \(i\)-th frame of \(\mathbf{x}\), and \(x_i^j\) refers to its \(j\)-th token, the joint distribution over \(x\) is modeled autoregressively as: \[ P_{\theta,\phi}(\mathbf{x}) = \prod_{i=1}^{Tf_k} \prod_{j=1}^{d_c} P_\phi(x_i^j | \mathbf{x_i^{<j}}, \texttt{Temporal}_{\theta}(\mathbf{x_{<i}})), \] where \(P_\phi(x_i^j \mid \cdot) = \texttt{SoftMax}(\texttt{Depth}_\phi(\cdot))\).

At inference time, we generate audio by first sampling a token sequence \(\mathbf{x’} \sim P_{\theta,\phi}(\mathbf{x})\) and then outputting \(\mathbf{a}’ = \texttt{Dec}(\mathbf{x}’)\), where \(\texttt{Dec}\) is the codec decoder. This describes our base modeling approach, shared with Magenta RealTime. For our codec, we use SpectroStream to compress high fidelity (\(f_s = 48\) kHz) stereo audio into tokens at \(3\) kbps (\(f_k = 25\) Hz, \(d_c = 12\), \(|\mathbb{V}_c| = 2^{10}\)).

Lowering autoregression granularity: from chunk to frame. To achieve streaming audio generation, we need to enforce two constraints:

The system must generate at least \(f_k \cdot d_c\) tokens per second
The decoder must be causal, meaning its output audio for frame \(i\) only depends on \(\mathbf{x_{\leq i}}\)

In the original Magenta RealTime, we satisfied requirement (1) by performing autoregression on chunks of frames, where each chunk is 2 seconds in duration. This design was chosen to amortize model runtime over chunk length to achieve real-time streaming. However, because the system must wait until the next chunk to inject any new user control information, the chunk duration creates a lower bound on control delay, resulting in a response time of 2 seconds at a minimum. Instead, Magenta RealTime 2 models individual frames, allowing us to reduce model response time significantly. To ensure continuous streaming generation while operating on single frames, we adopt a decoder-only architecture, using a local sliding window attention (SWA) in the temporal Transformer.

This has two key advantages: (1) the decoder-only architecture allows us to remove the sequential bottleneck introduced by the bidirectional encoder in Magenta RealTime, where the full encoder output has to be materialized before decoding can begin; (2) the rolling attention mechanism allows us to extend the context length while keeping the KV cache size fixed. At each step of the autoregressive generation, key-value entries for new tokens are written into the cache, and entries older than the window size w are evicted:

Similarly to previous work, we find that using a sliding window attention causes the model to significantly deteriorate when initial tokens are evicted from the cache. To remediate this, we make use of a learnable attention sink embedding. In order to reconcile the finite training length with the receptive field induced by the SWA mechanism, we also take care to set the attention window size such that this effective receptive field does not exceed the training crop length. Finally, we further reduce train/test mismatch and achieve better length generalization by dropping learnable positional embeddings (NoPE), after observing that RoPE hinders generalization beyond the training length. Instead, the model implicitly learns positional information by relying on causal masking and SWA, which naturally extend to arbitrary-length sequences without extrapolation issues.

Putting all this together, our model presents significant architectural differences compared to the previous version:

Model	Magenta RealTime	Magenta RealTime 2
Autoregressive unit	2-second chunks (25 frames × 16 RVQ = 400 tokens)	Individual frames (12 RVQ tokens at 25 Hz = 40 ms)
Architecture	T5-style bidirectional encoder + causal decoder; encoder processes the full chunk of conditioning before decoding begins	Decoder-only; conditioning is injected at every frame, with no encoder forward pass as a sequential bottleneck
Minimum control delay	≥ 2 s (next chunk boundary)	~0.2 s (frame processing + depth decode + codec decode). See full latency diagram

Precise control through frame-by-frame conditioning

A central feature of MRT2 is responsive, multi-signal control: in addition to style control expressed through audio or text, MRT2 also supports note and drums on/off control. This is achieved by modeling the conditional distribution \(P_{\theta,\phi}(\mathbf{x} | \mathbf{c})\), where \(\mathbf{c} = (\mathbf{c}_{style}, \mathbf{c}_{notes}, \mathbf{c}_{drums})\) is formed by tokenized representations of all conditioning signals at the audio frame rate (25 Hz), concatenated together into a single conditioning vector per frame. This vector is then mapped to a multi-channel embedding and injected into the temporal decoder through streaming cross-attention, enabling the model to react to changes in any signal within a single frame (~40 ms).

At inference we enable flexible joint guidance by extending the classifier-free guidance (CFG) approach in Magenta RealTime to multiple signals. This allows us to balance the contribution of each conditioning signal separately and according to the desired level of adherence, while also supporting unconditional generation for any subset of controls.

Style control through audio and text. Similarly to Magenta RealTime, MRT2 can also be steered through audio and text via quantized MusicCoCa embeddings. During training, we freeze the embeddings associated with the MusicCoCa tokens instead of learning them from scratch. The goal is to leverage the rich, pre-trained semantic representations coming from the Residual Vector Quantizer (RVQ). By keeping these embeddings frozen, we ensure the generative model receives stable semantic embeddings, which significantly improves prompt adherence at inference time. While MusicCoCa provides a joint embedding space between text and audio, the underlying distributions associated with both modalities do not match exactly. This creates a train-test mismatch during inference, as the model has only been trained on audio embeddings, but receives text embeddings during inference. To bridge this gap, we train a generative model from which we can sample diverse audio embeddings given an input text embedding, learning the one-to-many relationship between a single text prompt and multiple valid audio signals. To ensure high performance, we employ a pixel Mean Flow (pMF) formulation, enabling high-quality one-step inference. Finally, training this mapper module on a mix of short tags and long-form captions provides flexible style control, ranging from simple tag-style inputs to highly detailed text descriptions.

Note control. We enable note control by training on (audio, MIDI) pairs. Note activity is encoded as a 128-channel pianoroll – one channel per MIDI pitch – at the audio frame rate (25 Hz). The model is trained on around 71k hours of mostly instrumental stock music from a variety of sources, with MIDI labels inferred by the MT3 transcription model. We structure the per-pitch token vocabulary to support two control modes at inference. In Auto-Strum mode, the user specifies only which pitches are active at each frame, and the model determines where to place note onsets. In Auto-Strum OFF mode, the user can additionally specify the exact timing of each note onset, giving precise attack-level control. This is achieved through a 4-token vocabulary that distinguishes between note off, generic note on, note onset, and note continuation. When Auto-Strum is off, the model receives onset and continuation tokens directly, and respects the specified attack timing. When Auto-Strum is on, onset information is replaced with an onset mask token, and the model freely chooses when to place attacks based on the active pitch information alone. To support both modes with a single model, we employ onset masking, a training-time augmentation that stochastically replaces the onset and continuation tokens of randomly selected notes with the onset mask token. This trains the model to generate musically plausible attacks when no explicit onset information is present, while faithfully following onset cues when they are provided.

Drums on/off control. The note conditioning described so far gives us control over the melodic and harmonic content of the generated audio, but leaves us with no mechanism to control the presence of percussive elements. As a result, the model can arbitrarily include drums as part of the generated audio whenever this is admissible by the style conditioning (e.g. “jazz”). This can often be undesirable if, for example, the model is played alongside other instruments or as part of a multi-track session (e.g. in a DAW). For this reason, it’s useful to optionally switch off drum generation through an explicit control. We enable this through an additional conditioning signal: at training time, we pass a frame-wise sequence of drum hits obtained by transcribing drum stems from each training example using OaF Drums. While this trains the model to respond to drum hits, we find that direct drum control is infeasible in practice, given the end-to-end response time. Instead, we leverage this control purely for switching between drum-unconditional and drumless generation, using the same multi-guidance CFG as the other signals.

Inference-time masking as creative control. Beyond providing a set of control signals to guide generation, it is crucial to have a way to compose and modulate them. We accomplish this through selective input masking coupled with CFG scales, a technique that allows us to flexibly define playing modes at inference. More specifically, we introduce a masking scheme designed to accomplish two complementary goals: (1) strengthen the model’s ability to follow the controls while remaining robust to noisy or missing inputs, (2) enable partially unconditional generation as a form of creative control. During training, we stochastically mask contiguous regions of each conditioning signal independently, varying both the masking probability and spatial scale. We find that this results in better adherence to the inputs when they are specified. Importantly, this augmentation implicitly trains the model to interpret masked regions as unspecified, opening up a new dimension of creative interaction at inference. The Auto-Strum mode described above in the Note Control section is one such example. Similarly, we employ masking over the pitch dimension of the pianoroll to give the model more or less “creative freedom” over which pitches can be active. For example, masking all pianoroll pitches except those currently pressed allows the model to freely add harmonies or embellishments, while explicitly setting neighboring pitches to “off” (silent) constrains it to play only the input notes.

Real-world control latency. While we have significantly reduced the model frame size (from 2s to 40ms) compared to the previous generation, inference time isn’t the only source of latency. Below we give a sketch of end-to-end reaction time, taking into account input and output buffers, alongside additional sources of latency introduced by external components.