Stable Audio Demo

Original link: https://stability-ai.github.io/stable-audio-demo/

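In essence, stable-audio.com showcases an AI model that generates high-quality, variable-length stereo music and sound effects at 44.1 kHz, going beyond previous state-of-the-art models. Users can listen to music and sound effects generated from text prompts such as "Berlin techno" or "disco". The samples show the model's ability to create complex and varied tracks, as well as its spatial capabilities, with prominent and distinctive stereo movement. The page compares Stable Audio with other popular models such as Audiogen and AudioLDM2, and illustrates its audio fidelity by setting autoencoder reconstructions against ground truth recordings. Overall, the platform represents a significant step forward in generating rich, immersive musical experiences with AI.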


⚠️ Warning: This website may not function properly on Safari. For the best experience, please use Google Chrome.

arXiv: Stable Audio’s paper

stable-audio-tools: code to reproduce Stable Audio

stable-audio-metrics: code to evaluate Stable Audio

Our model can generate variable-length and long-form stereo music at 44.1kHz:

Generated Stereo Music	Prompt
	Berlin techno, rave, drum machine, kick, ARP synthesizer, dark, moody, hypnotic, evolving, 135 BPM. Loop.
	Uplifting acoustic loop. 120 BPM.
	Disco, Driving Drum Machine, Synthesizer, Bass, Piano, Guitars, Instrumental, Clubby, Euphoric, Chicago, New York, 115 BPM.
	Calm meditation music to play in a spa lobby.
	Drum solo.

Unlike previous state-of-the-art models, ours can generate stereo sound effects at 44.1kHz:

Generated Stereo Sounds	Prompt
	Door slam. High-quality, stereo.
	Sports car passing by. High-quality, stereo.
	Motorbike passing by. High-quality, stereo.
	Fireworks. High-quality, stereo.
	Reverberant footsteps inside a large rocky cave. High-quality, stereo.

Note that all the examples on this website are generated with the same model, which can generate both variable-length music and sound effects in stereo at 44.1kHz. We append “high-quality, stereo” to our sound effects prompts because it is generally helpful.
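The prompt convention described above can be sketched as a small helper. This is only an illustration of appending the suffix to a sound-effect description: the function name `build_sfx_prompt` and the constant `SFX_SUFFIX` are hypothetical and are not part of stable-audio-tools or any other library.

```python
# Suffix taken from the sound-effect prompts listed on this page.
# The constant and function names are hypothetical, for illustration only.
SFX_SUFFIX = "High-quality, stereo."


def build_sfx_prompt(description: str) -> str:
    """Return a sound-effect prompt that ends with the suffix above."""
    text = description.strip()
    if text.lower().endswith(SFX_SUFFIX.lower()):
        return text  # the suffix is already present, leave the prompt unchanged
    if not text.endswith((".", "!", "?")):
        text += "."  # close the description before adding the suffix
    return f"{text} {SFX_SUFFIX}"


if __name__ == "__main__":
    for prompt in ["Door slam", "Fireworks.", "Motorbike passing by. High-quality, stereo."]:
        print(build_sfx_prompt(prompt))
```

The helper is idempotent, so a prompt that already carries the suffix is not modified a second time.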

Long-form stereo music: comparison with state-of-the-art with MusicCaps prompts

Prompt: This song contains someone strumming a melody on a mandolin while more people are whistling along. Then a mandolin, an e-bass and an acoustic guitar are playing a short melody in a lower key before breaking into the next part along with flutes and percussions. This song may be played outside by musicians performing.

Our Model	MusicGen-large	MusicGen-stereo	AudioLDM2
(stereo, 44.1kHz)	(mono, 32kHz)	(stereo, 32kHz)	(mono, 48kHz)

Prompt: The commercial music features a groovy piano melody played over snare rolls in the first half of the loop. Right after, there is a drop that consists of a punchy “4 on the floor” kick pattern, shimmering hi hats, claps, groovy piano and wide synth lead melody. It sounds happy, fun, euphoric and exciting.

Our Model	MusicGen-large	MusicGen-stereo	AudioLDM2
(stereo, 44.1kHz)	(mono, 32kHz)	(stereo, 32kHz)	(mono, 48kHz)

These prompts/audios were used for the qualitative study we report in our paper.
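To put the output formats listed in the comparison into perspective, the sketch below computes how many raw sample values each format carries per second of audio (sample rate times channel count). The sample rates and channel counts are the ones shown in the table headers above; the `AudioFormat` class is a hypothetical helper for this arithmetic and says nothing about how the models work internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudioFormat:
    """Hypothetical container for an output format listed in the comparison."""
    name: str
    sample_rate_hz: int
    channels: int

    def samples_per_second(self) -> int:
        # One sample value per channel per sampling instant.
        return self.sample_rate_hz * self.channels

    def total_samples(self, seconds: float) -> int:
        # Frames are rounded to a whole number before multiplying by channels.
        return round(self.sample_rate_hz * seconds) * self.channels


FORMATS = [
    AudioFormat("Our Model", 44_100, 2),
    AudioFormat("MusicGen-large", 32_000, 1),
    AudioFormat("MusicGen-stereo", 32_000, 2),
    AudioFormat("AudioLDM2", 48_000, 1),
]

if __name__ == "__main__":
    for fmt in FORMATS:
        print(f"{fmt.name}: {fmt.samples_per_second()} sample values per second")
```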

Sound effects: comparison with state-of-the-art with AudioCaps prompts

Prompt: Clicking and sputtering then eventual revving of an idling engine.

Our Model	Audiogen-medium	AudioLDM2
(stereo, 44.1kHz)	(mono, 32kHz)	(mono, 48kHz)

Prompt: Birds chirping loudly.

Our Model	Audiogen-medium	AudioLDM2
(stereo, 44.1kHz)	(mono, 32kHz)	(mono, 48kHz)

These prompts/audios were used for the qualitative study we report in our paper. Note that the (randomly) selected prompts from AudioCaps did not require substantial stereo movement, resulting in renders that are relatively non-spatial.

Autoencoder: reconstructions

This comparison is useful for evaluating the audio fidelity of the autoencoder. On the left, we have the ground truth recording. On the right, we take the ground truth recording and pass it through the autoencoder. Note that the autoencoder reconstruction is fairly transparent, very close to the ground truth.

Ground truth	Autoencoder reconstruction
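Listening is the point of the comparison above, but a reader who wants a rough number for "close to the ground truth" could compute a waveform signal-to-noise ratio between the two signals. The sketch below is an assumption for illustration, not the evaluation used in the paper or in stable-audio-metrics; waveform SNR is sensitive to small time shifts and does not measure perceptual quality. The function name `snr_db` and the synthetic test tone are hypothetical.

```python
import math
from typing import Sequence


def snr_db(reference: Sequence[float], reconstruction: Sequence[float]) -> float:
    """Signal-to-noise ratio in dB of a reconstruction against its reference."""
    if len(reference) != len(reconstruction):
        raise ValueError("signals must have the same number of samples")
    signal_power = sum(x * x for x in reference)
    noise_power = sum((x - y) ** 2 for x, y in zip(reference, reconstruction))
    if noise_power == 0:
        return math.inf  # identical signals
    if signal_power == 0:
        return -math.inf  # silent reference, any error dominates
    return 10 * math.log10(signal_power / noise_power)


if __name__ == "__main__":
    sample_rate = 44_100
    # Synthetic stand-in for a recording: one second of a 440 Hz tone.
    reference = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
    # Stand-in for a reconstruction: the same tone, 1% quieter.
    reconstruction = [0.99 * x for x in reference]
    print(f"SNR: {snr_db(reference, reconstruction):.1f} dB")
```

For stereo material, one option is to apply the function to each channel separately and report both values.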