Microsoft VibeVoice: Open-Source Frontier Voice AI

Original link: https://github.com/microsoft/VibeVoice

## VibeVoice: Open-Source Frontier Voice AI

VibeVoice is a family of open-source AI models from Microsoft focused on long-form speech tasks, covering both automatic speech recognition (ASR) and text-to-speech (TTS). Its key innovation is the use of continuous speech tokenizers to process long audio and text efficiently: up to 60 minutes for ASR and up to 90 minutes for TTS.

**Key features include:**

* **VibeVoice-ASR:** Transcribes 60 minutes of audio in a single pass, identifying speakers, timestamps, and content, with customizable hotwords to improve accuracy. Now available through Hugging Face Transformers.
* **VibeVoice-TTS:** Generates multi-speaker speech up to 90 minutes long (up to 4 speakers) with natural vocal expressiveness.
* **VibeVoice-Realtime:** A lightweight 0.5B-parameter TTS model for real-time streaming speech generation, supporting nine languages plus a variety of English styles.

A TTS component was initially removed due to misuse. Microsoft emphasizes responsible AI use and warns of risks such as deepfakes and disinformation. VibeVoice is currently intended for research and development only and requires further testing before commercial deployment.

Microsoft has open-sourced "VibeVoice," a voice AI model available on GitHub. The project focuses on speech-to-text (STT/ASR), long-form text-to-speech (TTS), and streaming TTS, extending its original speech-recognition capabilities.

Discussion on Hacker News notes that the project was previously withdrawn over safety concerns and has been re-released with some models removed (notably the original TTS). Users compare VibeVoice with smaller, more efficient models such as Parakeet and Whisper, observing that its potential for higher accuracy and speaker identification comes at the cost of model size. A running joke in the comments centers on the word "vibe" as a descriptor for AI interactions, with some speculating it could be a "word of the year" contender.

2026-03-06: 🚀 VibeVoice ASR is now part of a Transformers release! You can now use our speech recognition model directly through the Hugging Face Transformers library for seamless integration into your projects.

2026-01-21: 📣 We open-sourced VibeVoice-ASR, a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for User-Customized Context. Try it in Playground.

2025-12-16: 📣 We added experimental speakers to VibeVoice‑Realtime‑0.5B for exploration, including multilingual voices in nine languages (DE, FR, IT, JP, KR, NL, PL, PT, ES) and 11 distinct English style voices. Try it. More speaker types will be added over time.

2025-12-03: 📣 We open-sourced VibeVoice‑Realtime‑0.5B, a real‑time text‑to‑speech model that supports streaming text input and robust long-form speech generation. Try it on Colab.

2025-09-05: VibeVoice is an open-source research framework intended to advance collaboration in the speech synthesis community. After release, we discovered instances where the tool was used in ways inconsistent with the stated intent. Since responsible use of AI is one of Microsoft’s guiding principles, we have removed the VibeVoice-TTS code from this repository.

2025-08-25: 📣 We open-sourced VibeVoice-TTS, a long-form multi-speaker text-to-speech model that can synthesize speech up to 90 minutes long with up to 4 distinct speakers. — accepted as an Oral at ICLR 2026! 🔥

VibeVoice is a family of open-source frontier voice AI models that includes both Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) models.

A core innovation of VibeVoice is its use of continuous speech tokenizers (Acoustic and Semantic) operating at an ultra-low frame rate of 7.5 Hz. These tokenizers efficiently preserve audio fidelity while significantly boosting computational efficiency for processing long sequences. VibeVoice employs a next-token diffusion framework, leveraging a Large Language Model (LLM) to understand textual context and dialogue flow, and a diffusion head to generate high-fidelity acoustic details.
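The ultra-low frame rate is what makes hour-scale context feasible. As a back-of-the-envelope sketch (assuming one speech frame per tokenizer step; the exact token accounting is not specified here), 60 minutes of audio at 7.5 Hz yields only 27,000 frames, comfortably within a 64K context:

```python
FRAME_RATE_HZ = 7.5  # continuous speech tokenizer frame rate

def speech_frames(minutes: float, rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames produced for a given audio duration."""
    return int(minutes * 60 * rate_hz)

print(speech_frames(60))   # 27000 frames for a 60-minute ASR input
print(speech_frames(90))   # 40500 frames for a 90-minute TTS output
# For comparison, a raw 24 kHz waveform of the same hour:
print(60 * 60 * 24_000)    # 86400000 samples
```

Three to four orders of magnitude fewer positions than raw samples is what lets a standard LLM context window span an entire recording.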

For more information, demos, and examples, please visit our Project Page.

VibeVoice-ASR is a unified speech-to-text model designed to handle 60-minute long-form audio in a single pass, generating structured transcriptions containing Who (Speaker), When (Timestamps), and What (Content), with support for Customized Hotwords.

  • 🕒 60-minute Single-Pass Processing: Unlike conventional ASR models that slice audio into short chunks (often losing global context), VibeVoice ASR accepts up to 60 minutes of continuous audio input within a 64K-token context window. This ensures consistent speaker tracking and semantic coherence across the entire hour.

  • 👤 Customized Hotwords: Users can provide customized hotwords (e.g., specific names, technical terms, or background info) to guide the recognition process, significantly improving accuracy on domain-specific content.

  • 📝 Rich Transcription (Who, When, What): The model jointly performs ASR, diarization, and timestamping, producing a structured output that indicates who said what and when.
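To make the Who/When/What structure concrete, here is a minimal sketch of consuming such a transcript. The bracketed line format below is a hypothetical illustration only; the model's actual output schema is described in the VibeVoice-ASR documentation:

```python
import re
from typing import NamedTuple

class Utterance(NamedTuple):
    speaker: str  # Who
    start: float  # When (seconds)
    end: float
    text: str     # What

# Hypothetical "[start-end] Speaker N: text" line format, for illustration.
LINE_RE = re.compile(r"\[(\d+\.\d+)-(\d+\.\d+)\]\s*(Speaker \d+):\s*(.+)")

def parse_transcript(raw: str) -> list[Utterance]:
    """Parse structured transcript lines into typed utterances."""
    out = []
    for line in raw.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            out.append(Utterance(m.group(3), float(m.group(1)),
                                 float(m.group(2)), m.group(4)))
    return out

sample = """\
[0.00-4.20] Speaker 1: Welcome to the show.
[4.20-9.85] Speaker 2: Thanks for having me."""
for u in parse_transcript(sample):
    print(u.speaker, u.start, u.end, u.text)
```

Because diarization and timestamps come out jointly with the text, downstream code can index the hour-long recording by speaker and time without a separate alignment pass.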

📖 Documentation | 🤗 Hugging Face | 🎮 Playground | 🛠️ Finetuning | 📊 Paper

Evaluation metrics: DER (Diarization Error Rate), cpWER (concatenated minimum-permutation WER), tcpWER (time-constrained cpWER)

Best for: Long-form conversational audio, podcasts, multi-speaker dialogues

  • ⏱️ 90-minute Long-form Generation: Synthesizes conversational/single-speaker speech up to 90 minutes in a single pass, maintaining speaker consistency and semantic coherence throughout.

  • 👥 Multi-speaker Support: Supports up to 4 distinct speakers in a single conversation, with natural turn-taking and speaker consistency across long dialogues.

  • 🎭 Expressive Speech: Generates expressive, natural-sounding speech that captures conversational dynamics and emotional nuances.

  • 🌐 Multi-lingual Support: Supports English, Chinese and other languages.
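A multi-speaker input can be sketched as a turn-by-turn script. The "Speaker N:" line format below is an assumption for illustration; consult the VibeVoice-TTS documentation for the exact input format:

```python
def build_script(turns: list[tuple[int, str]], max_speakers: int = 4) -> str:
    """Assemble a dialogue script from (speaker_id, text) turns,
    enforcing the model's 4-speaker limit."""
    if len({s for s, _ in turns}) > max_speakers:
        raise ValueError(f"at most {max_speakers} distinct speakers supported")
    return "\n".join(f"Speaker {s}: {text}" for s, text in turns)

script = build_script([
    (1, "Did you hear the news?"),
    (2, "Tell me everything."),
])
print(script)
```

Keeping speaker identity explicit in the script is what lets the model maintain consistent voices and natural turn-taking across a 90-minute conversation.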

📖 Documentation | 🤗 Hugging Face | 📊 Paper

Demo samples: English · Chinese · Cross-Lingual · Spontaneous Singing · Long Conversation with 4 Speakers

VibeVoice-Realtime is a lightweight real‑time text-to-speech model supporting streaming text input and robust long-form speech generation.

  • Parameter size: 0.5B (deployment-friendly)
  • Real-time TTS (~300 milliseconds to first audible audio)
  • Streaming text input
  • Robust long-form speech generation (~10 minutes)
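Streaming text input means synthesis can begin before the full text is available, which is how the ~300 ms first-audio latency is possible even for long passages. A minimal sketch of the producer side, assuming a hypothetical word-chunked feed (the model's real streaming interface is in its documentation):

```python
from typing import Iterator

def stream_text(full_text: str, chunk_words: int = 4) -> Iterator[str]:
    """Yield text in small word chunks, as an upstream producer
    (e.g. an LLM) might emit it; a streaming TTS consumer can start
    synthesizing from the first chunk instead of waiting for the end."""
    words = full_text.split()
    for i in range(0, len(words), chunk_words):
        yield " ".join(words[i:i + chunk_words])

chunks = list(stream_text("The quick brown fox jumps over the lazy dog", 4))
print(chunks)  # ['The quick brown fox', 'jumps over the lazy', 'dog']
```

In a live pipeline, each chunk would be handed to the model as it arrives rather than collected into a list first.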

📖 Documentation | 🤗 Hugging Face | 🚀 Colab

Please see CONTRIBUTING.md for detailed contribution guidelines.

⚠️ Risks and Limitations

While efforts have been made to optimize the models through various techniques, they may still produce outputs that are unexpected, biased, or inaccurate. VibeVoice inherits any biases, errors, or omissions of its base model (specifically, Qwen2.5-1.5B in this release).

  • Potential for Deepfakes and Disinformation: High-quality synthetic speech can be misused to create convincing fake audio for impersonation, fraud, or spreading disinformation. Users must verify that transcripts are reliable, check content accuracy, and avoid using generated content in misleading ways.

Users are expected to deploy the models and use generated content lawfully, in full compliance with all applicable laws and regulations in the relevant jurisdictions. It is best practice to disclose the use of AI when sharing AI-generated content.

We do not recommend using VibeVoice in commercial or real-world applications without further testing and development. This model is intended for research and development purposes only. Please use responsibly.

