High-Fidelity Simultaneous Speech-to-Speech Translation

Original link: https://arxiv.org/abs/2502.03382

Hibiki, a new decoder-only model, advances simultaneous speech translation by processing source and target speech together in a multistream language model, generating text and audio tokens concurrently for speech-to-text and speech-to-speech translation. To address the core challenge of real-time interpretation, Hibiki is trained with a novel weakly-supervised method: the perplexity of an off-the-shelf text translation system identifies optimal per-word delays, producing aligned synthetic training data that mimics the adaptive behavior of human interpreters. This lets Hibiki translate chunk by chunk, adjusting its flow to accumulate just enough context. After training, Hibiki performs adaptive simultaneous speech translation, achieving state-of-the-art translation quality, speaker fidelity, and naturalness on a French-English task. Its simple inference process supports batched translation and even on-device, real-time deployment. The authors provide examples, models, and inference code.
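The per-word delay idea can be illustrated with a minimal sketch: for each target word, search for the shortest source prefix after which a translation model scores that word confidently, and record that position as the word's delay. This is an assumption-laden illustration, not the paper's actual pipeline; `word_logprob` here is a stand-in stub, whereas the paper uses the perplexity of a real off-the-shelf MT system.

```python
def word_logprob(source_prefix, target_prefix, target_word):
    """Stand-in scorer (hypothetical): pretends the model becomes
    confident once the source prefix covers at least as many words
    as the target prefix plus the word being scored."""
    needed = len(target_prefix) + 1
    return 0.0 if len(source_prefix) >= needed else -10.0

def align_delays(source_words, target_words, threshold=-1.0):
    """For each target word, return the index of the earliest source
    word after which the scorer's log-probability clears `threshold`.
    Falls back to the full source context if no prefix suffices."""
    delays = []
    for t_idx, word in enumerate(target_words):
        chosen = len(source_words) - 1  # fall back to full context
        for s_idx in range(len(source_words)):
            lp = word_logprob(source_words[:s_idx + 1],
                              target_words[:t_idx], word)
            if lp >= threshold:
                chosen = s_idx
                break
        delays.append(chosen)
    return delays

# With the stub scorer, each target word is released one source word
# at a time, yielding monotonically increasing delays.
print(align_delays(["a", "b", "c"], ["x", "y", "z"]))  # → [0, 1, 2]
```

The resulting delays can then be used to time-align synthetic target speech against the source, which is the sense in which the supervision is "weak": no human-annotated alignments are required.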

A Hacker News discussion revolves around "Hibiki," a new high-fidelity simultaneous speech-to-speech translation system (French to English initially). Users are impressed but raise concerns about grammatical differences across languages, potential delays, and the impact on translator jobs. Some see it as a powerful tool that could increase cultural interaction by removing language barriers, while others lament the potential for superficial cultural understanding if language learning is abandoned. There is debate over whether AI can truly capture cultural nuance and context in translation, with some arguing that human interpreters will still be needed. Users also discuss the value of language learning beyond practical translation. Similar technologies such as Soniox and Yandex Browser are also mentioned.
Original text

High-Fidelity Simultaneous Speech-To-Speech Translation, by Tom Labiausse and 5 other authors

Abstract: We introduce Hibiki, a decoder-only model for simultaneous speech translation. Hibiki leverages a multistream language model to synchronously process source and target speech, and jointly produces text and audio tokens to perform speech-to-text and speech-to-speech translation. We furthermore address the fundamental challenge of simultaneous interpretation: unlike its consecutive counterpart, where one waits for the end of the source utterance to start translating, simultaneous interpretation adapts its flow to accumulate just enough context to produce a correct translation in real time, chunk by chunk. To do so, we introduce a weakly-supervised method that leverages the perplexity of an off-the-shelf text translation system to identify optimal delays on a per-word basis and create aligned synthetic data. After supervised training, Hibiki performs adaptive, simultaneous speech translation with vanilla temperature sampling. On a French-English simultaneous speech translation task, Hibiki demonstrates state-of-the-art performance in translation quality, speaker fidelity, and naturalness. Moreover, the simplicity of its inference process makes it compatible with batched translation and even real-time on-device deployment. We provide examples as well as models and inference code.
From: Neil Zeghidour [view email]
[v1] Wed, 5 Feb 2025 17:18:55 UTC (711 KB)
[v2] Wed, 26 Feb 2025 09:31:58 UTC (711 KB)