Seamless: Multilingual Expressive and Streaming Speech Translation
- URL: http://arxiv.org/abs/2312.05187v1
- Date: Fri, 8 Dec 2023 17:18:42 GMT
- Title: Seamless: Multilingual Expressive and Streaming Speech Translation
- Authors: Seamless Communication, Loïc Barrault, Yu-An Chung, Mariano Coria
Meglioli, David Dale, Ning Dong, Mark Duppenthaler, Paul-Ambroise Duquenne,
Brian Ellis, Hady Elsahar, Justin Haaheim, John Hoffman, Min-Jae Hwang,
Hirofumi Inaguma, Christopher Klaiber, Ilia Kulikov, Pengwei Li, Daniel
Licht, Jean Maillard, Ruslan Mavlyutov, Alice Rakotoarison, Kaushik Ram
Sadagopan, Abinesh Ramakrishnan, Tuan Tran, Guillaume Wenzek, Yilin Yang,
Ethan Ye, Ivan Evtimov, Pierre Fernandez, Cynthia Gao, Prangthip Hansanti,
Elahe Kalbassi, Amanda Kallet, Artyom Kozhevnikov, Gabriel Mejia Gonzalez,
Robin San Roman, Christophe Touret, Corinne Wong, Carleigh Wood, Bokai Yu,
Pierre Andrews, Can Balioglu, Peng-Jen Chen, Marta R. Costa-jussà, Maha
Elbayad, Hongyu Gong, Francisco Guzmán, Kevin Heffernan, Somya Jain,
Justine Kao, Ann Lee, Xutai Ma, Alex Mourachko, Benjamin Peloquin, Juan Pino,
Sravya Popuri, Christophe Ropers, Safiyyah Saleem, Holger Schwenk, Anna Sun,
Paden Tomasello, Changhan Wang, Jeff Wang, Skyler Wang, Mary Williamson
- Abstract summary: We introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion.
First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model, SeamlessM4T v2.
We bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time.
- Score: 71.12826355107889
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large-scale automatic speech translation systems today lack key features that
help machine-mediated communication feel seamless when compared to
human-to-human dialogue. In this work, we introduce a family of models that
enable end-to-end expressive and multilingual translations in a streaming
fashion. First, we contribute an improved version of the massively multilingual
and multimodal SeamlessM4T model, SeamlessM4T v2. This newer model,
incorporating an updated UnitY2 framework, was trained on more low-resource
language data. SeamlessM4T v2 provides the foundation on which our next two
models are initialized. SeamlessExpressive enables translation that preserves
vocal styles and prosody. Compared to previous efforts in expressive speech
research, our work addresses certain underexplored aspects of prosody, such as
speech rate and pauses, while also preserving the style of one's voice. As for
SeamlessStreaming, our model leverages the Efficient Monotonic Multihead
Attention mechanism to generate low-latency target translations without waiting
for complete source utterances. As the first of its kind, SeamlessStreaming
enables simultaneous speech-to-speech/text translation for multiple source and
target languages. To ensure that our models can be used safely and responsibly,
we implemented the first known red-teaming effort for multimodal machine
translation, a system for the detection and mitigation of added toxicity, a
systematic evaluation of gender bias, and an inaudible localized watermarking
mechanism designed to dampen the impact of deepfakes. Finally, we bring
major components from SeamlessExpressive and SeamlessStreaming together to form
Seamless, the first publicly available system that unlocks expressive
cross-lingual communication in real-time. The contributions to this work are
publicly released and accessible at
https://github.com/facebookresearch/seamless_communication
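The streaming behavior described in the abstract reduces to a read/write policy: at each step the system either ingests more source speech or commits a target token, so translation can begin before the speaker finishes. The sketch below is a minimal toy of that loop, assuming a stubbed scoring function in place of EMMA's learned monotonic attention; the chunking, the 1:1 length heuristic, and all names are hypothetical, not the Seamless implementation.

```python
# Toy simultaneous-translation loop, loosely inspired by monotonic
# attention policies such as EMMA. NOT the Seamless codebase: the
# policy network is stubbed with a deterministic pseudo-random score,
# and target tokens are placeholder strings.
import random


def toy_write_probability(n_read: int, n_written: int) -> float:
    """Stand-in for a learned policy that scores whether enough source
    context has arrived to emit the next target token."""
    return random.Random(n_read * 31 + n_written).random()


def simultaneous_translate(source_chunks, threshold=0.5):
    """Interleave READ (consume one source chunk) and WRITE (emit one
    target token). Writing can start before the full source utterance
    has arrived, which is what keeps latency low."""
    read, written, target = 0, 0, []
    while written < len(source_chunks):  # toy 1:1 length assumption
        can_read = read < len(source_chunks)
        if can_read and (read == 0 or toy_write_probability(read, written) <= threshold):
            read += 1                       # READ: take in the next source chunk
        else:
            target.append(f"tok{written}")  # WRITE: emit a stub target token
            written += 1
    return target


if __name__ == "__main__":
    print(simultaneous_translate([f"src{i}" for i in range(6)]))
```

In the real system the write decision would come from learned monotonic attention over encoder states, and the emitted units would be vocoded into speech rather than printed as strings.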
Related papers
- SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought [12.54786997634534]
This work proposes SeamlessExpressiveLM, a single speech language model for expressive S2ST.
We decompose the complex source-to-target speech mapping into intermediate generation steps with chain-of-thought prompting.
The model is first guided to translate the semantic content and then to transfer the speaker style to multi-stream acoustic units.
arXiv Detail & Related papers (2024-05-30T18:28:31Z)
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce TransVIP, a novel model framework that leverages diverse datasets in a cascaded fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Direct Punjabi to English speech translation using discrete units [4.883313216485195]
We present a direct speech-to-speech translation model from Punjabi, an Indic language, to English.
We also explore using a discrete representation of speech, called discrete acoustic units, as input to the Transformer-based translation model.
Our results show that the U2UT model outperforms the Speech-to-Unit Translation (S2UT) model by 3.69 BLEU.
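For context on the "discrete acoustic units" in this entry: such units are typically obtained by clustering continuous speech features and replacing each frame with the index of its nearest cluster, so an utterance becomes a sequence of integers. The sketch below illustrates only that quantization step, with random stand-in features and a random codebook; it does not reflect this paper's actual feature extractor or unit inventory.

```python
# Toy quantization of continuous "speech features" into discrete units.
# Real systems learn the codebook with k-means over features from a
# self-supervised speech encoder; everything here is a placeholder.
import numpy as np

rng = np.random.default_rng(0)
frames = rng.normal(size=(120, 16))    # 120 frames x 16-dim features (stand-in)
codebook = rng.normal(size=(100, 16))  # 100 "unit" centroids (stand-in)

# Nearest-centroid assignment: unit[t] = argmin_k ||frames[t] - codebook[k]||
dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
units = dists.argmin(axis=1)

# Collapse consecutive repeats, as unit-based pipelines commonly do,
# so the downstream translation model sees a shorter discrete sequence.
deduped = [int(units[0])]
for u in units[1:]:
    if u != deduped[-1]:
        deduped.append(int(u))
print(len(units), "frames ->", len(deduped), "units")
```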
arXiv Detail & Related papers (2024-02-25T03:03:34Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- BeAts: Bengali Speech Acts Recognition using Multimodal Attention Fusion [0.0]
We develop a novel approach combining two models, wav2vec2.0 for audio and MarianMT for text translation, to predict speech acts.
We call the resulting model BeAts (Bengali speech acts recognition using Multimodal Attention Fusion).
arXiv Detail & Related papers (2023-06-05T08:12:17Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
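As a rough illustration of the joint speech-text masking objective this entry describes (with made-up shapes and mask ratios, not ERNIE-SAT's settings), the snippet below masks random spectrogram frames and phoneme tokens that a model would then be trained to reconstruct from the surrounding context.

```python
# Toy joint masking of a spectrogram and a phoneme sequence, in the
# spirit of masked-prediction pretraining. Shapes, the 15% ratio, and
# the mask symbols are illustrative placeholders only.
import numpy as np

rng = np.random.default_rng(0)
spectrogram = rng.normal(size=(200, 80))  # 200 frames x 80 mel bins (stand-in)
phonemes = ["DH", "IH", "S", "IH", "Z", "AH", "T", "EH", "S", "T"]  # stand-in tokens

# Mask ~15% of frames (zeroed) and ~15% of phoneme positions ("<mask>").
frame_mask = rng.random(len(spectrogram)) < 0.15
spectrogram[frame_mask] = 0.0
masked_phonemes = ["<mask>" if rng.random() < 0.15 else p for p in phonemes]

# A pretraining loss would then ask the model to reconstruct the masked
# frames and tokens from the unmasked context in both modalities.
print(int(frame_mask.sum()), "frames masked;",
      masked_phonemes.count("<mask>"), "phonemes masked")
```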
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to directly translate speech in one language into text in another language.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- Speaker Independent and Multilingual/Mixlingual Speech-Driven Talking Head Generation Using Phonetic Posteriorgrams [58.617181880383605]
In this work, we propose a novel approach using phonetic posteriorgrams.
Our method doesn't need hand-crafted features and is more robust to noise compared to recent approaches.
Our model is the first to support multilingual/mixlingual speech as input with convincing results.
arXiv Detail & Related papers (2020-06-20T16:32:43Z)