SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
- URL: http://arxiv.org/abs/2308.11596v3
- Date: Wed, 25 Oct 2023 03:52:07 GMT
- Title: SeamlessM4T: Massively Multilingual & Multimodal Machine Translation
- Authors: Seamless Communication, Lo\"ic Barrault, Yu-An Chung, Mariano Cora
Meglioli, David Dale, Ning Dong, Paul-Ambroise Duquenne, Hady Elsahar, Hongyu
Gong, Kevin Heffernan, John Hoffman, Christopher Klaiber, Pengwei Li, Daniel
Licht, Jean Maillard, Alice Rakotoarison, Kaushik Ram Sadagopan, Guillaume
Wenzek, Ethan Ye, Bapi Akula, Peng-Jen Chen, Naji El Hachem, Brian Ellis,
Gabriel Mejia Gonzalez, Justin Haaheim, Prangthip Hansanti, Russ Howes,
Bernie Huang, Min-Jae Hwang, Hirofumi Inaguma, Somya Jain, Elahe Kalbassi,
Amanda Kallet, Ilia Kulikov, Janice Lam, Daniel Li, Xutai Ma, Ruslan
Mavlyutov, Benjamin Peloquin, Mohamed Ramadan, Abinesh Ramakrishnan, Anna
Sun, Kevin Tran, Tuan Tran, Igor Tufanov, Vish Vogeti, Carleigh Wood, Yilin
Yang, Bokai Yu, Pierre Andrews, Can Balioglu, Marta R. Costa-juss\`a, Onur
Celebi, Maha Elbayad, Cynthia Gao, Francisco Guzm\'an, Justine Kao, Ann Lee,
Alexandre Mourachko, Juan Pino, Sravya Popuri, Christophe Ropers, Safiyyah
Saleem, Holger Schwenk, Paden Tomasello, Changhan Wang, Jeff Wang, Skyler
Wang
- Abstract summary: We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
- Score: 90.71078166159295
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: What does it take to create the Babel Fish, a tool that can help individuals
translate speech between any two languages? While recent breakthroughs in
text-based models have pushed machine translation coverage beyond 200
languages, unified speech-to-speech translation models have yet to achieve
similar strides. More specifically, conventional speech-to-speech translation
systems rely on cascaded systems that perform translation progressively,
putting high-performing unified systems out of reach. To address these gaps, we
introduce SeamlessM4T, a single model that supports speech-to-speech
translation, speech-to-text translation, text-to-speech translation,
text-to-text translation, and automatic speech recognition for up to 100
languages. To build this, we used 1 million hours of open speech audio data to
learn self-supervised speech representations with w2v-BERT 2.0. Subsequently,
we created a multimodal corpus of automatically aligned speech translations.
Filtered and combined with human-labeled and pseudo-labeled data, we developed
the first multilingual system capable of translating from and into English for
both speech and text. On FLEURS, SeamlessM4T sets a new standard for
translations into multiple target languages, achieving an improvement of 20%
BLEU over the previous SOTA in direct speech-to-text translation. Compared to
strong cascaded models, SeamlessM4T improves the quality of into-English
translation by 1.3 BLEU points in speech-to-text and by 2.6 ASR-BLEU points in
speech-to-speech. Tested for robustness, our system performs better against
background noises and speaker variations in speech-to-text tasks compared to
the current SOTA model. Critically, we evaluated SeamlessM4T on gender bias and
added toxicity to assess translation safety. Finally, all contributions in this
work are open-sourced and accessible at
https://github.com/facebookresearch/seamless_communication
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z) - Direct Punjabi to English speech translation using discrete units [4.883313216485195]
We present a direct speech-to-speech translation model for one of the Indic languages called Punjabi to English.
We also explore the performance of using a discrete representation of speech called discrete acoustic units as input to the Transformer-based translation model.
Our results show that the U2UT model performs better than the Speech-to-Unit Translation (S2UT) model by a 3.69 BLEU score.
arXiv Detail & Related papers (2024-02-25T03:03:34Z) - Seamless: Multilingual Expressive and Streaming Speech Translation [71.12826355107889]
We introduce a family of models that enable end-to-end expressive and multilingual translations in a streaming fashion.
First, we contribute an improved version of the massively multilingual and multimodal SeamlessM4T model- SeamlessM4T v2.
We bring major components from SeamlessExpressive and SeamlessStreaming together to form Seamless, the first publicly available system that unlocks expressive cross-lingual communication in real-time.
arXiv Detail & Related papers (2023-12-08T17:18:42Z) - Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless S2ST models that require just dozens of hours of parallel speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
arXiv Detail & Related papers (2023-05-24T17:59:05Z) - ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text
Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z) - SpeechMatrix: A Large-Scale Mined Corpus of Multilingual
Speech-to-Speech Translations [38.058120432870126]
SpeechMatrix is a large-scale multilingual corpus of speech-to-speech translations.
It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech.
arXiv Detail & Related papers (2022-11-08T19:09:27Z) - Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another language.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.