SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations
- URL: http://arxiv.org/abs/2211.04508v1
- Date: Tue, 8 Nov 2022 19:09:27 GMT
- Title: SpeechMatrix: A Large-Scale Mined Corpus of Multilingual Speech-to-Speech Translations
- Authors: Paul-Ambroise Duquenne, Hongyu Gong, Ning Dong, Jingfei Du, Ann Lee,
Vedanuj Goswani, Changhan Wang, Juan Pino, Benoît Sagot, Holger Schwenk
- Abstract summary: SpeechMatrix is a large-scale multilingual corpus of speech-to-speech translations.
It contains speech alignments in 136 language pairs with a total of 418 thousand hours of speech.
- Score: 38.058120432870126
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present SpeechMatrix, a large-scale multilingual corpus of
speech-to-speech translations mined from real speech of European Parliament
recordings. It contains speech alignments in 136 language pairs with a total of
418 thousand hours of speech. To evaluate the quality of this parallel speech,
we train bilingual speech-to-speech translation models on mined data only and
establish extensive baseline results on EuroParl-ST, VoxPopuli and FLEURS test
sets. Enabled by the multilinguality of SpeechMatrix, we also explore
multilingual speech-to-speech translation, a topic which was addressed by few
other works. We also demonstrate that model pre-training and sparse scaling
using Mixture-of-Experts bring large gains to translation performance. The
mined data and models are freely available.
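The corpus-size figures in the abstract can be sanity-checked with a little arithmetic. The sketch below assumes that each of the 136 language pairs is an unordered pair of two distinct languages. The abstract does not say whether translation directions are counted separately, so treat this as an assumption. Under that assumption, 136 = n(n - 1)/2 gives n = 17 languages. Dividing the 418 thousand hours by 136 gives only a rough per-pair average; the abstract does not report per-pair sizes, and real pairs will differ.

```python
import math

TOTAL_PAIRS = 136       # language pairs reported in the abstract
TOTAL_HOURS = 418_000   # total hours of speech reported in the abstract


def languages_for_pairs(pairs: int) -> int:
    """Solve n * (n - 1) / 2 == pairs for n.

    Assumption: pairs are unordered pairs of distinct languages.
    Raises ValueError if `pairs` is not a triangular number.
    """
    # Quadratic formula for n^2 - n - 2*pairs = 0, using exact integer sqrt.
    n = (1 + math.isqrt(1 + 8 * pairs)) // 2
    if n * (n - 1) // 2 != pairs:
        raise ValueError(f"{pairs} is not a triangular number")
    return n


n_langs = languages_for_pairs(TOTAL_PAIRS)
# Mean hours per pair; a naive average, not a per-pair statistic from the paper.
avg_hours = TOTAL_HOURS / TOTAL_PAIRS

print(n_langs)           # 17
print(round(avg_hours))  # 3074
```

If directions were instead counted separately (ordered pairs, n(n - 1) = 136), no integer n would fit. That is one reason the unordered reading is the more plausible assumption.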
Related papers
- Cross-Lingual Transfer Learning for Speech Translation [7.802021866251242]
This paper examines how to expand the speech translation capability of speech foundation models with restricted data.
Whisper, a speech foundation model with strong performance on speech recognition and English translation, is used as the example model.
Using speech-to-speech retrieval to analyse the audio representations generated by the encoder, we show that utterances from different languages are mapped to a shared semantic space.
(2024-07-01T09:51:48Z)
- TranSentence: Speech-to-speech Translation via Language-agnostic Sentence-level Speech Encoding without Language-parallel Data [44.83532231917504]
TranSentence is a novel speech-to-speech translation approach that requires no language-parallel speech data.
We train our model to generate speech based on the encoded embedding obtained from a language-agnostic sentence-level speech encoder.
We extend TranSentence to multilingual speech-to-speech translation.
(2024-01-17T11:52:40Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
(2023-08-22T17:44:18Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
(2023-06-22T14:37:54Z)
- Textless Speech-to-Speech Translation With Limited Parallel Data [51.3588490789084]
PFB is a framework for training textless speech-to-speech translation (S2ST) models that require just dozens of hours of parallel speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
(2023-05-24T17:59:05Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach proves effective on end-to-end speech-to-text translation tasks.
(2023-05-24T07:42:15Z)
- Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
(2023-05-22T22:09:41Z)
- FT Speech: Danish Parliament Speech Corpus [21.190182627955817]
This paper introduces FT Speech, a new speech corpus created from the recorded meetings of the Danish Parliament.
The corpus contains over 1,800 hours of transcribed speech by a total of 434 speakers.
It is significantly larger in duration, vocabulary, and amount of spontaneous speech than the existing public speech corpora for Danish.
(2020-05-25T19:51:18Z)