Multilingual Speech-to-Speech Translation into Multiple Target Languages
- URL: http://arxiv.org/abs/2307.08655v1
- Date: Mon, 17 Jul 2023 17:12:44 GMT
- Title: Multilingual Speech-to-Speech Translation into Multiple Target Languages
- Authors: Hongyu Gong, Ning Dong, Sravya Popuri, Vedanuj Goswami, Ann Lee, Juan
Pino
- Abstract summary: Speech-to-speech translation (S2ST) enables spoken communication between people speaking different languages.
We present the first work on multilingual S2ST supporting multiple target languages.
Leveraging recent advances in direct S2ST with speech-to-unit translation and vocoder models, we equip these key components with multilingual capability.
- Score: 23.427886305667833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech-to-speech translation (S2ST) enables spoken communication between
  people speaking different languages. Although a few studies have explored
  multilingual S2ST, they focus on multilinguality on the source side, i.e.,
  translation from multiple source languages into one target language. We present
  the first work on multilingual S2ST supporting multiple target languages.
  Leveraging recent advances in direct S2ST with speech-to-unit (S2U) translation
  and vocoders, we equip these key components with multilingual capability.
  Speech-to-masked-unit (S2MU) is the multilingual extension of S2U, which masks
  units that do not belong to the given target language to reduce language
  interference. We also propose a multilingual vocoder trained with language
  embeddings and an auxiliary language-identification loss. On benchmark
  translation test sets, our proposed multilingual model outperforms bilingual
  models in translation from English into $16$ target languages.
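The two multilingual components described in the abstract can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mask_unit_logits`, `language_id_aux_loss`, `vocoder_total_loss`), the scalar-list interface, and the `aux_weight` hyperparameter are all hypothetical. The first function shows S2MU-style masking (logits of units outside the target language's unit set are suppressed before the softmax); the others show an auxiliary language-identification cross-entropy term added to the vocoder training objective.

```python
import math

def mask_unit_logits(logits, unit_vocab, target_lang_units):
    """S2MU-style masking: suppress units outside the target language.

    logits            -- one score per unit in unit_vocab
    unit_vocab        -- global list of discrete unit ids
    target_lang_units -- set of unit ids valid for the target language
    """
    neg_inf = float("-inf")
    return [
        logit if unit in target_lang_units else neg_inf
        for logit, unit in zip(logits, unit_vocab)
    ]

def language_id_aux_loss(lang_logits, target_lang_idx):
    """Cross-entropy language-identification loss, the auxiliary term
    used when training the multilingual vocoder."""
    m = max(lang_logits)  # stabilise the log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in lang_logits))
    return log_z - lang_logits[target_lang_idx]

def vocoder_total_loss(recon_loss, lang_logits, target_lang_idx, aux_weight=0.1):
    """Vocoder objective: reconstruction loss plus a weighted
    language-ID term; aux_weight is an assumed hyperparameter."""
    return recon_loss + aux_weight * language_id_aux_loss(lang_logits, target_lang_idx)
```

Units masked to negative infinity receive zero probability under a subsequent softmax, which is how the sketch realises "reducing language interference": the decoder cannot emit units from other target languages.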
Related papers
- MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation [45.558316325252335]
Multitask Speech Language Model (MSLM) is a decoder-only speech language model trained in a multitask setting.
Our model is able to support multilingual S2ST with speaker style preserved.
arXiv Detail & Related papers (2024-03-19T03:35:20Z) - Towards a Deep Understanding of Multilingual End-to-End Speech Translation [52.26739715012842]
We analyze representations learnt in a multilingual end-to-end speech translation model trained over 22 languages.
We derive three major findings from our analysis.
arXiv Detail & Related papers (2023-10-31T13:50:55Z) - Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z) - Scaling Speech Technology to 1,000+ Languages [66.31120979098483]
The Massively Multilingual Speech (MMS) project increases the number of supported languages by 10-40x, depending on the task.
The main ingredient is a new dataset based on readings of publicly available religious texts.
We built pre-trained wav2vec 2.0 models covering 1,406 languages, a single multilingual automatic speech recognition model for 1,107 languages, speech synthesis models for the same number of languages, and a language identification model for 4,017 languages.
arXiv Detail & Related papers (2023-05-22T22:09:41Z) - Magic dust for cross-lingual adaptation of monolingual wav2vec-2.0 [7.378368959253632]
We show that a monolingual wav2vec-2.0 is a good few-shot ASR learner in several languages.
A key finding of this work is that the adapted monolingual wav2vec-2.0 achieves similar performance as the topline multilingual XLSR model.
arXiv Detail & Related papers (2021-10-07T15:29:22Z) - Discovering Representation Sprachbund For Multilingual Pre-Training [139.05668687865688]
We generate language representation from multilingual pre-trained models and conduct linguistic analysis.
We cluster all the target languages into multiple groups and name each group as a representation sprachbund.
Experiments are conducted on cross-lingual benchmarks and significant improvements are achieved compared to strong baselines.
arXiv Detail & Related papers (2021-09-01T09:32:06Z) - Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z) - CoVoST 2 and Massively Multilingual Speech-to-Text Translation [24.904548615918355]
CoVoST 2 is a large-scale multilingual speech translation corpus covering translations from 21 languages into English and from English into 15 languages.
This represents the largest open dataset available to date in terms of total volume and language coverage.
arXiv Detail & Related papers (2020-07-20T17:53:35Z) - Cross-lingual Multispeaker Text-to-Speech under Limited-Data Scenario [10.779568857641928]
This paper presents an extension on Tacotron2 to achieve bilingual multispeaker speech synthesis.
We achieve cross-lingual synthesis, including code-switching cases, between English and Mandarin for monolingual speakers.
arXiv Detail & Related papers (2020-05-21T03:03:34Z) - Learning to Scale Multilingual Representations for Vision-Language Tasks [51.27839182889422]
The effectiveness of SMALR is demonstrated with ten diverse languages, over twice the number supported in vision-language tasks to date.
We evaluate on multilingual image-sentence retrieval and outperform prior work by 3-4% with less than 1/5th the training parameters compared to other word embedding methods.
arXiv Detail & Related papers (2020-04-09T01:03:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.