Textless Low-Resource Speech-to-Speech Translation With Unit Language Models
- URL: http://arxiv.org/abs/2305.15405v2
- Date: Tue, 20 Feb 2024 18:55:52 GMT
- Title: Textless Low-Resource Speech-to-Speech Translation With Unit Language Models
- Authors: Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi
- Abstract summary: We present a new framework for training textless low-resource speech-to-speech translation (S2ST) systems.
We reformulate S2ST as a unit-to-unit seq2seq translation task and start by pretraining a model on large-scale monolingual speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing speech-to-speech translation models fall into two camps: textless
models trained with hundreds of hours of parallel speech data or unsupervised
models that leverage text as an intermediate step. Both approaches limit
building speech-to-speech translation models for a wide range of languages, as
they exclude languages that are primarily spoken and language pairs that lack
large-scale parallel speech data. We present a new framework for training
textless low-resource speech-to-speech translation (S2ST) systems that only
need dozens of hours of parallel speech data. We reformulate S2ST as a
unit-to-unit seq2seq translation task, and start by pretraining a model on
large-scale monolingual speech data. Then, we finetune it with a small amount
of parallel speech data (20-60 hours). Lastly, we improve model performance
through an unsupervised backtranslation objective. We train and evaluate our
models for English-to-German, German-to-English and Marathi-to-English
translation on three different domains (European Parliament, Common Voice, and
All India Radio) with single-speaker synthesized speech data. Evaluated using
the ASR-BLEU metric, our models achieve reasonable performance on all three
domains, with some being within 1-2 points of our supervised topline.
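To make the unit-to-unit reformulation concrete, here is a minimal sketch of how frame-level speech features are typically quantized into discrete units in pipelines of this style; the feature source, the vocabulary size of 100, and the deduplication step are illustrative assumptions, with random vectors standing in for real self-supervised features.

```python
import numpy as np
from sklearn.cluster import KMeans

# Random vectors stand in for frame-level self-supervised speech features
# (e.g., from a HuBERT-style encoder); shape: (n_frames, feature_dim).
features = np.random.randn(2000, 768).astype(np.float32)

# Learn a discrete "unit" vocabulary by clustering the feature space.
# The vocabulary size (100 clusters) is an illustrative assumption.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(features)

# Map each frame to its nearest cluster: a unit sequence for one utterance.
units = kmeans.predict(features)

# Collapse runs of repeated units, a common step in unit-based S2ST,
# so the seq2seq model sees shorter token sequences.
deduped = [int(u) for i, u in enumerate(units) if i == 0 or u != units[i - 1]]
print(len(units), "->", len(deduped), "tokens")
```

Once both languages are represented as such token sequences, translation reduces to ordinary discrete seq2seq modeling, which is what allows pretraining on monolingual unit data and then finetuning on only 20-60 hours of parallel units.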
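The unsupervised backtranslation stage can likewise be sketched at a high level. Everything below, including the ToyUnitTranslator class and its methods, is a hypothetical stand-in for real forward and reverse unit-translation models, not the paper's implementation.

```python
import random

# Toy stand-ins for unit-to-unit translation models. Real systems would be
# transformer seq2seq models over discrete speech units; this class is
# purely illustrative.
class ToyUnitTranslator:
    def translate(self, units):
        # Placeholder "translation": copy the sequence, randomly dropping units.
        return [u for u in units if random.random() > 0.1]

    def train_step(self, src_units, tgt_units):
        # Placeholder loss; a real model would compute token-level
        # cross-entropy and update its parameters here.
        return abs(len(src_units) - len(tgt_units)) / max(len(tgt_units), 1)

def backtranslation_step(fwd, bwd, tgt_mono_units):
    """One unsupervised backtranslation update on target-side monolingual
    units: synthesize pseudo-source units with the reverse model, then train
    the forward model to map them back to the real target units."""
    pseudo_src = bwd.translate(tgt_mono_units)
    return fwd.train_step(pseudo_src, tgt_mono_units)

fwd, bwd = ToyUnitTranslator(), ToyUnitTranslator()
print(backtranslation_step(fwd, bwd, [3, 3, 17, 42, 42, 7]))
```

The key property is that the round trip needs only monolingual speech on the target side, which is exactly the data regime the paper targets.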
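Finally, a hedged sketch of the ASR-BLEU evaluation named in the abstract: transcribe the translated speech with an ASR system, then compute corpus BLEU against reference translations. The transcribe function here is a placeholder that passes text through; only the sacrebleu call reflects a real library API.

```python
import sacrebleu

def transcribe(translated_speech):
    # Placeholder ASR: a real system would decode audio into text; here the
    # "speech" is already text, so it is passed through unchanged.
    return translated_speech

def asr_bleu(translated_speech, reference_texts):
    """Score S2ST output by transcribing it with ASR and computing corpus
    BLEU against reference translations."""
    hypotheses = [transcribe(x) for x in translated_speech]
    return sacrebleu.corpus_bleu(hypotheses, [reference_texts]).score

# Toy call: an exact match scores 100.0.
print(asr_bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))
```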
Related papers
- Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data? [49.42189569058647]
Two-pass direct speech-to-speech translation (S2ST) models decompose the task into speech-to-text translation (S2TT) and text-to-speech (TTS) synthesis.
In this paper, we introduce a composite S2ST model named ComSpeech, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
We also propose a novel training method ComSpeech-ZS that solely utilizes S2TT and TTS data.
arXiv Detail & Related papers (2024-06-11T14:17:12Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity because parallel corpora pairing source-language speech with target-language speech are very rare.
We propose Speech2S, a model jointly pre-trained on unpaired speech and bilingual text data for direct speech-to-speech translation.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.
To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
arXiv Detail & Related papers (2022-10-27T14:09:48Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that translates speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the system on real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)