Textless Low-Resource Speech-to-Speech Translation With Unit Language
Models
- URL: http://arxiv.org/abs/2305.15405v2
- Date: Tue, 20 Feb 2024 18:55:52 GMT
- Authors: Anuj Diwan, Anirudh Srinivasan, David Harwath, Eunsol Choi
- Abstract summary: We present a new framework for training textless low-resource speech-to-speech translation (S2ST) systems.
We reformulate S2ST as a unit-to-unit seq2seq translation task, and start by pretraining a model on large-scale monolingual speech data.
We train and evaluate our models for English-to-German, German-to-English and Marathi-to-English translation on three different domains.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing speech-to-speech translation models fall into two camps: textless
models trained with hundreds of hours of parallel speech data or unsupervised
models that leverage text as an intermediate step. Both approaches limit
building speech-to-speech translation models for a wide range of languages, as
they exclude languages that are primarily spoken and language pairs that lack
large-scale parallel speech data. We present a new framework for training
textless low-resource speech-to-speech translation (S2ST) systems that only
need dozens of hours of parallel speech data. We reformulate S2ST as a
unit-to-unit seq2seq translation task, and start by pretraining a model on
large-scale monolingual speech data. Then, we finetune it with a small amount
of parallel speech data (20-60 hours). Lastly, we improve model performance
through an unsupervised backtranslation objective. We train and evaluate our
models for English-to-German, German-to-English and Marathi-to-English
translation on three different domains (European Parliament, Common Voice, and
All India Radio) with single-speaker synthesized speech data. Evaluated using
the ASR-BLEU metric, our models achieve reasonable performance on all three
domains, with some being within 1-2 points of our supervised topline.
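The ASR-BLEU metric used above transcribes the model's translated speech with an off-the-shelf ASR system, then scores the transcripts against reference translations with BLEU. As a rough illustration of the scoring step only (the ASR stage, tokenization details, and smoothing are omitted, and this is not the authors' exact implementation), a minimal corpus-level BLEU with uniform 4-gram weights and a brevity penalty can be sketched in pure Python:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(transcripts, references, max_n=4):
    """Minimal corpus-level BLEU between ASR transcripts and references.

    Returns a score in [0, 100]; any zero n-gram precision yields 0.0
    (no smoothing, unlike production implementations such as sacrebleu).
    """
    matches = [0] * max_n   # clipped n-gram matches, per order
    totals = [0] * max_n    # candidate n-gram counts, per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(transcripts, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum((h_ngrams & r_ngrams).values())
            totals[n - 1] += sum(h_ngrams.values())
    precisions = [m / t if t else 0.0 for m, t in zip(matches, totals)]
    if min(precisions) == 0:
        return 0.0
    # Brevity penalty discourages overly short hypotheses.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

With perfect transcripts the metric reduces to ordinary BLEU and reaches 100; translation quality is then bounded by both the S2ST model and the ASR system used for transcription.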
Related papers
- MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation [45.558316325252335]
Multitask Speech Language Model (MSLM) is a decoder-only speech language model trained in a multitask setting.
Our model is able to support multilingual S2ST with speaker style preserved.
arXiv Detail & Related papers (2024-03-19T03:35:20Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- PolyVoice: Language Models for Speech to Speech Translation [50.31000706309143]
PolyVoice is a language model-based framework for speech-to-speech translation (S2ST).
We use discretized speech units, which are generated in a fully unsupervised way.
For the speech synthesis part, we adopt the existing VALL-E X approach and build a unit-based audio language model.
arXiv Detail & Related papers (2023-06-05T15:53:15Z)
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech [37.942466944970704]
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models.
To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets.
Experimental evaluation shows that multilingual TTS models trained on Virtuoso can achieve significantly better naturalness and intelligibility than baseline ones in seen languages.
arXiv Detail & Related papers (2022-10-27T14:09:48Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language directly into another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.