A Weakly-Supervised Streaming Multilingual Speech Model with Truly
Zero-Shot Capability
- URL: http://arxiv.org/abs/2211.02499v2
- Date: Wed, 5 Jul 2023 22:30:53 GMT
- Title: A Weakly-Supervised Streaming Multilingual Speech Model with Truly
Zero-Shot Capability
- Authors: Jian Xue, Peidong Wang, Jinyu Li, Eric Sun
- Abstract summary: Streaming Multilingual Speech Model (SM2) can transcribe or translate multiple spoken languages into texts of the target language.
SM2 models are trained using weakly supervised data generated by converting the transcriptions in speech recognition corpora with a machine translation service.
With 351 thousand hours of anonymized speech training data from 25 languages, SM2 models achieve comparable or even better ST quality than some recent popular large-scale non-streaming speech models.
- Score: 32.524627464837636
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we introduce our work of building a Streaming Multilingual
Speech Model (SM2), which can transcribe or translate multiple spoken languages
into texts of the target language. The backbone of SM2 is the Transformer
Transducer, which has strong streaming capability. Instead of human-labeled
speech translation (ST) data, SM2 models are trained using weakly supervised
data generated by converting the transcriptions in speech recognition corpora
with a machine translation service. With 351 thousand hours of anonymized
speech training data from 25 languages, SM2 models achieve comparable or even
better ST quality than some recent popular large-scale non-streaming speech
models. More importantly, we show that SM2 has truly zero-shot capability
when expanding to new target languages, yielding high-quality ST results for
{source-speech, target-text} pairs that were not seen during training.
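The weak-supervision recipe described in the abstract can be sketched concretely: each ASR transcript is machine-translated into the target language while the audio is left untouched, yielding pseudo {source-speech, target-text} training pairs. The sketch below is a minimal illustration under stated assumptions, not the authors' actual pipeline; `translate` stands in for any machine translation service, and the record fields are hypothetical.

```python
# Minimal sketch of the weak-supervision recipe from the abstract:
# convert an ASR corpus (audio + transcript) into pseudo speech-translation
# pairs by machine-translating each transcript into the target language.
# `translate` is a placeholder for an MT service; field names are
# hypothetical, not the authors' actual schema.
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass
class AsrExample:
    audio_path: str   # source-language speech
    transcript: str   # human transcription in the source language
    source_lang: str  # e.g. "de"


@dataclass
class StExample:
    audio_path: str   # the same source-language speech
    target_text: str  # machine-translated pseudo label
    source_lang: str
    target_lang: str


def make_weakly_supervised_st(
    asr_corpus: Iterable[AsrExample],
    target_lang: str,
    translate: Callable[[str, str, str], str],  # (text, src, tgt) -> text
) -> Iterator[StExample]:
    """Yield pseudo ST pairs: the audio stays as-is, only the transcript
    is replaced by its machine translation into the target language."""
    for ex in asr_corpus:
        pseudo_label = translate(ex.transcript, ex.source_lang, target_lang)
        yield StExample(
            audio_path=ex.audio_path,
            target_text=pseudo_label,
            source_lang=ex.source_lang,
            target_lang=target_lang,
        )
```

Because only the text side changes, covering an additional target language amounts to re-running the translation step over the same audio, which is consistent with the inexpensive target-language expansion the abstract describes.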
Related papers
- GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement [36.29371629234269]
GigaSpeech 2 is a large-scale, multi-domain, multilingual speech recognition corpus.
It comprises about 30,000 hours of automatically transcribed speech in languages including Thai, Indonesian, and Vietnamese.
arXiv Detail & Related papers (2024-06-17T13:44:20Z)
- MSLM-S2ST: A Multitask Speech Language Model for Textless Speech-to-Speech Translation with Speaker Style Preservation [45.558316325252335]
Multitask Speech Language Model (MSLM) is a decoder-only speech language model trained in a multitask setting.
Our model is able to support multilingual S2ST with speaker style preserved.
arXiv Detail & Related papers (2024-03-19T03:35:20Z)
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability on the basis of discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- SeamlessM4T: Massively Multilingual & Multimodal Machine Translation [90.71078166159295]
We introduce SeamlessM4T, a single model that supports speech-to-speech translation, speech-to-text translation, text-to-text translation, and automatic speech recognition for up to 100 languages.
We developed the first multilingual system capable of translating from and into English for both speech and text.
On FLEURS, SeamlessM4T sets a new standard for translations into multiple target languages, achieving an improvement of 20% BLEU over the previous SOTA in direct speech-to-text translation.
arXiv Detail & Related papers (2023-08-22T17:44:18Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- MParrotTTS: Multilingual Multi-speaker Text to Speech Synthesis in Low Resource Setting [16.37243395952266]
MParrotTTS is a unified multilingual, multi-speaker text-to-speech (TTS) synthesis model.
It adapts to a new language with minimal supervised data and generalizes to languages not seen while training the self-supervised backbone.
We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.
arXiv Detail & Related papers (2023-05-19T13:43:36Z)
- Joint Pre-Training with Speech and Bilingual Text for Direct Speech to Speech Translation [94.80029087828888]
Direct speech-to-speech translation (S2ST) is an attractive research topic with many advantages compared to cascaded S2ST.
Direct S2ST suffers from data scarcity because parallel corpora pairing source-language speech with target-language speech are very rare.
We propose in this paper a Speech2S model, which is jointly pre-trained with unpaired speech and bilingual text data for direct speech-to-speech translation tasks.
arXiv Detail & Related papers (2022-10-31T02:55:51Z)
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)