Mu$^{2}$SLAM: Multitask, Multilingual Speech and Language Models
- URL: http://arxiv.org/abs/2212.09553v2
- Date: Tue, 27 Jun 2023 01:18:45 GMT
- Title: Mu$^{2}$SLAM: Multitask, Multilingual Speech and Language Models
- Authors: Yong Cheng, Yu Zhang, Melvin Johnson, Wolfgang Macherey, Ankur Bapna
- Abstract summary: We present Mu$^{2}$SLAM, a multilingual sequence-to-sequence model pre-trained jointly on unlabeled speech, unlabeled text and supervised data in over 100 languages.
By leveraging a quantized representation of speech as a target, Mu$^{2}$SLAM trains the speech-text models with a sequence-to-sequence masked denoising objective similar to T5 on the decoder and a masked language modeling (MLM) objective on the encoder.
On Voxpopuli ASR, our model matches the performance of an mSLAM model fine-tuned with an RNN-T decoder.
- Score: 37.44999077096415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Mu$^{2}$SLAM, a multilingual sequence-to-sequence model
pre-trained jointly on unlabeled speech, unlabeled text and supervised data
spanning Automatic Speech Recognition (ASR), Automatic Speech Translation (AST)
and Machine Translation (MT), in over 100 languages. By leveraging a quantized
representation of speech as a target, Mu$^{2}$SLAM trains the speech-text
models with a sequence-to-sequence masked denoising objective similar to T5 on
the decoder and a masked language modeling (MLM) objective on the encoder, for
both unlabeled speech and text, while utilizing the supervised tasks to improve
cross-lingual and cross-modal representation alignment within the model. On
CoVoST AST, Mu$^{2}$SLAM establishes a new state-of-the-art for models trained
on public datasets, improving on xx-en translation over the previous best by
1.9 BLEU points and on en-xx translation by 1.1 BLEU points. On Voxpopuli ASR,
our model matches the performance of an mSLAM model fine-tuned with an RNN-T
decoder, despite using a relatively weaker sequence-to-sequence architecture.
On text understanding tasks, our model improves by more than 6% over mSLAM on
XNLI, getting closer to the performance of mT5 models of comparable capacity on
XNLI and TydiQA, paving the way towards a single model for all speech and text
understanding tasks.
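To make the pre-training recipe in the abstract concrete, here is a minimal sketch, not the authors' implementation: it assumes speech has already been mapped to discrete units by some quantizer and shows how the same T5-style span corruption can yield an encoder-side MLM input and a decoder-side reconstruction target for both modalities. The vocabulary sizes, mask sentinel, and toy quantizer are all placeholders.

```python
import random

# Hypothetical vocabulary layout: text subwords and quantized speech units share
# one target space, so both modalities can be trained with the same objectives.
TEXT_VOCAB_SIZE = 32_000          # assumed subword inventory
SPEECH_CODEBOOK_SIZE = 1_024      # assumed number of quantized speech units
MASK_ID = 0                       # assumed sentinel used for encoder-side masking
SPEECH_OFFSET = TEXT_VOCAB_SIZE   # speech units are offset past the text ids

def quantize_speech(frames, codebook_size=SPEECH_CODEBOOK_SIZE):
    """Stand-in for a learned quantizer: map each feature frame to a discrete id."""
    return [SPEECH_OFFSET + (hash(tuple(f)) % codebook_size) for f in frames]

def mask_spans(tokens, mask_ratio=0.3, mean_span=3):
    """T5-style span corruption: replace sampled spans with MASK_ID and keep
    the original tokens as the decoder's reconstruction target."""
    tokens = list(tokens)
    corrupted, targets = [], []
    i = 0
    while i < len(tokens):
        if random.random() < mask_ratio / mean_span:
            span = tokens[i:i + mean_span]
            corrupted.append(MASK_ID)      # encoder sees a mask; the MLM head predicts here
            targets.extend(span)           # decoder reconstructs the masked content
            i += len(span)
        else:
            corrupted.append(tokens[i])
            i += 1
    return corrupted, targets

# Unlabeled text: mask subword ids directly.
text_ids = [17, 942, 5, 10310, 77, 6, 2203, 44]
enc_in_text, dec_tgt_text = mask_spans(text_ids)

# Unlabeled speech: quantize frames first, then apply the same corruption.
speech_frames = [[random.random() for _ in range(4)] for _ in range(12)]
speech_ids = quantize_speech(speech_frames)
enc_in_speech, dec_tgt_speech = mask_spans(speech_ids)

print("text   encoder input:", enc_in_text, "decoder target:", dec_tgt_text)
print("speech encoder input:", enc_in_speech, "decoder target:", dec_tgt_speech)
```

Supervised ASR, AST and MT pairs are then fed through the same encoder-decoder as ordinary sequence-to-sequence examples in this shared token space, which is how the abstract's cross-lingual and cross-modal alignment is encouraged.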
Related papers
- SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at rates as low as 5 Hz and 60 bps and achieves SotA in syllabic segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
arXiv Detail & Related papers (2024-10-05T04:29:55Z)
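As a rough illustration of the unit-merging idea in the SyllableLM entry above (not its actual boundary-learning method), the sketch below mean-pools frame features within assumed syllable-like boundaries and snaps each pooled vector to the nearest entry of an assumed codebook; producing fewer, coarser units per second is what drives the low unit rate and bitrate.

```python
import numpy as np

def merge_to_units(frames, boundaries, centroids):
    """Mean-pool frame features within each (start, end) segment, then assign
    each pooled vector to its nearest centroid to get a coarse discrete unit."""
    units = []
    for start, end in boundaries:
        pooled = frames[start:end].mean(axis=0)                 # one vector per segment
        distances = np.linalg.norm(centroids - pooled, axis=1)  # distance to each centroid
        units.append(int(distances.argmin()))                   # discrete unit id
    return units

rng = np.random.default_rng(0)
frames = rng.normal(size=(50, 16))          # 50 feature frames, 16 dims (placeholder features)
boundaries = [(0, 12), (12, 27), (27, 50)]  # assumed syllable-like segment boundaries
centroids = rng.normal(size=(64, 16))       # assumed codebook of 64 unit centroids

print(merge_to_units(frames, boundaries, centroids))  # three unit ids for ~1 s of speech
```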
- Chain-of-Thought Prompting for Speech Translation [33.77037760225061]
Large language models (LLMs) have demonstrated remarkable advancements in language understanding and generation.
Recent research has adapted these models to use speech embeddings for prompting, resulting in Speech-LLM models that exhibit strong performance.
We propose a novel approach to leverage ASR transcripts as prompts for AST in a Speech-LLM built on an encoder-decoder text LLM.
arXiv Detail & Related papers (2024-09-17T20:16:43Z)
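A minimal sketch of what prompting with an ASR transcript could look like for the entry above; the template, field names, and wording are invented for illustration and are not the paper's prompt format.

```python
# Chain-of-thought style prompting for speech translation: the ASR hypothesis is
# surfaced as an intermediate step before asking for the translation.

def build_ast_prompt(asr_transcript: str, source_lang: str, target_lang: str) -> str:
    """Compose a prompt that first presents the ASR hypothesis and then asks the
    model to translate the spoken input."""
    return (
        f"The following is an automatic transcript of {source_lang} speech:\n"
        f"{asr_transcript}\n"
        f"Using both the speech and the transcript above, translate the utterance "
        f"into {target_lang}:"
    )

prompt = build_ast_prompt(
    asr_transcript="bonjour tout le monde",  # hypothetical ASR output
    source_lang="French",
    target_lang="English",
)
print(prompt)
# In a Speech-LLM, this text prompt would be fed to the model alongside the
# encoder's speech embeddings, so the translation is conditioned on both signals.
```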
- Investigating Decoder-only Large Language Models for Speech-to-text Translation [39.17113782374464]
Large language models (LLMs) are known for their exceptional reasoning capabilities, generalizability, and fluency across diverse domains.
We propose a decoder-only architecture that enables the LLM to directly consume the encoded speech representation and generate the text translation.
Our model achieves state-of-the-art performance on CoVoST 2 and FLEURS among models trained without proprietary data.
arXiv Detail & Related papers (2024-07-03T14:42:49Z)
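The PyTorch sketch below illustrates the general decoder-only pattern described in the entry above: encoded speech is projected into the LLM's embedding space and prepended to the embedded text prompt. The dimensions are placeholders and a single, non-causal transformer layer stands in for the LLM, so this is an architectural illustration rather than the paper's model.

```python
import torch
import torch.nn as nn

class SpeechPrefixLM(nn.Module):
    def __init__(self, vocab_size=32000, d_speech=512, d_model=1024):
        super().__init__()
        self.speech_proj = nn.Linear(d_speech, d_model)     # adapter: speech -> LLM space
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # stand-in for the LLM embeddings
        # One TransformerEncoder layer stands in for the decoder-only LLM stack;
        # causal masking is omitted to keep the sketch short.
        self.backbone = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, speech_feats, prompt_ids):
        prefix = self.speech_proj(speech_feats)             # (B, T_speech, d_model)
        text = self.tok_embed(prompt_ids)                   # (B, T_text, d_model)
        hidden = self.backbone(torch.cat([prefix, text], dim=1))
        return self.lm_head(hidden[:, prefix.size(1):])     # logits over the text positions

model = SpeechPrefixLM()
speech_feats = torch.randn(2, 50, 512)          # batch of encoded speech representations
prompt_ids = torch.randint(0, 32000, (2, 10))   # tokenized text prompt
print(model(speech_feats, prompt_ids).shape)    # torch.Size([2, 10, 32000])
```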
- A Text-to-Text Model for Multilingual Offensive Language Identification [19.23565690468299]
This study presents the first pre-trained encoder-decoder model for offensive language identification, built on text-to-text transformers (T5).
Our pre-trained T5 model outperforms other transformer-based models fine-tuned for offensive language detection, such as fBERT and HateBERT, in multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained model for offensive language identification using mT5.
arXiv Detail & Related papers (2023-12-06T09:37:27Z)
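As an illustration of the text-to-text formulation in the entry above, the snippet below frames offensive-language identification as generating a short label string with a T5 checkpoint. `t5-small` is only a placeholder base model and the prompt/label verbalization is assumed; the paper's fine-tuned checkpoints are not referenced here.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "you are all wonderful people"
inputs = tokenizer(f"classify offensive language: {text}", return_tensors="pt")

# During fine-tuning, the target sequence is simply the label text, e.g.
# "offensive" or "not offensive"; at inference the model generates it.
labels = tokenizer("not offensive", return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss         # seq2seq cross-entropy on the label text
generated = model.generate(**inputs, max_new_tokens=5)
print(loss.item(), tokenizer.decode(generated[0], skip_special_tokens=True))
```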
- VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text.
We first convert all the speech utterances to discrete tokens using an offline neural encoder.
We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks.
arXiv Detail & Related papers (2023-05-25T14:39:47Z)
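The following is a schematic of how a single token stream with task and language ID tokens might be laid out, in the spirit of the summary above; the special tokens, offsets, and ordering are assumptions rather than VioLA's actual configuration.

```python
SPECIAL = {"<asr>": 0, "<ast>": 1, "<tts>": 2,    # task ID (TID) tokens
           "<en>": 3, "<zh>": 4, "<de>": 5,       # language ID (LID) tokens
           "<sep>": 6}
CODEC_OFFSET = 100   # discrete speech-codec units start after the special/text ids (assumed)
TEXT_OFFSET = 10     # text subword ids start here (assumed)

def build_sequence(task, src_lang, tgt_lang, speech_units, target_text_ids):
    """Concatenate TID + LID tokens, codec units for the input speech, and the
    text tokens the autoregressive decoder should continue with."""
    prefix = [SPECIAL[task], SPECIAL[src_lang], SPECIAL[tgt_lang]]
    speech = [CODEC_OFFSET + u for u in speech_units]
    text = [TEXT_OFFSET + t for t in target_text_ids]
    return prefix + speech + [SPECIAL["<sep>"]] + text

# Speech-to-text translation example: Chinese speech units -> English text ids.
seq = build_sequence("<ast>", "<zh>", "<en>",
                     speech_units=[7, 7, 3, 19, 2],   # from an offline neural codec
                     target_text_ids=[1, 4, 4, 8])
print(seq)
```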
- mmT5: Modular Multilingual Pre-Training Solves Source Language Hallucinations [54.42422445568523]
mmT5 is a modular multilingual sequence-to-sequence model.
It disentangles language-specific information from language-agnostic information.
Compared to mT5, mmT5 raises the rate of generating text in the correct language under zero-shot settings from 7% to 99%.
arXiv Detail & Related papers (2023-05-23T16:38:01Z)
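A small PyTorch sketch of the modular idea summarized above: a shared layer plus tiny language-specific bottlenecks, with only the current language's module active. Module size and placement are assumptions, not mmT5's architecture.

```python
import torch
import torch.nn as nn

class ModularLayer(nn.Module):
    def __init__(self, d_model=512, bottleneck=64, languages=("en", "de", "sw")):
        super().__init__()
        self.shared = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        # One tiny bottleneck per language holds the language-specific information.
        self.lang_modules = nn.ModuleDict({
            lang: nn.Sequential(nn.Linear(d_model, bottleneck),
                                nn.ReLU(),
                                nn.Linear(bottleneck, d_model))
            for lang in languages
        })

    def forward(self, x, lang):
        shared = self.shared(x)                          # language-agnostic computation
        return shared + self.lang_modules[lang](shared)  # add the language-specific part

layer = ModularLayer()
x = torch.randn(2, 16, 512)
print(layer(x, "de").shape)   # torch.Size([2, 16, 512])
# At generation time, routing through the target language's module is one way to
# keep the output in the intended language instead of drifting to the source.
```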
- Extrapolating Multilingual Understanding Models as Multilingual Generators [82.1355802012414]
This paper explores methods to equip multilingual understanding models with generation abilities, yielding a unified model.
We propose a Semantic-Guided Alignment-then-Denoising (SGA) approach to adapt an encoder to a multilingual generator with a small number of new parameters.
arXiv Detail & Related papers (2023-05-22T15:33:21Z)
- Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages [76.95115818308918]
We introduce the Universal Speech Model (USM), a single large model that performs automatic speech recognition (ASR) across 100+ languages.
This is achieved by pre-training the encoder of the model on a large unlabeled multilingual dataset of 12 million (M) hours spanning over 300 languages.
We use multilingual pre-training with random-projection quantization and speech-text modality matching to achieve state-of-the-art performance on downstream multilingual ASR and speech-to-text translation tasks.
arXiv Detail & Related papers (2023-03-02T07:47:18Z)
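The random-projection quantization mentioned in the USM entry above can be illustrated with a short sketch: a frozen random matrix projects each frame, and the index of the nearest vector in a frozen random codebook becomes the discrete pre-training target. The dimensions and the absence of any feature normalization are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D_FEAT, D_PROJ, CODEBOOK = 80, 16, 8192           # assumed dimensions

projection = rng.normal(size=(D_FEAT, D_PROJ))    # fixed, never trained
codebook = rng.normal(size=(CODEBOOK, D_PROJ))    # fixed random codebook

def quantize(features):
    """Project each frame with the frozen random matrix and label it with the
    index of the nearest codebook vector; these indices become BERT-style targets."""
    projected = features @ projection                                              # (T, D_PROJ)
    distances = np.linalg.norm(projected[:, None, :] - codebook[None, :, :], axis=-1)
    return distances.argmin(axis=1)                                                # (T,) targets

frames = rng.normal(size=(100, D_FEAT))   # 100 log-mel frames (placeholder)
targets = quantize(frames)
print(targets[:10])
# The speech encoder is then trained to predict these ids at masked frame
# positions, with no learned quantizer to maintain.
```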
- UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks using a novel training procedure, the pseudo-masked language model (PMLM).
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)