MAESTRO: Matched Speech Text Representations through Modality Matching
- URL: http://arxiv.org/abs/2204.03409v1
- Date: Thu, 7 Apr 2022 12:48:16 GMT
- Title: MAESTRO: Matched Speech Text Representations through Modality Matching
- Authors: Zhehuai Chen, Yu Zhang, Andrew Rosenberg, Bhuvana Ramabhadran, Pedro
Moreno, Ankur Bapna, Heiga Zen
- Abstract summary: Maestro is a self-supervised training method to unify representations learnt from speech and text modalities.
We establish a new state-of-the-art (SOTA) on VoxPopuli multilingual ASR with an 11% relative reduction in Word Error Rate (WER).
We establish a new state-of-the-art (SOTA) on CoVoST 2 with an improvement of 2.8 BLEU averaged over 21 languages.
- Score: 35.566604806335626
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Maestro, a self-supervised training method to unify
representations learnt from speech and text modalities. Self-supervised
learning from speech signals aims to learn the latent structure inherent in the
signal, while self-supervised learning from text attempts to capture lexical
information. Learning aligned representations from unpaired speech and text
sequences is a challenging task. Previous work either implicitly enforced the
representations learnt from these two modalities to be aligned in the latent
space through multitasking and parameter sharing or explicitly through
conversion of modalities via speech synthesis. While the former suffers from
interference between the two modalities, the latter introduces additional
complexity. In this paper, we propose Maestro, a novel algorithm to learn
unified representations from both these modalities simultaneously that can
transfer to diverse downstream tasks such as Automatic Speech Recognition (ASR)
and Speech Translation (ST). Maestro learns unified representations through
sequence alignment, duration prediction and matching embeddings in the learned
space through an aligned masked-language model loss. We establish a new
state-of-the-art (SOTA) on VoxPopuli multilingual ASR with an 11% relative
reduction in Word Error Rate (WER), multidomain SpeechStew ASR (3.7% relative)
and 21 languages to English multilingual ST on CoVoST 2 with an improvement of
2.8 BLEU averaged over 21 languages.
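The abstract describes Maestro's mechanism only at a high level: text representations are brought to the speech frame rate via duration prediction, and the two modalities are pulled together by matching embeddings in a shared space. The NumPy sketch below illustrates just that matching idea; the function names, shapes, and the plain mean-squared-error loss are assumptions for illustration, not the paper's actual implementation (which additionally relies on sequence alignment and an aligned masked-language-model loss).

```python
# Illustrative sketch (not the paper's code): upsample text-token embeddings by
# predicted durations to the speech frame rate, then apply a simple L2 matching
# loss against speech-encoder frames. All names and shapes are assumptions.
import numpy as np

def duration_upsample(text_emb: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each text-token embedding according to its predicted duration."""
    # text_emb: (num_tokens, dim); durations: (num_tokens,) positive integers.
    return np.repeat(text_emb, durations, axis=0)          # (num_frames, dim)

def matching_loss(speech_frames: np.ndarray, text_frames: np.ndarray) -> float:
    """Mean squared error between aligned speech and upsampled text frames."""
    assert speech_frames.shape == text_frames.shape
    return float(np.mean((speech_frames - text_frames) ** 2))

# Toy example: 3 text tokens upsampled to 6 frames in a 4-dimensional space.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 4))
durations = np.array([2, 1, 3])             # from a hypothetical duration model
speech_frames = rng.normal(size=(6, 4))     # from a hypothetical speech encoder
print(matching_loss(speech_frames, duration_upsample(text_emb, durations)))
```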
Related papers
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
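The CTAP entry above mentions two encoders that bring phoneme and speech into a joint multimodal space; such pairing is typically trained with a symmetric contrastive objective. The sketch below is a generic CLIP-style InfoNCE loss over a batch of paired embeddings, given here as an illustrative stand-in rather than the paper's exact objective.

```python
# Generic symmetric InfoNCE over paired (phoneme, speech) embeddings; an
# illustrative stand-in for a contrastive token-acoustic objective.
import numpy as np

def symmetric_info_nce(phoneme_emb, speech_emb, temperature=0.07):
    """Contrastive loss pulling matched pairs together, pushing others apart."""
    p = phoneme_emb / np.linalg.norm(phoneme_emb, axis=1, keepdims=True)
    s = speech_emb / np.linalg.norm(speech_emb, axis=1, keepdims=True)
    logits = p @ s.T / temperature                 # (batch, batch) similarities
    diag = np.arange(len(logits))                  # matched pairs on the diagonal
    def cross_entropy(m):                          # row-wise softmax cross-entropy
        log_softmax = m - np.log(np.exp(m).sum(axis=1, keepdims=True))
        return -log_softmax[diag, diag].mean()
    return float((cross_entropy(logits) + cross_entropy(logits.T)) / 2)

rng = np.random.default_rng(1)
print(symmetric_info_nce(rng.normal(size=(8, 16)), rng.normal(size=(8, 16))))
```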
- ComSL: A Composite Speech-Language Model for End-to-End Speech-to-Text Translation [79.66359274050885]
We present ComSL, a speech-language model built atop a composite architecture of public pretrained speech-only and language-only models.
Our approach has demonstrated effectiveness in end-to-end speech-to-text translation tasks.
arXiv Detail & Related papers (2023-05-24T07:42:15Z)
- Speech-text based multi-modal training with bidirectional attention for improved speech recognition [26.47071418582507]
We propose a novel bidirectional attention mechanism (BiAM) to jointly learn the ASR encoder (bottom layers) and a text encoder with a multi-modal learning method.
BiAM facilitates feature sampling-rate exchange, so that the quality of features transformed from one modality can be measured in the space of the other (see the cross-attention sketch after this entry).
Experimental results on the Librispeech corpus show up to 6.15% word error rate reduction (WERR) with paired data alone, and 9.23% WERR when additional unpaired text data is employed.
arXiv Detail & Related papers (2022-11-01T08:25:11Z)
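A standard way to exchange features between the speech frame rate and the text token rate is cross-attention in both directions. The sketch below is plain scaled dot-product cross-attention in NumPy, used only to illustrate rate exchange; it is not the BiAM implementation.

```python
# Plain scaled dot-product cross-attention: queries at one modality's rate
# attend over the other modality's features, yielding outputs at the query
# rate. Applied in both directions, this exchanges sampling rates; illustrative only.
import numpy as np

def cross_attention(queries: np.ndarray, keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """queries: (Tq, d); keys/values: (Tk, d). Returns (Tq, d)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])       # (Tq, Tk)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)                # row-wise softmax
    return weights @ values

rng = np.random.default_rng(2)
speech = rng.normal(size=(50, 8))   # 50 speech frames, 8-dim features
text = rng.normal(size=(12, 8))     # 12 text tokens, 8-dim embeddings
text_rate_speech = cross_attention(text, speech, speech)   # speech -> text rate
speech_rate_text = cross_attention(speech, text, text)     # text -> speech rate
print(text_rate_speech.shape, speech_rate_text.shape)      # (12, 8) (50, 8)
```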
- Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS) [3.57486761615991]
Unified representations consistently achieve better cross-lingual synthesis with respect to both naturalness and accent.
Separate representations tend to have an order of magnitude more tokens than unified ones, which may affect model capacity.
arXiv Detail & Related papers (2022-07-04T16:14:57Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, with a speedup of up to 21.4x over the autoregressive technique (a generic mask-predict sketch follows this entry).
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
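The TranSpeech entry describes a non-autoregressive decoder that repeatedly masks and predicts unit choices. The loop below is a generic mask-predict decoding sketch in the style of conditional masked language models: predict all masked positions in parallel, re-mask the least confident ones, and repeat. The predictor, vocabulary, and masking schedule are assumptions, not the paper's model.

```python
# Generic mask-predict decoding loop, sketched to illustrate iterative
# mask-and-predict over discrete units; predictor and schedule are assumptions.
import numpy as np

def mask_predict(predict_fn, length: int, vocab_size: int, iterations: int = 4):
    """predict_fn(tokens) -> (length, vocab_size) per-position probabilities."""
    MASK = vocab_size                      # reserve an extra id for the mask token
    tokens = np.full(length, MASK)
    confidence = np.zeros(length)
    for t in range(iterations):
        masked = tokens == MASK
        probs = predict_fn(tokens)                     # parallel prediction
        tokens[masked] = probs[masked].argmax(axis=1)  # fill only masked slots
        confidence[masked] = probs[masked].max(axis=1)
        num_mask = int(length * (iterations - 1 - t) / iterations)
        if num_mask > 0:                   # re-mask the least confident positions
            remask = np.argsort(confidence)[:num_mask]
            tokens[remask] = MASK
            confidence[remask] = 0.0
    return tokens

# Toy "model": random unit probabilities for 10 positions over 100 units.
rng = np.random.default_rng(3)
print(mask_predict(lambda toks: rng.random((10, 100)), length=10, vocab_size=100))
```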
- Unified Speech-Text Pre-training for Speech Translation and Recognition [113.31415771943162]
We describe a method to jointly pre-train speech and text in an encoder-decoder modeling framework for speech translation and recognition.
The proposed method incorporates four self-supervised and supervised subtasks for cross-modality learning.
It achieves between 1.7 and 2.3 BLEU improvement over the state of the art on the MuST-C speech translation dataset.
arXiv Detail & Related papers (2022-04-11T20:59:51Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
- A General Multi-Task Learning Framework to Leverage Text Data for Speech to Text Tasks [36.216979991706594]
We propose a general multi-task learning framework to leverage text data for automatic speech recognition (ASR) and speech translation (ST) tasks.
We demonstrate that representing text input as phoneme sequences can reduce the difference between speech and text inputs, and enhance knowledge transfer from text corpora to speech-to-text tasks.
arXiv Detail & Related papers (2020-10-21T22:40:43Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)