MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation
- URL: http://arxiv.org/abs/2010.11445v2
- Date: Mon, 8 Feb 2021 20:36:39 GMT
- Title: MAM: Masked Acoustic Modeling for End-to-End Speech-to-Text Translation
- Authors: Junkun Chen, Mingbo Ma, Renjie Zheng, Liang Huang
- Abstract summary: We propose a technique to learn a robust speech encoder in a self-supervised fashion only on the speech side.
This technique, termed Masked Acoustic Modeling (MAM), not only provides an alternative solution for improving E2E-ST but can also perform pre-training on any acoustic signals.
In the setting without any transcriptions, our technique achieves an average improvement of +1.1 BLEU, rising to +2.3 BLEU with MAM pre-training.
- Score: 27.19320167337675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end Speech-to-text Translation (E2E-ST), which directly translates
source language speech to target language text, is widely useful in practice,
but traditional cascaded approaches (ASR+MT) often suffer from error
propagation in the pipeline. On the other hand, existing end-to-end solutions
heavily depend on the source language transcriptions for pre-training or
multi-task training with Automatic Speech Recognition (ASR). We instead propose
a simple technique to learn a robust speech encoder in a self-supervised
fashion only on the speech side, which can utilize speech data without
transcription. This technique, termed Masked Acoustic Modeling (MAM), not only
provides an alternative solution for improving E2E-ST but can also perform
pre-training on any acoustic signals (including non-speech ones) without
annotation. We conduct our experiments over 8 different translation directions.
In the setting without any transcriptions, our technique achieves an average
improvement of +1.1 BLEU, rising to +2.3 BLEU with MAM pre-training.
Pre-training MAM on arbitrary acoustic signals also yields an average
improvement of +1.6 BLEU for those languages. Compared with the ASR multi-task
learning solution, which relies on transcriptions during training, our
pre-trained MAM model achieves similar accuracy without using any
transcription.
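The masked-reconstruction idea at the heart of MAM is easy to sketch: corrupt random spans of the input spectrogram, encode the corrupted sequence, and train the encoder to predict the original frames at the masked positions. Below is a minimal PyTorch sketch of this objective; the masking ratio, span length, zero-fill corruption, L1 loss, and model sizes are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_spans(spec, mask_ratio=0.15, span=10):
    """Zero out random contiguous spans of frames.
    spec: (batch, time, n_mels). Returns corrupted spec and a boolean mask."""
    B, T, _ = spec.shape
    mask = torch.zeros(B, T, dtype=torch.bool, device=spec.device)
    n_spans = max(1, int(T * mask_ratio / span))
    for b in range(B):
        for _ in range(n_spans):
            start = torch.randint(0, max(1, T - span), (1,)).item()
            mask[b, start:start + span] = True
    corrupted = spec.masked_fill(mask.unsqueeze(-1), 0.0)
    return corrupted, mask

class MAMEncoder(nn.Module):
    """Speech encoder with a frame-reconstruction head (a sketch)."""
    def __init__(self, n_mels=80, d_model=256, n_layers=6):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.recon = nn.Linear(d_model, n_mels)  # predicts the original frames

    def forward(self, spec):
        h = self.encoder(self.proj(spec))
        return self.recon(h), h  # reconstruction + contextual features

def mam_loss(model, spec):
    corrupted, mask = mask_spans(spec)
    pred, _ = model(corrupted)
    # the reconstruction loss is computed on masked positions only
    return F.l1_loss(pred[mask], spec[mask])
```

In the pre-training setting this loss is optimized alone on unlabeled audio; in the multi-task setting it would be added to the translation loss so the encoder keeps learning from the speech side during E2E-ST training.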
Related papers
- Acquiring Pronunciation Knowledge from Transcribed Speech Audio via Multi-task Learning [23.907448315388294]
We propose an alternative method to leverage transcribed speech audio as an additional training source, based on multi-task learning (MTL).
Experiments show that, compared to a baseline MTL-based method, the proposed MTL-based method reduces PER from 2.5% to 1.6% for those word types covered exclusively in transcribed speech audio.
arXiv Detail & Related papers (2024-09-15T23:00:54Z)
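The entry above names the mechanism without detail; the usual MTL pattern is a shared encoder with one head per task and a weighted sum of losses. A generic sketch follows (the layer sizes, task heads, and auxiliary weight are placeholders, not the paper's design):

```python
import torch.nn as nn

class SharedEncoderMTL(nn.Module):
    """Generic multi-task pattern: one shared encoder, one head per task.
    Dimensions and the auxiliary weight below are illustrative placeholders."""
    def __init__(self, d_in=80, d_model=256, n_primary=64, n_aux=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(d_in, d_model), nn.ReLU())
        self.primary_head = nn.Linear(d_model, n_primary)  # main task (e.g. pronunciation)
        self.aux_head = nn.Linear(d_model, n_aux)          # auxiliary task on extra data

    def forward(self, x):
        h = self.shared(x)
        return self.primary_head(h), self.aux_head(h)

def mtl_loss(primary_loss, aux_loss, aux_weight=0.3):
    # the auxiliary task regularizes the shared encoder
    return primary_loss + aux_weight * aux_loss
```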
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
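A joint speech-phoneme space of this kind is typically trained with a symmetric contrastive (InfoNCE) objective over paired embeddings from the two encoders. A minimal sketch follows; it operates on pooled pair-level embeddings for brevity (CTAP itself works at the frame level), and the temperature is an illustrative value.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.
    speech_emb, phoneme_emb: (batch, dim); row i of each is a matched pair."""
    s = F.normalize(speech_emb, dim=-1)
    p = F.normalize(phoneme_emb, dim=-1)
    logits = s @ p.t() / temperature      # (batch, batch) cosine similarities
    targets = torch.arange(len(s))        # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.t(), targets)) / 2
```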
- Monolingual Recognizers Fusion for Code-switching Speech Recognition [43.38810173824711]
We propose a monolingual-recognizer fusion method for code-switching (CS) ASR.
It has two stages: the speech awareness stage and the language fusion stage.
Experiments on a Mandarin-English corpus show the effectiveness of the proposed method.
arXiv Detail & Related papers (2022-11-02T11:24:26Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
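"Repeatedly masks and predicts unit choices" refers to mask-predict style non-autoregressive decoding: start with all target units masked, predict them all in parallel, keep the most confident predictions, and re-mask the rest for the next round. A simplified sketch, assuming a hypothetical model(src, units) that returns per-position logits over the unit vocabulary:

```python
import torch

def mask_predict_decode(model, src, tgt_len, n_iters=4, mask_id=0):
    """Iterative mask-predict over discrete units (a simplified sketch).
    The model signature, mask_id, n_iters, and the linear unmasking
    schedule are illustrative assumptions."""
    units = torch.full((tgt_len,), mask_id, dtype=torch.long)
    for t in range(n_iters):
        logits = model(src, units)                  # (tgt_len, vocab)
        probs, preds = logits.softmax(-1).max(-1)   # per-position confidence, argmax
        n_mask = int(tgt_len * (n_iters - t - 1) / n_iters)  # unmask more each round
        units = preds.clone()
        if n_mask > 0:
            lowest = probs.topk(n_mask, largest=False).indices
            units[lowest] = mask_id                 # re-mask least-confident positions
    return units
```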
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
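The "pseudo language" is induced by discretizing continuous speech features into unit sequences that can serve as targets for a self-supervised pseudo speech recognition task. A generic quantize-and-deduplicate sketch (the cluster count and plain k-means are illustrative; Wav2Seq additionally builds subwords over the unit sequences):

```python
import numpy as np
from itertools import groupby
from sklearn.cluster import KMeans

def induce_pseudo_language(features, n_units=25):
    """Turn continuous speech features into discrete pseudo-token sequences.
    features: list of (time, dim) arrays. n_units is an illustrative choice."""
    km = KMeans(n_clusters=n_units, n_init=10).fit(np.concatenate(features))
    pseudo = []
    for f in features:
        units = km.predict(f)                          # frame-level cluster ids
        pseudo.append([u for u, _ in groupby(units)])  # collapse repeated units
    return pseudo  # targets for a self-supervised pseudo-ASR task
```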
- Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation [76.13334392868208]
Direct speech-to-speech translation (S2ST) models suffer from data scarcity issues.
In this work, we explore self-supervised pre-training with unlabeled speech data and data augmentation to tackle this issue.
arXiv Detail & Related papers (2022-04-06T17:59:22Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
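One way to picture this fusion is to feed wav2vec 2.0 frame representations through a small adapter into BERT via its inputs_embeds interface, then decode with a CTC head. This is a hedged sketch using Hugging Face transformers; the checkpoint names, linear adapter, and CTC head are assumptions, not the paper's exact architecture.

```python
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class FusedASR(nn.Module):
    """Sketch: pre-trained acoustic encoder + pre-trained linguistic encoder.
    The adapter and CTC head are illustrative additions."""
    def __init__(self, vocab_size=32):
        super().__init__()
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.linguistic = BertModel.from_pretrained("bert-base-uncased")
        # both base models use 768-dim hidden states; the adapter bridges them
        self.adapter = nn.Linear(768, 768)
        self.ctc_head = nn.Linear(768, vocab_size)

    def forward(self, waveform):                # waveform: (batch, samples)
        speech = self.acoustic(waveform).last_hidden_state
        fused = self.linguistic(inputs_embeds=self.adapter(speech)).last_hidden_state
        return self.ctc_head(fused)             # per-frame logits, e.g. for CTC
```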