Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained
Models into Speech Translation Encoders
- URL: http://arxiv.org/abs/2105.05752v1
- Date: Wed, 12 May 2021 16:09:53 GMT
- Title: Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained
Models into Speech Translation Encoders
- Authors: Chen Xu, Bojie Hu, Yanyang Li, Yuhao Zhang, Shen Huang, Qi Ju, Tong
Xiao, Jingbo Zhu
- Abstract summary: Speech-to-translation data is scarce; pre-training is promising in end-to-end Speech Translation.
We propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation.
Our encoder begins with processing the acoustic sequence as usual, but later behaves more like an MT encoder for a global representation of the input sequence.
- Score: 30.160261563657947
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Encoder pre-training is promising in end-to-end Speech Translation (ST),
given the fact that speech-to-translation data is scarce. But ST encoders are
not simple instances of Automatic Speech Recognition (ASR) or Machine
Translation (MT) encoders. For example, we find ASR encoders lack the global
context representation, which is necessary for translation, whereas MT encoders
are not designed to deal with long but locally attentive acoustic sequences. In
this work, we propose a Stacked Acoustic-and-Textual Encoding (SATE) method for
speech translation. Our encoder begins with processing the acoustic sequence as
usual, but later behaves more like an MT encoder for a global representation of
the input sequence. In this way, it is straightforward to incorporate the
pre-trained models into the system. Also, we develop an adaptor module to
alleviate the representation inconsistency between the pre-trained ASR encoder
and MT encoder, and a multi-teacher knowledge distillation method to preserve
the pre-training knowledge. Experimental results on the LibriSpeech En-Fr and
MuST-C En-De show that our method achieves the state-of-the-art performance of
18.3 and 25.2 BLEU points. To our knowledge, we are the first to develop an
end-to-end ST system that achieves comparable or even better BLEU performance
than the cascaded ST counterpart when large-scale ASR and MT data is available.
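As a rough illustration of the stacked encoding described in the abstract, the sketch below runs a pre-trained ASR encoder first, passes its output through an adaptor, and then applies a pre-trained MT encoder; a simple multi-teacher distillation term is included as well. This is a minimal sketch under stated assumptions: the module names, the linear-projection adaptor, and the KL-based distillation loss are placeholders for illustration, not the authors' released implementation.

```python
# Minimal sketch, assuming generic PyTorch modules for the pre-trained encoders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adaptor(nn.Module):
    """Bridges ASR-encoder outputs to the MT encoder's input space.

    The abstract only states that an adaptor alleviates representation
    inconsistency; a linear projection plus layer normalization is a
    placeholder choice here.
    """

    def __init__(self, asr_dim: int, mt_dim: int):
        super().__init__()
        self.proj = nn.Linear(asr_dim, mt_dim)
        self.norm = nn.LayerNorm(mt_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.proj(x))


class StackedEncoder(nn.Module):
    """Acoustic encoding first, then MT-style encoding of the adapted sequence."""

    def __init__(self, asr_encoder: nn.Module, mt_encoder: nn.Module,
                 asr_dim: int, mt_dim: int):
        super().__init__()
        self.asr_encoder = asr_encoder   # pre-trained ASR encoder (local, acoustic)
        self.adaptor = Adaptor(asr_dim, mt_dim)
        self.mt_encoder = mt_encoder     # pre-trained MT encoder (global, textual)

    def forward(self, speech_features: torch.Tensor) -> torch.Tensor:
        acoustic = self.asr_encoder(speech_features)   # (batch, frames, asr_dim)
        adapted = self.adaptor(acoustic)               # (batch, frames, mt_dim)
        return self.mt_encoder(adapted)                # global representation


def multi_teacher_kd_loss(student_logits, teacher_logits_list, temperature=1.0):
    """Hypothetical multi-teacher distillation term: average KL divergence from
    each frozen teacher (e.g. the ASR and MT models) to the ST student."""
    loss = 0.0
    for teacher_logits in teacher_logits_list:
        loss = loss + F.kl_div(
            F.log_softmax(student_logits / temperature, dim=-1),
            F.softmax(teacher_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
    return loss / len(teacher_logits_list)
```

In this reading, the pre-trained components can be dropped in unchanged, while the adaptor and the distillation term are the glue that keeps their representations and knowledge intact during fine-tuning.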
Related papers
- Alignment-Free Training for Transducer-based Multi-Talker ASR [55.1234384771616]
Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation.
We propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture.
arXiv Detail & Related papers (2024-09-30T13:58:11Z)
- Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks [28.440232737011453]
We propose a solution combining the Transducer and the Attention-based Encoder-Decoder (TAED) for speech-to-text tasks.
The new method leverages the AED's strength in non-monotonic sequence-to-sequence learning while retaining the Transducer's streaming property.
We evaluate the proposed approach on the MuST-C dataset, and the findings demonstrate that TAED performs significantly better than the Transducer for offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks.
arXiv Detail & Related papers (2023-05-04T18:34:50Z)
- Linguistic-Enhanced Transformer with CTC Embedding for Speech Recognition [29.1423215212174]
The recent emergence of the joint CTC-Attention model shows significant improvement in automatic speech recognition (ASR).
We propose a linguistic-enhanced Transformer, which introduces refined CTC information to the decoder during training.
Experiments on the AISHELL-1 speech corpus show that the character error rate (CER) is reduced by up to 7% relative.
arXiv Detail & Related papers (2022-10-25T08:12:59Z)
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- M-Adapter: Modality Adaptation for End-to-End Speech-to-Text Translation [66.92823764664206]
We propose M-Adapter, a novel Transformer-based module, to adapt speech representations to text.
While shrinking the speech sequence, M-Adapter produces features suited to speech-to-text translation (a rough sketch of such a length-shrinking module follows this list).
Our experimental results show that our model outperforms a strong baseline by up to 1 BLEU.
arXiv Detail & Related papers (2022-07-03T04:26:53Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C reduces the word error rate (WER) by 19.2% relative over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast multi-decoder (MD) model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs, followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding than the naive MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- ConvFiT: Conversational Fine-Tuning of Pretrained Language Models [42.7160113690317]
Transformer-based language models (LMs) pretrained on large text collections have been shown to store a wealth of semantic knowledge.
We propose ConvFiT, a simple and efficient two-stage procedure which turns any pretrained LM into a universal conversational encoder.
arXiv Detail & Related papers (2021-09-21T12:16:56Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
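As referenced in the M-Adapter entry above, the following is a minimal sketch of a length-shrinking adaptor that downsamples a long speech representation along the time axis before a text-oriented encoder consumes it. The strided-convolution design, dimensions, and shrink factor are assumptions for illustration, not the exact architecture of M-Adapter or of the SATE adaptor.

```python
# Minimal sketch, assuming (batch, frames, dim) speech features as input.
import torch
import torch.nn as nn


class LengthShrinkingAdapter(nn.Module):
    """Shrinks the time axis by 4x with two strided 1-D convolutions."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.conv1 = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.conv2 = nn.Conv1d(dim, dim, kernel_size=3, stride=2, padding=1)
        self.act = nn.GELU()
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) -> (batch, frames / 4, dim)
        x = x.transpose(1, 2)            # (batch, dim, frames) for Conv1d
        x = self.act(self.conv1(x))
        x = self.act(self.conv2(x))
        x = x.transpose(1, 2)
        return self.norm(x)


if __name__ == "__main__":
    speech = torch.randn(2, 800, 512)    # 800 frames of 512-dim features
    shrunk = LengthShrinkingAdapter()(speech)
    print(shrunk.shape)                  # torch.Size([2, 200, 512])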