Related papers: Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition

URL: http://arxiv.org/abs/2101.06699v2
Date: Sun, 24 Jan 2021 13:27:57 GMT
Title: Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition
Authors: Cheng Yi, Shiyu Zhou, Bo Xu
Abstract summary: In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. Our model achieves better recognition performance on CALLHOME corpus (15 hours) than other end-to-end models.
Score: 9.732767611907068
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: End-to-end models have achieved impressive results on the task of automatic speech recognition (ASR). For low-resource ASR tasks, however, labeled data can hardly satisfy the demand of end-to-end models. Self-supervised acoustic pre-training has already shown its amazing ASR performance, while the transcription is still inadequate for language modeling in end-to-end models. In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model. The fused model only needs to learn the transfer from speech to language during fine-tuning on limited labeled data. The length of the two modalities is matched by a monotonic attention mechanism without additional parameters. Besides, a fully connected layer is introduced for the hidden mapping between modalities. We further propose a scheduled fine-tuning strategy to preserve and utilize the text context modeling ability of the pre-trained linguistic encoder. Experiments show that the model makes effective use of the pre-trained modules. Our model achieves better recognition performance on CALLHOME corpus (15 hours) than other end-to-end models.

Related papers

Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition [12.77573161345651]
This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR. The proposed model enables the optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling.
arXiv Detail & Related papers (2023-12-06T18:34:42Z)
Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model. Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences. We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder. Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data. We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task. This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model. Inspired by the complementarity of speech-PseudoLabel pair and SynthesizedAudio-text pair, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
Knowledge Transfer from Large-scale Pretrained Language Models to End-to-end Speech Recognizers [13.372686722688325]
Training of end-to-end speech recognizers always requires transcribed utterances. This paper proposes a method for alleviating this issue by transferring knowledge from a language model neural network that can be pretrained with text-only data.
arXiv Detail & Related papers (2022-02-16T07:02:24Z)
Speech Summarization using Restricted Self-Attention [79.89680891246827]
We introduce a single model optimized end-to-end for speech summarization. We demonstrate that the proposed model learns to directly summarize speech for the How-2 corpus of instructional videos.
arXiv Detail & Related papers (2021-10-12T18:21:23Z)
Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech. Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT. In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time. We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech. Experiments show that with limited data far less than needed for training a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z)
Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.