Improving Hybrid CTC/Attention End-to-end Speech Recognition with
Pretrained Acoustic and Language Model
- URL: http://arxiv.org/abs/2112.07254v1
- Date: Tue, 14 Dec 2021 09:38:31 GMT
- Title: Improving Hybrid CTC/Attention End-to-end Speech Recognition with
Pretrained Acoustic and Language Model
- Authors: Keqi Deng, Songjun Cao, Yike Zhang, Long Ma
- Abstract summary: We propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models.
To the best of our knowledge, this is the first work to utilize both pretrained AM and LM in an S2S ASR system.
- Score: 4.490054848527943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, self-supervised pretraining has achieved impressive results in
end-to-end (E2E) automatic speech recognition (ASR). However, the dominant
sequence-to-sequence (S2S) E2E model still struggles to fully exploit
self-supervised pretraining methods because its decoder is conditioned on
acoustic representations and thus cannot be pretrained separately. In this paper, we
propose a pretrained Transformer (Preformer) S2S ASR architecture based on
hybrid CTC/attention E2E models to fully utilize the pretrained acoustic models
(AMs) and language models (LMs). In our framework, the encoder is initialized
with a pretrained AM (wav2vec2.0). The Preformer leverages CTC as an auxiliary
task during training and inference. Furthermore, we design a one-cross decoder
(OCD), which relaxes the dependence on acoustic representations so that it can
be initialized with a pretrained LM (DistilGPT2). Experiments are conducted on
the AISHELL-1 corpus and achieve a $4.6\%$ character error rate (CER) on the
test set. Compared with our vanilla hybrid CTC/attention Transformer baseline,
our proposed CTC/attention-based Preformer yields $27\%$ relative CER
reduction. To the best of our knowledge, this is the first work to utilize both
pretrained AM and LM in an S2S ASR system.
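
To make the training objective concrete, below is a minimal, hypothetical PyTorch sketch of a hybrid CTC/attention loss of the kind the Preformer builds on: a shared encoder feeds both a CTC branch and an attention decoder, and the two losses are interpolated. The module names, sizes, and the `ctc_weight` value are illustrative assumptions; in the paper the encoder would be initialized from a pretrained AM (wav2vec2.0) and the decoder, via the proposed one-cross decoder, from a pretrained LM (DistilGPT2).

```python
# Minimal sketch of a hybrid CTC/attention objective (illustrative only;
# not the authors' implementation). The encoder stands in for a pretrained
# AM such as wav2vec2.0, the decoder for an LM-initialized one-cross decoder.
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridCTCAttention(nn.Module):
    def __init__(self, vocab_size, d_model=256, ctc_weight=0.3):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.ctc_head = nn.Linear(d_model, vocab_size)   # frame-level CTC branch
        self.att_head = nn.Linear(d_model, vocab_size)   # token-level attention branch
        self.ctc_weight = ctc_weight

    def forward(self, feats, feat_lens, tokens, token_lens):
        enc = self.encoder(feats)                                   # (B, T, D)
        # CTC branch: per-frame log-probabilities, blank index 0.
        ctc_logp = self.ctc_head(enc).log_softmax(-1)
        ctc_loss = F.ctc_loss(ctc_logp.transpose(0, 1), tokens,
                              feat_lens, token_lens,
                              blank=0, zero_infinity=True)
        # Attention branch: causal decoder cross-attending to the encoder
        # (target shifting is omitted to keep the sketch short).
        seq = tokens.size(1)
        mask = torch.triu(torch.full((seq, seq), float("-inf")), diagonal=1)
        dec = self.decoder(self.embed(tokens), enc, tgt_mask=mask)
        att_logits = self.att_head(dec)
        att_loss = F.cross_entropy(att_logits.reshape(-1, att_logits.size(-1)),
                                   tokens.reshape(-1))
        # Joint objective: w * L_ctc + (1 - w) * L_attention.
        return self.ctc_weight * ctc_loss + (1 - self.ctc_weight) * att_loss

# Toy usage with random features and targets.
model = HybridCTCAttention(vocab_size=50)
feats = torch.randn(2, 100, 256)                 # (batch, frames, feature dim)
tokens = torch.randint(1, 50, (2, 12))           # non-blank target token IDs
loss = model(feats, torch.tensor([100, 80]), tokens, torch.tensor([12, 10]))
```

The same joint objective is used at inference time in hybrid CTC/attention systems by combining CTC and attention scores during beam search.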
Related papers
- Transfer Learning from Pre-trained Language Models Improves End-to-End
Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
- Pre-training for Speech Translation: CTC Meets Optimal Transport [29.807861658249923]
We show that the connectionist temporal classification (CTC) loss can reduce the modality gap by design.
We propose a novel pre-training method combining CTC and optimal transport to further reduce this gap.
Our method pre-trains a Siamese-like model composed of two encoders, one for acoustic inputs and the other for textual inputs, such that they produce representations that are close to each other in the Wasserstein space.
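
As a rough illustration of the optimal-transport component, the sketch below computes an entropy-regularized Wasserstein (Sinkhorn) distance between speech-side and text-side embedding sequences under uniform marginals; in the paper this kind of term would be combined with a CTC loss on the acoustic encoder. The function name, the cost normalization, and all hyper-parameters are assumptions made for the example.

```python
# Illustrative Sinkhorn-based optimal-transport loss between two embedding
# sequences (not the paper's exact formulation).
import torch

def sinkhorn_ot_loss(speech_emb, text_emb, eps=0.1, n_iters=50):
    """speech_emb: (n, d), text_emb: (m, d); returns a scalar OT cost."""
    cost = torch.cdist(speech_emb, text_emb, p=2) ** 2   # pairwise squared distances
    cost = cost / cost.max()                             # normalize for numerical stability
    K = torch.exp(-cost / eps)                           # Gibbs kernel
    a = torch.full((speech_emb.size(0),), 1.0 / speech_emb.size(0))  # uniform marginal (speech)
    b = torch.full((text_emb.size(0),), 1.0 / text_emb.size(0))      # uniform marginal (text)
    u = torch.ones_like(a)
    for _ in range(n_iters):                             # Sinkhorn fixed-point iterations
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)           # approximate transport plan
    return (plan * cost).sum()                           # expected transport cost

# Toy usage: outputs of an acoustic encoder and a textual encoder.
speech_emb = torch.randn(120, 64)    # e.g. frame-level speech representations
text_emb = torch.randn(30, 64)       # e.g. token-level text representations
ot_loss = sinkhorn_ot_loss(speech_emb, text_emb)
# total_loss = ctc_loss + ot_weight * ot_loss   (CTC part not shown here)
```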
arXiv Detail & Related papers (2023-01-27T14:03:09Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both modalities allows speech to be recognized more accurately in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
Self-supervised learning (SSL) has advanced rapidly in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
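
A toy sketch of the random-projection tokenizer idea used in the first iteration is shown below: frames are projected by a fixed random matrix and mapped to the nearest entry of a fixed random codebook, yielding discrete labels for masked prediction. All shapes, sizes, and names here are illustrative assumptions, not values from the paper.

```python
# Toy random-projection acoustic tokenizer (illustrative shapes and sizes).
import torch

torch.manual_seed(0)
feat_dim, proj_dim, codebook_size = 80, 16, 300

projection = torch.randn(feat_dim, proj_dim)      # fixed random projection matrix
codebook = torch.randn(codebook_size, proj_dim)   # fixed random codebook

def tokenize(frames):
    """Map (T, feat_dim) acoustic frames to (T,) discrete token IDs."""
    z = frames @ projection                        # project each frame
    dists = torch.cdist(z, codebook)               # distance to every codeword
    return dists.argmin(dim=-1)                    # nearest-codeword index

frames = torch.randn(200, feat_dim)                # e.g. a sequence of log-mel frames
labels = tokenize(frames)                          # discrete targets for masked prediction
```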
arXiv Detail & Related papers (2022-12-18T10:41:55Z)
- The THUEE System Description for the IARPA OpenASR21 Challenge [12.458730613670316]
This paper describes the THUEE team's speech recognition system for the IARPA Open Automatic Speech Recognition Challenge (OpenASR21).
We achieve outstanding results under both the Constrained and Constrained-plus training conditions.
We find that the feature extractor plays an important role when applying the wav2vec2.0 pre-trained model to the encoder-decoder based CTC/Attention ASR architecture.
arXiv Detail & Related papers (2022-06-29T14:03:05Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition [9.732767611907068]
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
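
A rough sketch of how such a fusion might be wired with Hugging Face `transformers` is shown below: the BERT encoder is configured with (newly initialized) cross-attention so its layers can attend to wav2vec2.0 representations. The checkpoint names and the bridging strategy are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical fusion of a pretrained acoustic encoder and a pretrained
# linguistic encoder via cross-attention (illustrative, not the paper's model).
import torch
from transformers import Wav2Vec2Model, BertModel, BertTokenizer

acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
# Configuring BERT as a decoder adds (randomly initialized) cross-attention
# layers that can attend to the acoustic hidden states (both are 768-dim).
linguistic = BertModel.from_pretrained(
    "bert-base-uncased", is_decoder=True, add_cross_attention=True)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

waveform = torch.randn(1, 16000)                            # 1 s of 16 kHz audio
speech_states = acoustic(waveform).last_hidden_state        # (1, T', 768)

text = tokenizer("hello world", return_tensors="pt")
fused = linguistic(input_ids=text["input_ids"],
                   attention_mask=text["attention_mask"],
                   encoder_hidden_states=speech_states)
# fused.last_hidden_state mixes linguistic and acoustic information and could
# feed a token classifier trained with an ASR objective.
```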
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
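
To illustrate the deliberation idea, here is a small, hypothetical PyTorch sketch in which a second-pass decoder state attends both to acoustic encodings and to a bidirectional encoding of the first-pass hypothesis; module names and sizes are invented for the example and do not come from the paper.

```python
# Simplified two-pass deliberation sketch (illustrative modules and sizes).
import torch
import torch.nn as nn

d_model, vocab = 256, 100

# Bidirectional encoder over first-pass hypothesis tokens.
hyp_embed = nn.Embedding(vocab, d_model)
hyp_encoder = nn.LSTM(d_model, d_model // 2, batch_first=True, bidirectional=True)

# Second-pass decoder attends to two memories: acoustics and the hypothesis.
attend_acoustic = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
attend_hypothesis = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
out_proj = nn.Linear(2 * d_model, vocab)

acoustic_memory = torch.randn(1, 200, d_model)        # first-pass acoustic encodings
first_pass = torch.randint(0, vocab, (1, 15))         # first-pass text hypothesis
hyp_memory, _ = hyp_encoder(hyp_embed(first_pass))    # (1, 15, d_model) bidirectional context

query = torch.randn(1, 15, d_model)                   # second-pass decoder states
ctx_a, _ = attend_acoustic(query, acoustic_memory, acoustic_memory)
ctx_h, _ = attend_hypothesis(query, hyp_memory, hyp_memory)
logits = out_proj(torch.cat([ctx_a, ctx_h], dim=-1))  # rescoring / re-decoding logits
```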
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.