Improving CTC-based speech recognition via knowledge transferring from
pre-trained language models
- URL: http://arxiv.org/abs/2203.03582v1
- Date: Tue, 22 Feb 2022 11:30:55 GMT
- Title: Improving CTC-based speech recognition via knowledge transferring from
pre-trained language models
- Authors: Keqi Deng, Songjun Cao, Yike Zhang, Long Ma, Gaofeng Cheng, Ji Xu,
Pengyuan Zhang
- Abstract summary: We propose two knowledge transferring methods to improve CTC-based models.
The first method is based on representation learning, in which the CTC-based models use the representation produced by BERT as an auxiliary learning target.
The second method is based on joint classification learning, which combines GPT2 for text modeling with a hybrid CTC/attention architecture.
- Score: 30.599901925058873
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, end-to-end automatic speech recognition models based on
connectionist temporal classification (CTC) have achieved impressive results,
especially when fine-tuned from wav2vec2.0 models. Due to the conditional
independence assumption, CTC-based models are always weaker than
attention-based encoder-decoder models and require the assistance of external
language models (LMs). To solve this issue, we propose two knowledge
transferring methods that leverage pre-trained LMs, such as BERT and GPT2, to
improve CTC-based models. The first method is based on representation learning,
in which the CTC-based models use the representation produced by BERT as an
auxiliary learning target. The second method is based on joint classification
learning, which combines GPT2 for text modeling with a hybrid CTC/attention
architecture. Experiments on the AISHELL-1 corpus yield a character error rate
(CER) of 4.2% on the test set. Compared with the vanilla CTC-based models
fine-tuned from wav2vec2.0, our knowledge transferring method reduces the CER
by 16.1% relative, without any external LM.
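The abstract names the two methods but not their form. As a rough illustration only, the PyTorch-style sketch below shows what the first method (using BERT representations as an auxiliary learning target for a CTC model) could look like; the module names, the mean-pooled sentence target, the MSE auxiliary loss, and the weight alpha are assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CTCWithBertTarget(nn.Module):
    """CTC model with BERT representations as an auxiliary learning target (sketch)."""

    def __init__(self, acoustic_encoder, bert, vocab_size, enc_dim=768, bert_dim=768):
        super().__init__()
        self.encoder = acoustic_encoder        # e.g. a wav2vec2.0 encoder (assumption)
        self.bert = bert                       # frozen pre-trained BERT (assumption)
        self.ctc_head = nn.Linear(enc_dim, vocab_size)
        self.proj = nn.Linear(enc_dim, bert_dim)   # map acoustic space to BERT space

    def forward(self, speech, frame_lens, text_ids, targets, target_lens, alpha=0.3):
        h = self.encoder(speech)                               # (B, T, enc_dim)
        log_probs = F.log_softmax(self.ctc_head(h), dim=-1)
        ctc_loss = F.ctc_loss(log_probs.transpose(0, 1), targets,
                              frame_lens, target_lens, blank=0, zero_infinity=True)

        with torch.no_grad():                                  # BERT provides fixed targets
            bert_repr = self.bert(text_ids).last_hidden_state  # (B, L, bert_dim)
            sent_target = bert_repr.mean(dim=1)                # pooled target (assumption)

        # Auxiliary representation loss: pull a pooled acoustic representation
        # toward the BERT sentence representation.
        acoustic_repr = self.proj(h).mean(dim=1)               # (B, bert_dim)
        aux_loss = F.mse_loss(acoustic_repr, sent_target)
        return ctc_loss + alpha * aux_loss
```

The second method (joint classification learning with GPT2 in a hybrid CTC/attention architecture) is not sketched here.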
Related papers
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
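As a toy illustration of the word-spotting idea only (not the paper's context-graph implementation), the sketch below scores candidate context phrases directly against per-frame CTC log-probabilities with a simple dynamic program; the length normalisation and detection threshold are assumptions.

```python
import numpy as np

def phrase_score(log_probs, phrase, blank=0):
    """Best monotonic-alignment log-score of `phrase` (a list of token ids)
    starting at any frame of `log_probs` with shape (T, vocab), allowing
    blank frames between tokens. A simplification of CTC phrase matching."""
    T, _ = log_probs.shape
    n = len(phrase)
    NEG = -1e9
    dp = np.full(n + 1, NEG)          # dp[j]: best score after emitting j tokens
    dp[0] = 0.0
    best = NEG
    for t in range(T):
        new = dp.copy()
        new[0] = 0.0                  # the phrase may start at any frame
        for j in range(1, n + 1):
            emit = dp[j - 1] + log_probs[t, phrase[j - 1]]   # emit next token here
            stay = dp[j] + log_probs[t, blank]               # pad with a blank frame
            new[j] = max(emit, stay)
        dp = new
        best = max(best, dp[n] / n)   # length-normalised score (assumption)
    return best

def spot(log_probs, context_phrases, threshold=-1.5):
    """Return the context phrases whose best score exceeds the threshold."""
    return [p for p in context_phrases if phrase_score(log_probs, p) > threshold]
```

In the actual method the detected candidates are then merged into the recognition result; that step is omitted here.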
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
- Co-training and Co-distillation for Quality Improvement and Compression of Language Models [88.94539115180919]
Knowledge Distillation (KD) compresses expensive pre-trained language models (PLMs) by transferring their knowledge to smaller models.
Most smaller models fail to surpass the performance of the original larger model, so performance is sacrificed for faster inference.
We propose Co-Training and Co-Distillation (CTCD), a novel framework that improves performance and inference speed together by co-training two models.
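A minimal sketch of the co-training / co-distillation idea described above, assuming a softened bidirectional KL distillation term plus a standard supervised loss for each model; the exact CTCD objective in the paper may differ.

```python
import torch.nn.functional as F

def ctcd_step(large_model, small_model, inputs, labels, tau=2.0, lam=0.5):
    """One co-training / co-distillation step: both models learn the task and
    each is distilled from the other's softened predictions (sketch)."""
    logits_l = large_model(inputs)
    logits_s = small_model(inputs)

    # Supervised task loss for both models.
    task = F.cross_entropy(logits_l, labels) + F.cross_entropy(logits_s, labels)

    # Bidirectional distillation with temperature tau; each side treats the
    # other's output as a fixed (detached) teacher.
    logp_l = F.log_softmax(logits_l / tau, dim=-1)
    logp_s = F.log_softmax(logits_s / tau, dim=-1)
    distill = (F.kl_div(logp_s, logp_l.exp().detach(), reduction="batchmean")
               + F.kl_div(logp_l, logp_s.exp().detach(), reduction="batchmean")) * tau * tau

    return task + lam * distill
```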
arXiv Detail & Related papers (2023-11-06T03:29:00Z)
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning in which the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67$\times$.
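The summary above mentions two encoders that are guided by CTC to predict the source and target texts; the sketch below is one plausible reading of that layout (a textual encoder stacked on a speech encoder, each with its own CTC loss) and is not the authors' code. Module interfaces and the equal loss weighting are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class DualCTCNAST(nn.Module):
    """Non-autoregressive speech translation with two CTC-guided encoders (sketch)."""

    def __init__(self, speech_encoder, text_encoder, d_model, src_vocab, tgt_vocab):
        super().__init__()
        self.speech_encoder = speech_encoder   # acoustic frames -> hidden states
        self.text_encoder = text_encoder       # hidden states -> translation states
        self.src_head = nn.Linear(d_model, src_vocab)
        self.tgt_head = nn.Linear(d_model, tgt_vocab)

    def forward(self, speech, frame_lens, src, src_lens, tgt, tgt_lens):
        h_src = self.speech_encoder(speech)                          # (B, T, d_model)
        h_tgt = self.text_encoder(h_src)                             # (B, T, d_model)
        src_lp = F.log_softmax(self.src_head(h_src), -1).transpose(0, 1)
        tgt_lp = F.log_softmax(self.tgt_head(h_tgt), -1).transpose(0, 1)
        loss_src = F.ctc_loss(src_lp, src, frame_lens, src_lens, zero_infinity=True)
        loss_tgt = F.ctc_loss(tgt_lp, tgt, frame_lens, tgt_lens, zero_infinity=True)
        return loss_src + loss_tgt                                   # equal weighting (assumption)
```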
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
- A context-aware knowledge transferring strategy for CTC-based ASR [9.500518278458905]
Methods based on connectionist temporal classification (CTC) remain a dominant approach.
We propose a context-aware knowledge transferring strategy, consisting of a knowledge transferring module and a context-aware training strategy, for CTC-based ASR.
A knowledge-injected, context-aware CTC-based ASR system built upon wav2vec2.0 is presented in this paper.
arXiv Detail & Related papers (2022-10-12T14:31:38Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Improving CTC-based ASR Models with Gated Interlayer Collaboration [9.930655347717932]
We present a Gated Interlayer Collaboration mechanism which introduces contextual information into the models.
We train the model with intermediate CTC losses calculated by the interlayer outputs of the model, in which the probability distributions of the intermediate layers naturally serve as soft label sequences.
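A hedged sketch of the two ingredients summarised above: an intermediate CTC loss computed on an interlayer output, and a gate that feeds the embedded intermediate token distribution back into the following layers as contextual information. The gating form, the shared CTC head, and the loss weight are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedInterCTC(nn.Module):
    """Encoder with one intermediate CTC loss and gated feedback (sketch)."""

    def __init__(self, layers, d_model, vocab_size, inter_index):
        super().__init__()
        self.layers = nn.ModuleList(layers)              # e.g. Conformer blocks
        self.inter_index = inter_index                   # layer that gets the intermediate loss
        self.ctc_head = nn.Linear(d_model, vocab_size)   # shared CTC head (assumption)
        self.embed = nn.Linear(vocab_size, d_model)      # distribution -> feature space
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, x, frame_lens, targets, target_lens, inter_weight=0.3):
        inter_loss = x.new_zeros(())
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i == self.inter_index:
                probs = F.softmax(self.ctc_head(x), dim=-1)              # soft label sequence
                lp = torch.log(probs + 1e-8).transpose(0, 1)
                inter_loss = F.ctc_loss(lp, targets, frame_lens, target_lens,
                                        zero_infinity=True)
                ctx = self.embed(probs)                                  # contextual information
                g = torch.sigmoid(self.gate(torch.cat([x, ctx], dim=-1)))
                x = g * x + (1 - g) * ctx                                # gated fusion
        final_lp = F.log_softmax(self.ctc_head(x), -1).transpose(0, 1)
        final_loss = F.ctc_loss(final_lp, targets, frame_lens, target_lens,
                                zero_infinity=True)
        return final_loss + inter_weight * inter_loss
```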
arXiv Detail & Related papers (2022-05-25T03:21:27Z)
- A Complementary Joint Training Approach Using Unpaired Speech and Text for Low-Resource Automatic Speech Recognition [25.473191378558138]
We leverage unpaired data to train a general sequence-to-sequence model.
Inspired by the complementarity of speech-PseudoLabel pairs and SynthesizedAudio-text pairs, we propose a complementary joint training (CJT) method.
arXiv Detail & Related papers (2022-04-05T07:02:53Z)
- Improving Hybrid CTC/Attention End-to-end Speech Recognition with Pretrained Acoustic and Language Model [4.490054848527943]
We propose a pretrained Transformer (Preformer) S2S ASR architecture based on hybrid CTC/attention E2E models.
To the best of our knowledge, this is the first work to utilize both a pretrained AM and a pretrained LM in an S2S ASR system.
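As an illustration of the general setup this summary describes, a hybrid CTC/attention objective with a pretrained acoustic encoder and a pretrained LM-based decoder might look like the sketch below; the 0.3 interpolation weight and the module interfaces are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class HybridCTCAttention(nn.Module):
    """Joint CTC/attention loss with pretrained encoder and decoder (sketch)."""

    def __init__(self, pretrained_am_encoder, pretrained_lm_decoder, d_model, vocab_size):
        super().__init__()
        self.encoder = pretrained_am_encoder   # e.g. a pretrained acoustic encoder
        self.decoder = pretrained_lm_decoder   # e.g. an LM-initialised decoder with cross-attention
        self.ctc_head = nn.Linear(d_model, vocab_size)

    def forward(self, speech, frame_lens, targets, target_lens, ys_in, ys_out,
                ctc_weight=0.3):
        enc = self.encoder(speech)                                       # (B, T, d_model)
        ctc_lp = F.log_softmax(self.ctc_head(enc), -1).transpose(0, 1)
        loss_ctc = F.ctc_loss(ctc_lp, targets, frame_lens, target_lens,
                              zero_infinity=True)

        dec_logits = self.decoder(ys_in, enc)                            # teacher forcing, (B, L, V)
        loss_att = F.cross_entropy(dec_logits.transpose(1, 2), ys_out,
                                   ignore_index=-1)
        return ctc_weight * loss_ctc + (1 - ctc_weight) * loss_att
```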
arXiv Detail & Related papers (2021-12-14T09:38:31Z)
- Combining Unsupervised and Text Augmented Semi-Supervised Learning for Low Resourced Autoregressive Speech Recognition [7.067186994804316]
We pretrain state-of-the-art Conformer models in an unsupervised manner.
Additional text data is incorporated through external language models.
Final performance is a further 2% better absolute when CTC-based decoding is used for semi-supervised training.
arXiv Detail & Related papers (2021-10-29T14:59:18Z)
- A Study on Effects of Implicit and Explicit Language Model Information for DBLSTM-CTC Based Handwriting Recognition [51.36957172200015]
We study the effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition.
Even when one million lines of training sentences are used to train the DBLSTM, an explicit language model is still helpful.
arXiv Detail & Related papers (2020-07-31T08:23:37Z)