Improving CTC-based ASR Models with Gated Interlayer Collaboration
- URL: http://arxiv.org/abs/2205.12462v1
- Date: Wed, 25 May 2022 03:21:27 GMT
- Title: Improving CTC-based ASR Models with Gated Interlayer Collaboration
- Authors: Yuting Yang, Yuke Li, Binbin Du
- Abstract summary: We present a Gated Interlayer Collaboration mechanism which introduces contextual information into the models.
We train the model with intermediate CTC losses calculated from the interlayer outputs of the model, where the probability distributions of the intermediate layers naturally serve as soft label sequences.
- Score: 9.930655347717932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: For Automatic Speech Recognition (ASR), CTC-based methods have become a dominant paradigm due to their simple architecture and efficient non-autoregressive inference.
However, without external language models, these methods usually lack the capacity to model conditional dependencies and textual interaction.
In this work, we present a Gated Interlayer Collaboration (GIC) mechanism which introduces contextual information into the models and relaxes the conditional independence assumption of CTC-based models.
Specifically, we train the model with intermediate CTC losses calculated from the interlayer outputs of the model, where the probability distributions of the intermediate layers naturally serve as soft label sequences.
The GIC block consists of an embedding layer that obtains the textual embedding of the soft label at each position, and a gate unit that fuses the textual embedding with the acoustic features.
Experiments on the AISHELL-1 and AIDATATANG benchmarks show that the proposed method outperforms recently published CTC-based ASR models.
Specifically, our method achieves CERs of 4.0%/4.4% on the AISHELL-1 dev/test sets and 3.8%/4.4% on the AIDATATANG dev/test sets using CTC greedy-search decoding without external language models.
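The abstract's description of the GIC block maps naturally onto a few tensor operations. Below is a minimal PyTorch sketch under that reading; the module names, dimensions, and exact gating formula are assumptions made for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class GICBlock(nn.Module):
    """Illustrative sketch of a Gated Interlayer Collaboration block.

    An intermediate encoder layer's CTC softmax output (a soft label
    sequence) is turned into a textual embedding, which is then fused
    with the acoustic features through a learned gate. Names and the
    exact gating form are assumptions, not the paper's implementation.
    """

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.ctc_head = nn.Linear(d_model, vocab_size)            # intermediate CTC projection
        self.embed = nn.Linear(vocab_size, d_model, bias=False)   # soft-label embedding matrix
        self.gate = nn.Linear(2 * d_model, d_model)

    def forward(self, h: torch.Tensor):
        # h: acoustic features from an intermediate layer, shape (B, T, d_model)
        logits = self.ctc_head(h)                 # (B, T, V), also the intermediate CTC loss target
        soft_labels = logits.softmax(dim=-1)      # per-frame probability distribution (soft labels)
        text_emb = self.embed(soft_labels)        # textual embedding of the soft labels
        g = torch.sigmoid(self.gate(torch.cat([h, text_emb], dim=-1)))
        fused = g * h + (1.0 - g) * text_emb      # gated fusion of acoustic and textual streams
        return fused, logits


# toy usage: 2 utterances, 50 frames, 256-dim features, 100-symbol vocabulary
block = GICBlock(d_model=256, vocab_size=100)
h = torch.randn(2, 50, 256)
fused, inter_logits = block(h)
print(fused.shape, inter_logits.shape)  # torch.Size([2, 50, 256]) torch.Size([2, 50, 100])
```

During training, `inter_logits` would feed an intermediate CTC loss alongside the final-layer CTC loss, as the abstract states; how the losses are weighted is a further design choice the abstract does not pin down.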
Related papers
- Language Models as Zero-shot Lossless Gradient Compressors: Towards General Neural Parameter Prior Models [66.1595537904019]
Large language models (LLMs) can act as gradient priors in a zero-shot setting.
We introduce LM-GC, a novel method that integrates LLMs with arithmetic coding.
arXiv Detail & Related papers (2024-09-26T13:38:33Z)
- High-Performance Few-Shot Segmentation with Foundation Models: An Empirical Study [64.06777376676513]
We develop a few-shot segmentation (FSS) framework based on foundation models.
To be specific, we propose a simple approach to extract implicit knowledge from foundation models to construct coarse correspondence.
Experiments on two widely used datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-09-10T08:04:11Z)
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
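The context-biasing entry above describes matching CTC log-probabilities against a compact context graph. As a rough illustration of the underlying idea only (not the paper's word spotter, which uses a context graph and proper blank handling), a toy monotonic-alignment score for a single biasing phrase could look like this; the phrase token IDs and shapes are placeholders.

```python
import numpy as np

def spot_phrase(log_probs: np.ndarray, phrase: list[int], blank: int = 0) -> float:
    """Toy CTC word-spotting score for one phrase (a token-ID sequence).

    log_probs: (T, V) per-frame CTC log-probabilities.
    Returns the best score of a monotonic alignment of the phrase tokens
    to distinct frames (frames in between are simply skipped), or -inf if
    the phrase cannot be aligned. This is a simple dynamic program, not
    the compact context-graph matcher described in the paper.
    """
    T, _ = log_probs.shape
    L = len(phrase)
    dp = np.full(L + 1, -np.inf)   # dp[i] = best score after matching the first i tokens
    dp[0] = 0.0
    for t in range(T):
        frame = log_probs[t]
        # iterate backwards so each frame extends the match by at most one token
        for i in range(L, 0, -1):
            cand = dp[i - 1] + frame[phrase[i - 1]]
            if cand > dp[i]:
                dp[i] = cand
    return float(dp[L])

# toy usage with random log-probabilities and a 3-token "phrase"
rng = np.random.default_rng(0)
logits = rng.standard_normal((20, 30))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
print(spot_phrase(log_probs, phrase=[5, 7, 9]))
```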
- Latent Semantic Consensus For Deterministic Geometric Model Fitting [109.44565542031384]
We propose an effective method called Latent Semantic Consensus (LSC).
LSC formulates the model fitting problem into two latent semantic spaces based on data points and model hypotheses.
LSC is able to provide consistent and reliable solutions within only a few milliseconds for general multi-structural model fitting.
arXiv Detail & Related papers (2024-03-11T05:35:38Z)
- FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as input the one-hot encoded ID features of the tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as input the sentences of the textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z)
- Mask The Bias: Improving Domain-Adaptive Generalization of CTC-based ASR with Internal Language Model Estimation [14.840612036671734]
Internal language model estimation (ILME) has been proposed to mitigate this bias for autoregressive models.
We propose a novel ILME technique for CTC-based ASR models.
Our method iteratively masks the audio timesteps to estimate a pseudo log-likelihood of the internal LM.
arXiv Detail & Related papers (2023-05-05T20:35:42Z)
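To make the masking idea in the ILME entry above concrete, here is a heavily simplified sketch: the paper describes iteratively masking audio timesteps to estimate a pseudo log-likelihood of the internal LM, whereas this toy version masks all frames in one shot; the stand-in model, feature shapes, and masking value are assumptions.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def internal_lm_logprobs(model: nn.Module, feats: torch.Tensor) -> torch.Tensor:
    """Toy illustration of ILME-style estimation for a CTC model.

    Replace (mask) the acoustic input and look at what the model predicts
    anyway; the resulting label distribution reflects its internal prior
    over labels rather than the acoustics. One-shot zero masking is a
    simplification of the iterative per-timestep masking in the paper.
    """
    masked = torch.zeros_like(feats)   # crude mask of all audio timesteps
    logits = model(masked)             # (B, T, V) CTC logits on masked input
    return logits.log_softmax(dim=-1)

# toy usage with a stand-in "model"
model = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 100))
feats = torch.randn(2, 50, 80)         # (B, T, feat_dim) filterbank features
ilm = internal_lm_logprobs(model, feats)
print(ilm.shape)                        # torch.Size([2, 50, 100])
```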
- InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss [43.39035144463951]
Momentum pseudo-labeling (MPL) trains a connectionist temporal classification (CTC)-based model on unlabeled data.
CTC is well suited for MPL, or PL-based semi-supervised ASR in general, owing to its simple/fast inference algorithm and robustness against generating collapsed labels.
We propose to enhance MPL by introducing intermediate loss, inspired by the recent advances in CTC-based modeling.
arXiv Detail & Related papers (2022-11-02T00:18:25Z)
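The intermediate-loss idea in the InterMPL entry above (shared with the intermediate-CTC line of work cited further below) amounts to summing CTC losses computed at several encoder depths. A hedged PyTorch sketch of such a combined objective, with the interpolation weight chosen arbitrarily:

```python
import torch
import torch.nn.functional as F

def ctc_with_intermediate_loss(final_log_probs, inter_log_probs_list, targets,
                               input_lens, target_lens, inter_weight=0.3):
    """Sketch of an objective mixing the final CTC loss with intermediate-layer
    CTC losses; the averaging and weighting scheme are assumptions.
    Log-probability tensors are (T, B, V), as expected by F.ctc_loss."""
    main = F.ctc_loss(final_log_probs, targets, input_lens, target_lens, blank=0)
    inter = sum(F.ctc_loss(lp, targets, input_lens, target_lens, blank=0)
                for lp in inter_log_probs_list) / max(len(inter_log_probs_list), 1)
    return (1.0 - inter_weight) * main + inter_weight * inter

# toy usage: T=50 frames, B=2 utterances, V=100 symbols, target length 10
T, B, V = 50, 2, 100
logp = lambda: torch.randn(T, B, V).log_softmax(dim=-1)
targets = torch.randint(1, V, (B, 10))
loss = ctc_with_intermediate_loss(logp(), [logp(), logp()], targets,
                                  torch.full((B,), T), torch.full((B,), 10))
print(loss.item())
```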
- Improving CTC-based speech recognition via knowledge transferring from pre-trained language models [30.599901925058873]
We propose two knowledge transferring methods to improve CTC-based models.
The first method is based on representation learning, in which the CTC-based models use the representation produced by BERT as an auxiliary learning target.
The second method is based on joint classification learning, which combines GPT2 for text modeling with a hybrid CTC/attention architecture.
arXiv Detail & Related papers (2022-02-22T11:30:55Z)
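For the first transfer method in the entry above (BERT representations as an auxiliary learning target), a minimal sketch might regress a pooled encoder representation onto the LM representation. The pooling, projection, and MSE objective are assumptions for illustration; the paper may align representations at a finer granularity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def representation_transfer_loss(acoustic_repr: torch.Tensor,
                                 bert_repr: torch.Tensor,
                                 proj: nn.Linear) -> torch.Tensor:
    """Sketch of representation-level knowledge transfer: the CTC encoder
    output is pooled per utterance, projected to the LM dimension, and
    regressed onto a representation produced by a pre-trained LM."""
    pooled = acoustic_repr.mean(dim=1)   # (B, d_enc) utterance-level pooling
    return F.mse_loss(proj(pooled), bert_repr)

# toy usage: encoder dim 256, LM dim 768
proj = nn.Linear(256, 768)
enc_out = torch.randn(2, 50, 256)        # CTC encoder output (B, T, d_enc)
bert_out = torch.randn(2, 768)           # stand-in for pre-trained LM embeddings
print(representation_transfer_loss(enc_out, bert_out, proj).item())
```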
- Relaxing the Conditional Independence Assumption of CTC-based ASR by Conditioning on Intermediate Predictions [14.376418789524783]
We train a CTC-based ASR model with auxiliary CTC losses in intermediate layers in addition to the original CTC loss in the last layer.
Our method is easy to implement and retains the merits of CTC-based ASR: a simple model architecture and fast decoding speed.
arXiv Detail & Related papers (2021-04-06T18:00:03Z)
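The title of the entry above points to conditioning later encoder layers on intermediate CTC predictions. A minimal sketch of one such layer follows, using a simple additive feedback of the intermediate softmax rather than the gated fusion of the GIC paper; all module choices are illustrative, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ConditionedLayer(nn.Module):
    """Sketch of conditioning on an intermediate CTC prediction: the
    intermediate softmax is projected back to the feature space and added
    to the hidden states seen by subsequent encoder layers."""

    def __init__(self, d_model: int, vocab_size: int):
        super().__init__()
        self.ctc_head = nn.Linear(d_model, vocab_size)
        self.back_proj = nn.Linear(vocab_size, d_model, bias=False)

    def forward(self, h: torch.Tensor):
        inter_logits = self.ctc_head(h)                    # auxiliary CTC loss target
        h = h + self.back_proj(inter_logits.softmax(-1))   # condition later layers on the prediction
        return h, inter_logits

layer = ConditionedLayer(256, 100)
h, logits = layer(torch.randn(2, 50, 256))
print(h.shape, logits.shape)  # torch.Size([2, 50, 256]) torch.Size([2, 50, 100])
```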
- Autoregressive Score Matching [113.4502004812927]
We propose autoregressive conditional score models (AR-CSM), where we parameterize the joint distribution in terms of the derivatives of univariate log-conditionals (scores).
For AR-CSM models, the divergence between data and model distributions can be computed and optimized efficiently, requiring no expensive sampling or adversarial training.
We show with extensive experimental results that it can be applied to density estimation on synthetic data, image generation, image denoising, and training latent variable models with implicit encoders.
arXiv Detail & Related papers (2020-10-24T07:01:24Z)
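Restating the parameterization mentioned in the last entry in standard notation (notation assumed here, not taken from the paper): the joint density factorizes autoregressively, and each factor is represented through its score, i.e. the derivative of the univariate log-conditional.

```latex
\log p_\theta(x) \;=\; \sum_{i=1}^{D} \log p_\theta(x_i \mid x_{<i}),
\qquad
s_{\theta,i}(x_{\le i}) \;=\; \frac{\partial}{\partial x_i} \log p_\theta(x_i \mid x_{<i}).
```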