Mask The Bias: Improving Domain-Adaptive Generalization of CTC-based ASR
with Internal Language Model Estimation
- URL: http://arxiv.org/abs/2305.03837v1
- Date: Fri, 5 May 2023 20:35:42 GMT
- Title: Mask The Bias: Improving Domain-Adaptive Generalization of CTC-based ASR
with Internal Language Model Estimation
- Authors: Nilaksh Das, Monica Sunkara, Sravan Bodapati, Jinglun Cai, Devang
Kulshreshtha, Jeff Farris, Katrin Kirchhoff
- Abstract summary: Internal language model estimation (ILME) has been proposed to mitigate the implicit bias towards training-data language semantics in autoregressive models.
We propose a novel ILME technique for CTC-based ASR models.
Our method iteratively masks the audio timesteps to estimate a pseudo log-likelihood of the internal LM.
- Score: 14.840612036671734
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: End-to-end ASR models trained on large amounts of data tend to be implicitly
biased towards language semantics of the training data. Internal language model
estimation (ILME) has been proposed to mitigate this bias for autoregressive
models such as attention-based encoder-decoder and RNN-T. Typically, ILME is
performed by modularizing the acoustic and language components of the model
architecture, and eliminating the acoustic input to perform log-linear
interpolation with the text-only posterior. However, for CTC-based ASR, it is
not as straightforward to decouple the model into such acoustic and language
components, as CTC log-posteriors are computed in a non-autoregressive manner.
In this work, we propose a novel ILME technique for CTC-based ASR models. Our
method iteratively masks the audio timesteps to estimate a pseudo
log-likelihood of the internal LM by accumulating log-posteriors for only the
masked timesteps. Extensive evaluation across multiple out-of-domain datasets
reveals that the proposed approach improves WER by up to 9.8% and OOV F1-score
by up to 24.6% relative to Shallow Fusion, when only text data from the target
domain is available. In the case of zero-shot domain adaptation, with no access
to any target domain data, we demonstrate that removing the source domain bias
with ILME can still outperform Shallow Fusion, improving WER by up to 9.3%
relative.
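The core estimation step is simple enough to sketch. Below is a minimal PyTorch sketch of the idea, assuming a hypothetical `ctc_model` callable that maps a (T, D) feature tensor to (T, V) per-frame log-posteriors; the chunked masking schedule, the zero-fill mask value, and the fusion weights are illustrative assumptions rather than the paper's exact recipe.

```python
import torch

def pseudo_ilm_log_posteriors(ctc_model, feats, num_chunks=10):
    """Iteratively mask spans of audio timesteps and keep the CTC
    log-posteriors only at the masked positions; accumulated over all
    spans, they form a pseudo log-likelihood of the internal LM."""
    T = feats.size(0)
    chunk = (T + num_chunks - 1) // num_chunks  # ceil(T / num_chunks)
    pieces = []
    with torch.no_grad():
        for start in range(0, T, chunk):
            masked = feats.clone()
            masked[start:start + chunk] = 0.0   # mask one span of frames
            logp = ctc_model(masked)            # (T, V) log-posteriors
            pieces.append(logp[start:start + chunk])
    return torch.cat(pieces)                    # (T, V) internal-LM estimate

def fused_score(logp_ctc, logp_ext_lm, logp_ilm, lam_lm=0.6, lam_ilm=0.4):
    """Decoding-time log-linear interpolation: shallow fusion with an
    external LM, minus the estimated internal-LM bias (weights illustrative)."""
    return logp_ctc + lam_lm * logp_ext_lm - lam_ilm * logp_ilm
```

At decoding time, `fused_score` replaces plain shallow fusion: the external LM score is added and the estimated internal-LM score is subtracted, removing the source-domain bias.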
Related papers
- Effective internal language model training and fusion for factorized transducer model [26.371223360905557]
The internal language model (ILM) of the neural transducer has been widely studied.
We propose a novel ILM training and decoding strategy for factorized transducer models.
(arXiv: 2024-04-02)
- Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that a gradient of synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
(arXiv: 2023-10-10)
- Decoupled Structure for Improved Adaptability of End-to-End Models [16.195423291103975]
This paper proposes decoupled structures for attention-based encoder-decoder (Decoupled-AED) and neural transducer (Decoupled-Transducer) models.
The acoustic and linguistic parts of the E2E model decoder (or prediction network) are decoupled, making the linguistic component replaceable.
Experiments for E2E ASR models trained on the Libri-100h corpus showed that the proposed decoupled structure gave 15.1% and 17.2% relative word error rate reductions.
(arXiv: 2023-08-25)
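The replaceable linguistic component described in the entry above lends itself to a short sketch. This is a loose illustration under assumed names (`DecoupledDecoder` and `swap_linguistic` are hypothetical), not the paper's actual architecture:

```python
import torch

class DecoupledDecoder(torch.nn.Module):
    """Sketch of a decoder whose acoustic and linguistic parts are decoupled,
    so the linguistic part can be swapped for a target-domain LM."""
    def __init__(self, acoustic_part: torch.nn.Module,
                 linguistic_part: torch.nn.Module):
        super().__init__()
        self.acoustic = acoustic_part       # fixed after source-domain training
        self.linguistic = linguistic_part   # replaceable language component

    def swap_linguistic(self, target_domain_lm: torch.nn.Module):
        # Domain adaptation: plug in an LM trained on target-domain text only.
        self.linguistic = target_domain_lm
```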
- IDA: Informed Domain Adaptive Semantic Segmentation [51.12107564372869]
We propose an Informed Domain Adaptation (IDA) model, a self-training framework that mixes data based on class-level segmentation performance.
In our IDA model, class-level performance is tracked by an expected confidence score (ECS), and a dynamic schedule then determines the mixing ratio for data from different domains.
Our proposed method outperforms the state-of-the-art UDA-SS method by a margin of 1.1 mIoU in the adaptation of GTA-V to Cityscapes and of 0.9 mIoU in the adaptation of SYNTHIA to Cityscapes.
(arXiv: 2023-03-05)
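The ECS-and-schedule idea from the IDA entry above can be caricatured in a few lines. The EMA update and the inverse-confidence mixing rule below are assumptions for illustration, not the paper's exact schedule:

```python
import torch

def update_ecs(ecs, batch_class_confidence, momentum=0.9):
    """Track per-class expected confidence scores with an exponential
    moving average; both tensors have shape (num_classes,)."""
    return momentum * ecs + (1 - momentum) * batch_class_confidence

def mixing_ratio(ecs):
    # Lower expected confidence -> larger share of cross-domain mixing
    # for that class (an assumed, monotone mapping).
    return 1.0 - ecs
```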
- Exploiting Temporal Structures of Cyclostationary Signals for Data-Driven Single-Channel Source Separation [98.95383921866096]
We study the problem of single-channel source separation (SCSS).
We focus on cyclostationary signals, which are particularly suitable in a variety of application domains.
We propose a deep learning approach using a U-Net architecture, which is competitive with the minimum MSE estimator.
(arXiv: 2022-08-22)
- Improving CTC-based ASR Models with Gated Interlayer Collaboration [9.930655347717932]
We present a Gated Interlayer Collaboration mechanism which introduces contextual information into the models.
We train the model with intermediate CTC losses calculated from the interlayer outputs of the model, where the probability distributions of the intermediate layers naturally serve as soft label sequences.
(arXiv: 2022-05-25)
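The intermediate-CTC training described in the entry above can be sketched as follows; the encoder interface, the equal averaging over layers, and the 0.3 weight are assumptions, and the gating mechanism itself is omitted:

```python
import torch.nn.functional as F

def ctc_with_intermediate_losses(layer_logits, targets, input_lens,
                                 target_lens, inter_weight=0.3):
    """layer_logits: list of (T, B, V) logit tensors from selected encoder
    layers; the last entry is the final layer."""
    final_loss = F.ctc_loss(layer_logits[-1].log_softmax(-1),
                            targets, input_lens, target_lens)
    inter = [F.ctc_loss(x.log_softmax(-1), targets, input_lens, target_lens)
             for x in layer_logits[:-1]]
    inter_loss = sum(inter) / max(len(inter), 1)
    return (1 - inter_weight) * final_loss + inter_weight * inter_loss
```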
- Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models [107.86965028729517]
Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions.
We propose several novel methods to estimate the ILM directly from the AED model.
(arXiv: 2021-04-12)
- Internal Language Model Training for Domain-Adaptive End-to-End Speech Recognition [83.739317674302]
The internal language model estimation (ILME) method can be used to improve integration between external language models and automatic speech recognition systems.
We propose an internal LM training (ILMT) method to minimize an additional internal LM loss.
ILMT encourages the E2E model to form a standalone LM inside its existing components, without sacrificing ASR accuracy.
(arXiv: 2021-02-02)
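The ILMT objective from the entry above reduces to adding a weighted internal-LM term to the usual E2E loss; a one-line sketch, where `ilm_weight` is an assumed hyperparameter rather than the paper's setting:

```python
def ilmt_loss(e2e_loss, ilm_loss, ilm_weight=0.5):
    # ilm_loss: e.g., cross-entropy of the decoder / prediction network
    # run as a text-only LM over the training transcripts.
    return e2e_loss + ilm_weight * ilm_loss
```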
- Internal Language Model Estimation for Domain-Adaptive End-to-End Speech Recognition [56.27081731553829]
Internal language model (LM) integration is a challenging task for end-to-end (E2E) automatic speech recognition.
We propose an internal LM estimation (ILME) method to facilitate a more effective integration of the external LM with all pre-existing E2E models.
ILME can alleviate the domain mismatch between training and testing, or improve multi-domain E2E ASR.
(arXiv: 2020-11-03)
- A Density Ratio Approach to Language Model Fusion in End-To-End Automatic Speech Recognition [9.184319271887531]
This article describes a density ratio approach to integrating external Language Models (LMs) into end-to-end models for Automatic Speech Recognition (ASR).
An RNN-T ASR model trained on paired audio & transcript data from YouTube is evaluated for its ability to generalize to Voice Search data.
(arXiv: 2020-02-26)
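The density-ratio correction from the entry above amounts to a simple rescoring rule: subtract a source-domain LM score and add a target-domain LM score. A sketch with assumed weights:

```python
def density_ratio_score(logp_e2e, logp_source_lm, logp_target_lm,
                        lam_src=0.5, lam_tgt=0.5):
    # Correct the E2E score by the (log) ratio of target- to source-domain
    # LM probabilities; both interpolation weights are illustrative.
    return logp_e2e - lam_src * logp_source_lm + lam_tgt * logp_target_lm
```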