Personalization of CTC Speech Recognition Models
- URL: http://arxiv.org/abs/2210.09510v1
- Date: Tue, 18 Oct 2022 01:08:21 GMT
- Title: Personalization of CTC Speech Recognition Models
- Authors: Saket Dingliwal, Monica Sunkara, Srikanth Ronanki, Jeff Farris, Katrin
Kirchhoff, Sravan Bodapati
- Abstract summary: We propose a novel two-way approach that first biases the encoder with attention over a list of rare long-tail and out-of-vocabulary words.
We evaluate our approach on open-source VoxPopuli and in-house medical datasets to showcase a 60% improvement in F1 score on domain-specific rare words.
- Score: 15.470660345766445
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: End-to-end speech recognition models trained using joint Connectionist
Temporal Classification (CTC)-Attention loss have gained popularity recently.
In these models, a non-autoregressive CTC decoder is often used at inference
time due to its speed and simplicity. However, such models are hard to
personalize because of their conditional independence assumption that prevents
output tokens from previous time steps from influencing future predictions. To
tackle this, we propose a novel two-way approach that first biases the encoder
with attention over a predefined list of rare long-tail and out-of-vocabulary
(OOV) words and then uses dynamic boosting and phone alignment network during
decoding to further bias the subword predictions. We evaluate our approach on
open-source VoxPopuli and in-house medical datasets to showcase a 60%
improvement in F1 score on domain-specific rare words over a strong CTC
baseline.
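The first stage of the proposed approach, biasing the encoder with attention over the word list, can be pictured as cross-attention from encoder frames to embeddings of the bias words. Below is a minimal PyTorch sketch of that idea, not the authors' implementation; the module name, sizes, and the residual fusion of the biased context are illustrative assumptions.

```python
import torch
import torch.nn as nn

class BiasingAttention(nn.Module):
    """Cross-attention from encoder frames to bias-word embeddings.

    A minimal sketch of encoder biasing: each acoustic frame attends over
    embeddings of rare/OOV words, and the attended context is added back
    to the frame representation. All names and sizes are illustrative.
    """

    def __init__(self, d_model: int = 256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, enc: torch.Tensor, bias_emb: torch.Tensor) -> torch.Tensor:
        # enc:      (batch, frames, d_model) encoder outputs
        # bias_emb: (batch, n_words, d_model) embeddings of the bias-word list
        context, _ = self.attn(query=enc, key=bias_emb, value=bias_emb)
        return self.norm(enc + context)  # residual fusion of biased context

# Toy usage: 1 utterance, 50 frames, 10 bias words.
enc = torch.randn(1, 50, 256)
bias_emb = torch.randn(1, 10, 256)
print(BiasingAttention()(enc, bias_emb).shape)  # torch.Size([1, 50, 256])
```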
Related papers
- Post-decoder Biasing for End-to-End Speech Recognition of Multi-turn Medical Interview [26.823126615724888]
The end-to-end (E2E) approach is gradually replacing hybrid models in automatic speech recognition (ASR) tasks.
We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix based on the distribution of training transcriptions.
In our experiments, for subsets of rare words appearing between 10 and 20 times in the training speech, the proposed method achieves relative improvements of 9.3% and 5.1%, respectively.
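A hedged reading of post-decoder biasing: redistribute the output posteriors through a token-to-token transform matrix estimated from training-transcription statistics. The NumPy sketch below shows only this mechanical step; the matrix construction used here is an assumption, not the paper's recipe.

```python
import numpy as np

def apply_post_decoder_bias(posteriors: np.ndarray, bias_matrix: np.ndarray) -> np.ndarray:
    """Redistribute per-frame posteriors through a transform probability matrix.

    posteriors:  (frames, vocab) model output distributions
    bias_matrix: (vocab, vocab) row-stochastic matrix, e.g. estimated from
                 training-transcription statistics (construction assumed here)
    """
    biased = posteriors @ bias_matrix                  # mix probability mass between tokens
    return biased / biased.sum(axis=1, keepdims=True)  # renormalize per frame

# Toy example: 3 frames, 4-token vocabulary, mild boost for token 2.
rng = np.random.default_rng(0)
post = rng.dirichlet(np.ones(4), size=3)
bias = 0.9 * np.eye(4) + 0.1 / 4                       # illustrative construction
bias[:, 2] += 0.05                                     # boost a rare token
bias /= bias.sum(axis=1, keepdims=True)
print(apply_post_decoder_bias(post, bias))
```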
arXiv Detail & Related papers (2024-03-01T08:53:52Z)
- Improved Training for End-to-End Streaming Automatic Speech Recognition Model with Punctuation [0.08602553195689511]
We propose a method for predicting punctuated text from input speech using a chunk-based Transformer encoder trained with Connectionist Temporal Classification (CTC) loss.
By combining CTC losses on the chunks and on the full utterances, we achieve improvements in both the F1 score of punctuation prediction and the Word Error Rate (WER).
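One way to read "combining CTC losses on the chunks and utterances" is a weighted sum of a full-utterance CTC loss and per-chunk CTC losses. The PyTorch sketch below assumes chunk-aligned target segments are available and that the combination weight is a simple hyperparameter; neither detail is given in the abstract.

```python
import torch
import torch.nn.functional as F

def combined_ctc_loss(utt_log_probs, utt_target, chunks, chunk_weight=0.3, blank=0):
    """Weighted sum of utterance-level and chunk-level CTC losses.

    utt_log_probs: (T, 1, vocab) log-probabilities for the whole utterance
    utt_target:    (1, U) full target token sequence
    chunks:        list of (log_probs, target) pairs, one per chunk; the
                   chunk-aligned target segments are assumed to be available
    """
    T = utt_log_probs.shape[0]
    loss = F.ctc_loss(utt_log_probs, utt_target,
                      torch.tensor([T]), torch.tensor([utt_target.shape[1]]),
                      blank=blank)
    for lp, tgt in chunks:  # add the weighted per-chunk CTC terms
        loss = loss + chunk_weight * F.ctc_loss(
            lp, tgt, torch.tensor([lp.shape[0]]), torch.tensor([tgt.shape[1]]),
            blank=blank)
    return loss

# Toy example: vocab of 5 (0 = blank), a 12-frame utterance split into two chunks.
utt_lp = torch.randn(12, 1, 5).log_softmax(-1)
utt_tgt = torch.tensor([[1, 2, 3, 4]])
chunks = [(utt_lp[:6], torch.tensor([[1, 2]])), (utt_lp[6:], torch.tensor([[3, 4]]))]
print(combined_ctc_loss(utt_lp, utt_tgt, chunks))
```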
arXiv Detail & Related papers (2023-06-02T06:46:14Z)
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×.
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
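Pseudo-language induction can be pictured as clustering frame-level features into discrete units and collapsing consecutive repeats into compact tokens. The sketch below uses scikit-learn k-means and run-length deduplication as stand-ins; the actual Wav2Seq recipe (feature extractor, unit count, subword step) is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def induce_pseudo_tokens(features: np.ndarray, n_units: int = 25) -> list:
    """Cluster frame features into discrete units, then collapse repeats.

    features: (frames, dim) frame-level speech representations
    Returns a compact pseudo-token sequence (an illustrative stand-in for
    Wav2Seq's pseudo language; the real recipe also applies subword modeling).
    """
    units = KMeans(n_clusters=n_units, n_init=10, random_state=0).fit_predict(features)
    deduped = [int(units[0])]
    for u in units[1:]:          # run-length deduplication
        if u != deduped[-1]:
            deduped.append(int(u))
    return deduped

# Toy example: 100 random "frames" of 39-dim features.
feats = np.random.default_rng(0).normal(size=(100, 39))
print(induce_pseudo_tokens(feats, n_units=8))
```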
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- End-to-end contextual asr based on posterior distribution adaptation for hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose adding a contextual bias attention (CBA) module to the attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
- Investigating the Reordering Capability in CTC-based Non-Autoregressive End-to-End Speech Translation [62.943925893616196]
We study the possibilities of building a non-autoregressive speech-to-text translation model using connectionist temporal classification (CTC).
CTC's success on translation is counter-intuitive due to its monotonicity assumption, so we analyze its reordering capability.
Our analysis shows that transformer encoders have the ability to change the word order.
arXiv Detail & Related papers (2021-05-11T07:48:45Z)
- A comparison of self-supervised speech representations as input features for unsupervised acoustic word embeddings [32.59716743279858]
We look at representation learning at the short-time frame level.
Recent approaches include self-supervised predictive coding and correspondence autoencoder (CAE) models.
We compare frame-level features from contrastive predictive coding (CPC), autoregressive predictive coding, and a CAE to conventional MFCCs.
arXiv Detail & Related papers (2020-12-14T10:17:25Z)
- Focus on the present: a regularization method for the ASR source-target attention layer [45.73441417132897]
This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models.
Our method is based on the fact that both CTC and source-target attention act on the same encoder representations.
We found that the source-target attention heads are able to predict several tokens ahead of the current one.
arXiv Detail & Related papers (2020-11-02T18:56:33Z)
- End-to-End Automatic Speech Recognition Integrated With CTC-Based Voice Activity Detection [48.80449801938696]
This paper integrates a voice activity detection function with end-to-end automatic speech recognition.
We focus on connectionist temporal classification (CTC) and its extension, the joint CTC/attention architecture.
We use the labels as a cue for detecting speech segments with simple thresholding.
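The thresholding idea can be sketched directly: treat frames with a high CTC blank posterior as non-speech and collect runs of low-blank frames into speech segments. This NumPy sketch assumes the blank posterior is the cue, a common choice in CTC-based VAD; the paper's exact rule and threshold value are assumptions here.

```python
import numpy as np

def ctc_vad(blank_posterior: np.ndarray, threshold: float = 0.9):
    """Detect speech segments by thresholding the CTC blank posterior.

    blank_posterior: (frames,) per-frame probability of the blank label
    Returns (start, end) frame indices of segments where speech is likely,
    i.e. where the blank probability drops below the threshold.
    """
    is_speech = blank_posterior < threshold
    segments, start = [], None
    for t, s in enumerate(is_speech):
        if s and start is None:
            start = t                     # segment opens
        elif not s and start is not None:
            segments.append((start, t))   # segment closes
            start = None
    if start is not None:                 # close a segment running to the end
        segments.append((start, len(is_speech)))
    return segments

# Toy posterior: blank-dominated frames around two short speech bursts.
p_blank = np.array([0.99, 0.98, 0.2, 0.1, 0.3, 0.97, 0.99, 0.15, 0.2, 0.95])
print(ctc_vad(p_blank))  # [(2, 5), (7, 9)]
```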
arXiv Detail & Related papers (2020-02-03T03:36:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.