Bypass Temporal Classification: Weakly Supervised Automatic Speech
Recognition with Imperfect Transcripts
- URL: http://arxiv.org/abs/2306.01031v1
- Date: Thu, 1 Jun 2023 14:56:19 GMT
- Authors: Dongji Gao and Matthew Wiesner and Hainan Xu and Leibny Paola Garcia
and Daniel Povey and Sanjeev Khudanpur
- Abstract summary: We present a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data.
The proposed algorithm improves the robustness and accuracy of ASR systems, particularly when working with imprecisely transcribed speech corpora.
- Score: 44.16141704545044
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel algorithm for building an automatic speech
recognition (ASR) model with imperfect training data. Imperfectly transcribed
speech is a prevalent issue in human-annotated speech corpora, which degrades
the performance of ASR models. To address this problem, we propose Bypass
Temporal Classification (BTC) as an expansion of the Connectionist Temporal
Classification (CTC) criterion. BTC explicitly encodes the uncertainties
associated with transcripts during training. This is accomplished by enhancing
the flexibility of the training graph, which is implemented as a weighted
finite-state transducer (WFST) composition. The proposed algorithm improves the
robustness and accuracy of ASR systems, particularly when working with
imprecisely transcribed speech corpora. Our implementation will be
open-sourced.
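The idea of making a CTC training graph more flexible can be sketched in plain Python. This is an illustrative toy, not the authors' implementation: it runs the standard CTC forward recursion and, as an assumption, lets every transcript token also be matched by a wildcard whose score is the log of the mean non-blank probability plus a `penalty`. The names `star_logprob`, `bypass_ctc_logprob`, and `penalty` are hypothetical, and the paper itself builds the graph through WFST composition rather than an explicit recursion.

```python
import math

NEG_INF = float("-inf")


def logaddexp(a, b):
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def star_logprob(frame, blank=0):
    """Assumed wildcard score: log of the mean non-blank probability."""
    total = NEG_INF
    count = 0
    for i, lp in enumerate(frame):
        if i != blank:
            total = logaddexp(total, lp)
            count += 1
    return total - math.log(count)


def bypass_ctc_logprob(log_probs, labels, penalty=NEG_INF, blank=0):
    """CTC forward score with a penalized wildcard arc parallel to each label.

    log_probs: list of frames, each a list of per-class log-probabilities.
    labels:    transcript as a list of class indices (possibly imperfect).
    penalty:   log-domain cost of the wildcard; -inf gives plain CTC.
    """
    # Interleave blanks: [blank, l1, blank, l2, ..., blank].
    ext = [blank]
    for lab in labels:
        ext += [lab, blank]
    n_states = len(ext)

    def emit(t, s):
        frame = log_probs[t]
        if ext[s] == blank:
            return frame[blank]
        # Either the transcript token or the penalized wildcard explains it.
        return logaddexp(frame[ext[s]], star_logprob(frame, blank) + penalty)

    alpha = [NEG_INF] * n_states
    alpha[0] = emit(0, 0)
    if n_states > 1:
        alpha[1] = emit(0, 1)

    for t in range(1, len(log_probs)):
        new = [NEG_INF] * n_states
        for s in range(n_states):
            score = alpha[s]                      # stay in the same state
            if s >= 1:
                score = logaddexp(score, alpha[s - 1])  # advance one state
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                score = logaddexp(score, alpha[s - 2])  # skip a blank
            if score != NEG_INF:
                new[s] = score + emit(t, s)
        alpha = new

    result = alpha[n_states - 1]
    if n_states > 1:
        result = logaddexp(result, alpha[n_states - 2])
    return result


# Two frames, three classes (0 = blank); both frames strongly favor class 2.
frames = [[math.log(0.1), math.log(0.1), math.log(0.8)]] * 2
strict = bypass_ctc_logprob(frames, [1])                 # plain CTC
relaxed = bypass_ctc_logprob(frames, [1], penalty=-1.0)  # wildcard allowed
```

With `penalty` set to negative infinity the wildcard arcs vanish and the function reduces to ordinary CTC. With a finite penalty, a transcript that disagrees with the audio (class 1 written where class 2 was spoken) receives a higher score than under plain CTC, which is the kind of tolerance to transcript errors the abstract describes. The relaxed score is no longer a normalized probability.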
Related papers
- Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices [8.77712061194924]
We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models.
Our algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces into phonemes, avoiding explicit word representations.
We achieved up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.
(2024-09-24T21:42:25Z)
- Transcription-Free Fine-Tuning of Speech Separation Models for Noisy and Reverberant Multi-Speaker Automatic Speech Recognition [18.50957174600796]
One solution to automatic speech recognition (ASR) of overlapping speakers is to separate the speech and then perform ASR on the separated signals.
Currently, the separator produces artefacts which often degrade ASR performance.
This paper proposes a transcription-free method for joint training using only audio signals.
(2024-06-13T08:20:58Z)
- Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition [30.544499309503863]
Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data.
Human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models.
We propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties.
(2023-09-26T12:58:40Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
(2022-12-02T18:58:51Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
(2022-10-27T08:10:44Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
(2022-06-21T06:08:30Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
(2021-07-20T11:42:26Z)
- Fast End-to-End Speech Recognition via a Non-Autoregressive Model and Cross-Modal Knowledge Transferring from BERT [72.93855288283059]
We propose a non-autoregressive speech recognition model called LASO (Listen Attentively, and Spell Once).
The model consists of an encoder, a decoder, and a position-dependent summarizer (PDS).
(2021-02-15T15:18:59Z)
- End-to-end speech-to-dialog-act recognition [38.58540444573232]
We present an end-to-end model which directly converts speech into dialog acts without the deterministic transcription process.
In the proposed model, the dialog act recognition network is joined to an acoustic-to-word ASR model at its latent layer.
The entire network is fine-tuned in an end-to-end manner.
(2020-04-23T18:44:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.