Learning from Flawed Data: Weakly Supervised Automatic Speech
Recognition
- URL: http://arxiv.org/abs/2309.15796v1
- Date: Tue, 26 Sep 2023 12:58:40 GMT
- Title: Learning from Flawed Data: Weakly Supervised Automatic Speech
Recognition
- Authors: Dongji Gao, Hainan Xu, Desh Raj, Leibny Paola Garcia Perera, Daniel
Povey, Sanjeev Khudanpur
- Abstract summary: Training automatic speech recognition (ASR) systems requires large amounts of well-curated paired data.
Human annotators usually perform "non-verbatim" transcription, which can result in poorly trained models.
We propose Omni-temporal Classification (OTC), a novel training criterion that explicitly incorporates label uncertainties.
- Score: 30.544499309503863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Training automatic speech recognition (ASR) systems requires large amounts of
well-curated paired data. However, human annotators usually perform
"non-verbatim" transcription, which can result in poorly trained models. In
this paper, we propose Omni-temporal Classification (OTC), a novel training
criterion that explicitly incorporates label uncertainties originating from
such weak supervision. This allows the model to effectively learn speech-text
alignments while accommodating errors present in the training transcripts. OTC
extends the conventional CTC objective for imperfect transcripts by leveraging
weighted finite state transducers. Through experiments conducted on the
LibriSpeech and LibriVox datasets, we demonstrate that training ASR models with
OTC avoids performance degradation even with transcripts containing up to 70%
errors, a scenario where CTC models fail completely. Our implementation is
available at https://github.com/k2-fsa/icefall.
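The abstract states that OTC extends the CTC objective with weighted finite state transducers so that training tolerates errors in the transcripts (the released implementation lives in k2/icefall). As a rough, framework-free illustration of that general idea only, the sketch below marginalises a standard PyTorch CTC loss over a transcript and its deletion variants instead of trusting a single target. This is a hypothetical simplification, not the paper's WFST-based OTC graph; the function name `tolerant_ctc_loss` and the `max_drops` parameter are illustrative inventions.

```python
import itertools

import torch
import torch.nn.functional as F


def tolerant_ctc_loss(log_probs, transcript, blank=0, max_drops=1):
    """Toy error-tolerant CTC: marginalise the CTC loss over the given
    transcript and every variant with up to `max_drops` tokens deleted.
    `log_probs` has shape (T, V) and is already log-softmaxed.
    NOTE: this is a simplified stand-in for OTC, not the paper's method."""
    T, _ = log_probs.shape

    # Enumerate the original transcript plus its deletion variants.
    candidates = {tuple(transcript)}
    for k in range(1, max_drops + 1):
        for drop in itertools.combinations(range(len(transcript)), k):
            cand = tuple(t for i, t in enumerate(transcript) if i not in drop)
            if cand:  # skip the empty transcript
                candidates.add(cand)

    log_likelihoods = []
    for cand in candidates:
        target = torch.tensor(cand, dtype=torch.long).unsqueeze(0)  # (1, S)
        nll = F.ctc_loss(
            log_probs.unsqueeze(1),                 # (T, N=1, V)
            target,
            input_lengths=torch.tensor([T]),
            target_lengths=torch.tensor([len(cand)]),
            blank=blank,
            reduction="sum",
            zero_infinity=True,
        )
        log_likelihoods.append(-nll)                # log P(candidate | audio)

    # Pool probability mass over all candidates (log-sum-exp in log space),
    # so one noisy token in the transcript cannot dominate the loss.
    return -torch.logsumexp(torch.stack(log_likelihoods), dim=0)


if __name__ == "__main__":
    # Toy usage: random emissions over a 30-token vocabulary, 50 frames.
    torch.manual_seed(0)
    emissions = torch.randn(50, 30).log_softmax(dim=-1)
    loss = tolerant_ctc_loss(emissions, transcript=[7, 3, 12, 5], max_drops=1)
    print(loss.item())
```

The actual OTC criterion encodes such alternatives compactly as arcs in a WFST training graph rather than enumerating candidate transcripts, which is what makes it scale to high error rates.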
Related papers
- Spelling Correction through Rewriting of Non-Autoregressive ASR Lattices [8.77712061194924]
We present a finite-state transducer (FST) technique for rewriting wordpiece lattices generated by Transformer-based CTC models.
Our algorithm performs grapheme-to-phoneme (G2P) conversion directly from wordpieces into phonemes, avoiding explicit word representations.
We achieved up to a 15.2% relative reduction in sentence error rate (SER) on a test set with contextually relevant entities.
arXiv Detail & Related papers (2024-09-24T21:42:25Z)
- Bypass Temporal Classification: Weakly Supervised Automatic Speech Recognition with Imperfect Transcripts [44.16141704545044]
We present a novel algorithm for building an automatic speech recognition (ASR) model with imperfect training data.
The proposed algorithm improves the robustness and accuracy of ASR systems, particularly when working with imprecisely transcribed speech corpora.
arXiv Detail & Related papers (2023-06-01T14:56:19Z)
- Unsupervised Pre-Training For Data-Efficient Text-to-Speech On Low Resource Languages [15.32264927462068]
We propose an unsupervised pre-training method for a sequence-to-sequence TTS model by leveraging large untranscribed speech data.
The main idea is to pre-train the model to reconstruct de-warped mel-spectrograms from warped ones.
We empirically demonstrate the effectiveness of our proposed method in low-resource language scenarios.
arXiv Detail & Related papers (2023-03-28T01:26:00Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Semi-Supervised Spoken Language Understanding via Self-Supervised Speech and Language Model Pretraining [64.35907499990455]
We propose a framework to learn semantics directly from speech with semi-supervision from transcribed or untranscribed speech.
Our framework is built upon pretrained end-to-end (E2E) ASR and self-supervised language models, such as BERT.
In parallel, we identify two essential criteria for evaluating SLU models: environmental noise-robustness and E2E semantics evaluation.
arXiv Detail & Related papers (2020-10-26T18:21:27Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.
We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech.
Experiments show that with limited data far less than needed for training a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.