End-to-end Joint Rich and Normalized ASR with a limited amount of rich
training data
- URL: http://arxiv.org/abs/2311.17741v1
- Date: Wed, 29 Nov 2023 15:44:39 GMT
- Title: End-to-end Joint Rich and Normalized ASR with a limited amount of rich
training data
- Authors: Can Cui (MULTISPEECH), Imran Ahamad Sheikh, Mostafa Sadeghi
(MULTISPEECH), Emmanuel Vincent (MULTISPEECH)
- Abstract summary: We train a stateless Transducer-based E2E joint rich and normalized ASR system with a limited amount of rich labeled data.
The first approach leads to an E2E rich ASR model which performs better on out-of-domain data, with up to 9% relative reduction in errors.
The second approach demonstrates the feasibility of an E2E joint rich and normalized ASR system using as little as 5% rich training data, with a moderate (2.42% absolute) increase in errors.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Joint rich and normalized automatic speech recognition (ASR), that produces
transcriptions both with and without punctuation and capitalization, remains a
challenge. End-to-end (E2E) ASR models offer both convenience and the ability
to perform such joint transcription of speech. Training such models requires
paired speech and rich text data, which is not widely available. In this paper,
we compare two different approaches to train a stateless Transducer-based E2E
joint rich and normalized ASR system, ready for streaming applications, with a
limited amount of rich labeled data. The first approach uses a language model
to generate pseudo-rich transcriptions of normalized training data. The second
approach uses a single decoder conditioned on the type of the output. The first
approach leads to an E2E rich ASR model which performs better on out-of-domain
data, with up to 9% relative reduction in errors. The second approach
demonstrates the feasibility of an E2E joint rich and normalized ASR system
using as little as 5% rich training data, with a moderate (2.42% absolute)
increase in errors.
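The second approach conditions a single decoder on the desired output type. A minimal sketch of how such paired training targets could be built from rich transcripts (the `<rich>`/`<norm>` style tokens and the `normalize` helper are illustrative assumptions, not the paper's exact recipe):

```python
import re

# Hypothetical helper: derive the normalized target from a rich transcript.
# Rich transcripts keep punctuation and capitalization; normalized ones drop both.
def normalize(rich: str) -> str:
    no_punct = re.sub(r"[^\w\s']", "", rich)  # strip punctuation, keep apostrophes
    return re.sub(r"\s+", " ", no_punct).lower().strip()

# Hypothetical style tokens: one decoder serves both output types,
# with the token telling it which style to emit.
RICH, NORM = "<rich>", "<norm>"

def make_targets(rich: str) -> list[tuple[str, str]]:
    """Produce both (style token, target text) pairs from one rich transcript."""
    return [(RICH, rich), (NORM, normalize(rich))]
```

In this sketch each utterance yields two training examples, so even a small pool of rich transcripts contributes to both output styles.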
Related papers
- Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition [21.516152600963775]
Denoising LM (DLM) is a scaled error correction model trained with vast amounts of synthetic data.
DLM achieves 1.5% word error rate (WER) on test-clean and 3.3% WER on test-other on LibriSpeech.
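Word error rate, the metric quoted throughout these summaries, is the word-level edit distance between reference and hypothesis divided by the reference length; a minimal sketch:

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level Levenshtein distance over reference length."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(h) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = dp[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(r)][len(h)] / len(r)
```

For example, one substituted word out of three gives a WER of 1/3; production toolkits apply the same recurrence with additional text normalization.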
arXiv Detail & Related papers (2024-05-24T05:05:12Z)
- Transfer Learning from Pre-trained Language Models Improves End-to-End Speech Summarization [48.35495352015281]
End-to-end speech summarization (E2E SSum) directly summarizes input speech into easy-to-read short sentences with a single model.
Due to the high cost of collecting speech-summary pairs, an E2E SSum model tends to suffer from training data scarcity and output unnatural sentences.
We propose for the first time to integrate a pre-trained language model (LM) into the E2E SSum decoder via transfer learning.
arXiv Detail & Related papers (2023-06-07T08:23:58Z)
- Contextual Density Ratio for Language Model Biasing of Sequence to Sequence ASR Systems [2.4909170697740963]
We propose a contextual density ratio approach for both training a context-aware E2E model and adapting the language model to named entities.
Our proposed technique achieves a relative improvement of up to 46.5% on named entities over an E2E baseline, without degrading the overall recognition accuracy on the whole test set.
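Density-ratio language model biasing of this kind is typically an interpolation of the form below (a general formulation with an assumed interpolation weight $\lambda$, not necessarily this paper's exact contextual variant):

```latex
\log P(y \mid x) \;=\; \log P_{\mathrm{E2E}}(y \mid x)
\;+\; \lambda \left[ \log P_{\mathrm{target}}(y) - \log P_{\mathrm{source}}(y) \right]
```

The subtraction of the source-domain LM score approximately cancels the internal language model of the E2E system, so the target (e.g. named-entity-adapted) LM can steer decoding.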
arXiv Detail & Related papers (2022-06-29T13:12:46Z)
- End-to-end Contextual ASR Based on Posterior Distribution Adaptation for Hybrid CTC/Attention System [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model.
Although this simplifies the ASR system, it introduces a contextual ASR drawback: the E2E model performs worse on utterances containing infrequent proper nouns.
We propose to add a contextual bias attention (CBA) module to an attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
- Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI [67.13999010060057]
We propose a novel approach to integrate LF-MMI criterion into E2E ASR frameworks in both training and decoding stages.
Experiments suggest that the introduction of the LF-MMI criterion consistently leads to significant performance improvements.
arXiv Detail & Related papers (2021-12-05T07:30:17Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- SynthASR: Unlocking Synthetic Data for Speech Recognition [15.292920497489925]
We propose to utilize synthetic speech for ASR training (SynthASR) in applications where data is sparse or hard to get for ASR model training.
In our experiments, conducted on in-house datasets for a new application of recognizing medication names, training ASR RNN-T models with synthetic audio improved recognition performance on the new application by more than 65% relative.
arXiv Detail & Related papers (2021-06-14T23:26:44Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and subsequent LU systems can be reduced significantly, by 14% relative, with joint models trained using small amounts of in-domain data.
arXiv Detail & Related papers (2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.