End-to-End Speech Recognition: A Survey
- URL: http://arxiv.org/abs/2303.03329v1
- Date: Fri, 3 Mar 2023 01:46:41 GMT
- Title: End-to-End Speech Recognition: A Survey
- Authors: Rohit Prabhavalkar, Takaaki Hori, Tara N. Sainath, Ralf Schlüter,
Shinji Watanabe
- Abstract summary: The goal of this survey is to provide a taxonomy of E2E ASR models and corresponding improvements.
All relevant aspects of E2E ASR are covered in this work, accompanied by discussions of performance and deployment opportunities.
- Score: 68.35707678386949
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the last decade of automatic speech recognition (ASR) research, the
introduction of deep learning brought considerable reductions in word error
rate of more than 50% relative, compared to modeling without deep learning. In
the wake of this transition, a number of all-neural ASR architectures were
introduced. These so-called end-to-end (E2E) models provide highly integrated,
completely neural ASR models, which rely strongly on general machine learning
knowledge and learn more consistently from data, while depending less on ASR
domain-specific experience. The success and enthusiastic adoption of deep
learning accompanied by more generic model architectures have led to E2E models now
becoming the prominent ASR approach. The goal of this survey is to provide a
taxonomy of E2E ASR models and corresponding improvements, and to discuss their
properties and their relation to the classical hidden Markov model (HMM) based
ASR architecture. All relevant aspects of E2E ASR are covered in this work:
modeling, training, decoding, and external language model integration,
accompanied by discussions of performance and deployment opportunities, as well
as an outlook into potential future developments.
Related papers
- A Survey on All-in-One Image Restoration: Taxonomy, Evaluation and Future Trends [67.43992456058541]
Image restoration (IR) refers to the process of improving visual quality of images while removing degradation, such as noise, blur, weather effects, and so on.
Traditional IR methods typically target specific types of degradation, which limits their effectiveness in real-world scenarios with complex distortions.
The all-in-one image restoration (AiOIR) paradigm has emerged, offering a unified framework that adeptly addresses multiple degradation types.
(2024-10-19T11:11:09Z)
- Enhancing CTC-based speech recognition with diverse modeling units [2.723573795552244]
In recent years, the evolution of end-to-end (E2E) automatic speech recognition (ASR) models has been remarkable.
On top of E2E systems, researchers have achieved substantial accuracy improvements by rescoring the E2E model's N-best hypotheses with a phoneme-based model.
We propose an efficient joint training approach, where E2E models are trained jointly with diverse modeling units.
(2024-06-05T13:52:55Z)
- Integrating Pre-Trained Speech and Language Models for End-to-End Speech Recognition [12.77573161345651]
This paper proposes integrating a pre-trained speech representation model and a large language model (LLM) for E2E ASR.
The proposed model enables the optimization of the entire ASR process, including acoustic feature extraction and acoustic and language modeling.
(2023-12-06T18:34:42Z)
- ASR-Aware End-to-end Neural Diarization [15.172086811068962]
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both acoustic input and features derived from an automatic speech recognition (ASR) model.
Three modifications to the Conformer-based EEND architecture are proposed to incorporate the features.
Experiments on the two-speaker English conversations of Switchboard+SRE data sets show that multi-task learning with position-in-word information is the most effective way of utilizing ASR features.
(2022-02-02T21:17:14Z)
- Consistent Training and Decoding For End-to-end Speech Recognition Using Lattice-free MMI [67.13999010060057]
We propose a novel approach to integrate the LF-MMI criterion into E2E ASR frameworks in both training and decoding stages.
Experiments suggest that the introduction of the LF-MMI criterion consistently leads to significant performance improvements.
(2021-12-05T07:30:17Z)
- Integrating Categorical Features in End-to-End ASR [1.332560004325655]
All-neural, end-to-end ASR systems convert speech input to text units using a single trainable neural network model.
E2E models require large amounts of paired speech-text data, which is expensive to obtain.
We propose a simple yet effective way to integrate categorical features into E2E models.
(2021-10-06T20:07:53Z)
- SynthASR: Unlocking Synthetic Data for Speech Recognition [15.292920497489925]
We propose to utilize synthetic speech for ASR training (SynthASR) in applications where data is sparse or hard to get for ASR model training.
In our experiments conducted on in-house datasets for a new application of recognizing medication names, training ASR RNN-T models with synthetic audio improved the recognition performance on the new application by more than 65% relative.
(2021-06-14T23:26:44Z)
- Towards Lifelong Learning of End-to-end ASR [81.15661413476221]
Lifelong learning aims to enable a machine to sequentially learn new tasks from new datasets describing the changing real world without forgetting the previously learned knowledge.
An overall relative reduction of 28.7% in WER was achieved compared to the fine-tuning baseline when sequentially learning on three very different benchmark corpora.
(2021-04-04T13:48:53Z)
- CorDEL: A Contrastive Deep Learning Approach for Entity Linkage [70.82533554253335]
Entity linkage (EL) is a critical problem in data cleaning and integration.
With the ever-increasing growth of new data, deep learning (DL) based approaches have been proposed to alleviate the high cost of EL associated with the traditional models.
We argue that the twin-network architecture is sub-optimal for EL, leading to inherent drawbacks of existing models.
(2020-09-15T16:33:05Z)
- Joint Contextual Modeling for ASR Correction and Language Understanding [60.230013453699975]
We propose multi-task neural approaches to perform contextual language correction on ASR outputs jointly with language understanding (LU).
We show that the error rates of off-the-shelf ASR and following LU systems can be reduced significantly by 14% relative with joint models trained using small amounts of in-domain data.
(2020-01-28T22:09:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.