A review of on-device fully neural end-to-end automatic speech
recognition algorithms
- URL: http://arxiv.org/abs/2012.07974v2
- Date: Sat, 19 Dec 2020 08:27:51 GMT
- Title: A review of on-device fully neural end-to-end automatic speech
recognition algorithms
- Authors: Chanwoo Kim, Dhananjaya Gowda, Dongsoo Lee, Jiyeon Kim, Ankur Kumar,
Sungsoo Kim, Abhinav Garg, and Changwoo Han
- Abstract summary: We review various end-to-end automatic speech recognition algorithms and their optimization techniques for on-device applications.
Recently, fully neural network end-to-end speech recognition algorithms have been proposed.
We extensively discuss their structures, performance, and advantages compared to conventional algorithms.
- Score: 20.469868150587075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we review various end-to-end automatic speech recognition
algorithms and their optimization techniques for on-device applications.
Conventional speech recognition systems comprise a large number of discrete
components such as an acoustic model, a language model, a pronunciation model,
a text-normalizer, an inverse-text normalizer, a decoder based on a Weighted
Finite State Transducer (WFST), and so on. To obtain sufficiently high speech
recognition accuracy with such conventional speech recognition systems, a very
large language model (up to 100 GB) is usually needed. Hence, the corresponding
WFST size becomes enormous, which prohibits their on-device implementation.
Recently, fully neural network end-to-end speech recognition algorithms have
been proposed. Examples include speech recognition systems based on
Connectionist Temporal Classification (CTC), Recurrent Neural Network
Transducer (RNN-T), Attention-based Encoder-Decoder models (AED), Monotonic
Chunk-wise Attention (MoChA), transformer-based speech recognition systems, and
so on. These fully neural network-based systems require much smaller memory
footprints than conventional algorithms; therefore, their on-device
implementation has become feasible. In this paper, we review such end-to-end
speech recognition models. We extensively discuss their structures,
performance, and advantages compared to conventional algorithms.
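
The contrast with WFST-based decoding can be made concrete with a small code sketch. The following PyTorch snippet is illustrative only; the model size, feature dimension, and vocabulary are assumptions, not taken from the paper. It trains a tiny CTC model and decodes it greedily, with no lexicon, language model, or decoder graph:

```python
# Minimal CTC sketch (illustrative; not the paper's model).
# A single recurrent encoder is trained with CTC loss and decoded
# greedily, with no WFST, lexicon, or external language model.
import torch
import torch.nn as nn

VOCAB = 30          # hypothetical: 29 characters + 1 CTC blank (index 0)
FEATS = 80          # e.g. 80-dim log-mel filterbank features

class TinyCTCModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(FEATS, 256, num_layers=2, batch_first=True)
        self.proj = nn.Linear(256, VOCAB)

    def forward(self, x):                    # x: (batch, time, FEATS)
        h, _ = self.encoder(x)
        return self.proj(h).log_softmax(-1)  # (batch, time, VOCAB)

model = TinyCTCModel()
ctc_loss = nn.CTCLoss(blank=0)

x = torch.randn(4, 200, FEATS)              # dummy batch of 4 utterances
targets = torch.randint(1, VOCAB, (4, 20))
in_lens = torch.full((4,), 200, dtype=torch.long)
tgt_lens = torch.full((4,), 20, dtype=torch.long)

log_probs = model(x)
loss = ctc_loss(log_probs.transpose(0, 1), targets, in_lens, tgt_lens)
loss.backward()

# Greedy decoding: take the best label per frame, then collapse
# repeats and drop blanks -- no decoder graph needed.
best = log_probs.argmax(-1)                 # (batch, time)
hyp = [k.item() for k in best[0].unique_consecutive() if k != 0]
```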
Related papers
- SONAR: A Synthetic AI-Audio Detection Framework and Benchmark [59.09338266364506]
SONAR is a synthetic AI-audio detection framework and benchmark.
It aims to provide a comprehensive evaluation for distinguishing cutting-edge AI-synthesized auditory content from real audio.
It is the first framework to uniformly benchmark AI-audio detection across both traditional and foundation model-based deepfake detection systems.
arXiv Detail & Related papers (2024-10-06T01:03:42Z)
- Automatic Speech Recognition for Hindi [0.6292138336765964]
The research involved developing a web application and designing a web interface for speech recognition.
The web application manages large volumes of audio files and their transcriptions, facilitating human correction of ASR transcripts.
The web interface for speech recognition records 16 kHz mono audio from any device running the web app, performs voice activity detection (VAD), and sends the audio to the recognition engine.
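
For illustration, a minimal energy-based VAD over 16 kHz mono audio might look like the sketch below; the paper does not specify its VAD algorithm, and the frame size and threshold here are assumptions:

```python
# Illustrative energy-based VAD for 16 kHz mono audio (a common
# baseline; the paper does not specify its VAD algorithm).
import numpy as np

def simple_vad(samples, rate=16000, frame_ms=30, threshold_db=-35.0):
    """Return a boolean speech/non-speech flag per frame."""
    frame_len = int(rate * frame_ms / 1000)        # 480 samples at 16 kHz
    n_frames = len(samples) // frame_len
    frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
    # Root-mean-square energy per frame, in dB relative to full scale.
    rms = np.sqrt(np.mean(frames.astype(np.float64) ** 2, axis=1) + 1e-12)
    energy_db = 20.0 * np.log10(rms + 1e-12)
    return energy_db > threshold_db

audio = np.random.randn(16000).astype(np.float32) * 0.1   # 1 s dummy audio
speech_frames = simple_vad(audio)
print(f"{speech_frames.sum()} of {len(speech_frames)} frames flagged as speech")
```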
arXiv Detail & Related papers (2024-06-26T07:39:20Z)
- The evaluation of a code-switched Sepedi-English automatic speech recognition system [0.0]
We present the evaluation of the Sepedi-English code-switched automatic speech recognition system.
This end-to-end system was developed using the Sepedi Prompted Code Switching corpus and the CTC approach.
The model produced its lowest WER of 41.9%; however, it faced challenges in recognizing Sepedi-only text.
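
The reported WER is the standard word-level edit distance normalized by the reference length; a minimal reference implementation (the example sentence is hypothetical):

```python
# Word error rate (WER): Levenshtein distance between reference and
# hypothesis word sequences, divided by the reference length.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("ke a leboga thank you", "ke leboga thank you"))  # 0.2
```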
arXiv Detail & Related papers (2024-03-11T15:11:28Z)
- Cross-Speaker Encoding Network for Multi-Talker Speech Recognition [74.97576062152709]
The Cross-Speaker Encoding network addresses the limitations of SIMO models by aggregating cross-speaker representations.
It is further integrated with SOT to leverage the advantages of both SIMO and SISO.
arXiv Detail & Related papers (2024-01-08T16:37:45Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
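
As a rough illustration of inducing a pseudo language, one can cluster frame-level speech features into discrete units and collapse consecutive repeats. The sketch below shows this general idea with dummy features; it is not Wav2Seq's exact recipe:

```python
# Sketch of inducing a pseudo language from speech features:
# cluster frame-level representations into discrete units, then
# collapse consecutive repeats into a compact token sequence.
# (Illustrative only; Wav2Seq's actual recipe differs in detail.)
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 39))     # dummy frame-level features

kmeans = KMeans(n_clusters=25, n_init=10, random_state=0).fit(features)
unit_ids = kmeans.predict(features)        # one discrete unit per frame

# Deduplicate consecutive units -> pseudo-language "tokens".
tokens = [unit_ids[0]]
for u in unit_ids[1:]:
    if u != tokens[-1]:
        tokens.append(u)
print(len(unit_ids), "frames ->", len(tokens), "pseudo tokens")
```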
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
- Knowledge Transfer from Large-scale Pretrained Language Models to End-to-end Speech Recognizers [13.372686722688325]
Training of end-to-end speech recognizers always requires transcribed utterances.
This paper proposes a method for alleviating this issue by transferring knowledge from a language model neural network that can be pretrained with text-only data.
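
One common way to realize such transfer is distillation: nudging the recognizer's token distribution toward the pretrained language model's soft targets on text. The sketch below illustrates only the distillation loss, with hypothetical tensors and shapes; it is not the paper's exact method:

```python
# Sketch of text-only knowledge transfer via distillation:
# push the recognizer's token distribution toward a pretrained
# LM's soft targets (hypothetical shapes; not the paper's recipe).
import torch
import torch.nn.functional as F

vocab, T = 1000, 12
student_logits = torch.randn(T, vocab, requires_grad=True)  # ASR decoder
with torch.no_grad():
    teacher_logits = torch.randn(T, vocab)                  # pretrained LM

temp = 2.0  # softening temperature, a standard distillation trick
kd_loss = F.kl_div(
    F.log_softmax(student_logits / temp, dim=-1),
    F.softmax(teacher_logits / temp, dim=-1),
    reduction="batchmean",
) * temp ** 2
kd_loss.backward()
```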
arXiv Detail & Related papers (2022-02-16T07:02:24Z)
- Revisiting joint decoding based multi-talker speech recognition with DNN acoustic model [34.061441900912136]
We argue that the common scheme of decoding each speaker's output stream separately is sub-optimal and propose a principled solution that decodes all speakers jointly.
We modify the acoustic model to predict joint state posteriors for all speakers, enabling the network to express uncertainty about the attribution of parts of the speech signal to the speakers.
arXiv Detail & Related papers (2021-10-31T09:28:04Z)
- Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition [62.997667081978825]
We present an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly.
We demonstrate that, through this mechanism, the system recognizes more than 85% of newly added words that it previously failed to recognize.
arXiv Detail & Related papers (2021-07-05T21:08:34Z)
- Speech Command Recognition in Computationally Constrained Environments with a Quadratic Self-organized Operational Layer [92.37382674655942]
We propose a network layer to enhance the speech command recognition capability of a lightweight network.
The method borrows ideas from the Taylor expansion and quadratic forms to construct a better representation of features in both the input and hidden layers.
This richer representation results in recognition accuracy improvement as shown by extensive experiments on Google speech commands (GSC) and synthetic speech commands (SSC) datasets.
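
In the spirit of that description, a quadratic layer can augment the usual affine map with an elementwise second-order term, like a truncated Taylor expansion. The sketch below is an illustration, not the paper's exact self-organized operator:

```python
# Sketch of a quadratic layer: augment the usual affine map W1*x + b
# with an elementwise second-order term, like a truncated Taylor
# expansion. (Illustrative; not the paper's exact operator.)
import torch
import torch.nn as nn

class QuadraticLayer(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.linear = nn.Linear(dim_in, dim_out)            # first-order term
        self.quad = nn.Linear(dim_in, dim_out, bias=False)  # second-order term

    def forward(self, x):
        return self.linear(x) + self.quad(x * x)

layer = QuadraticLayer(40, 64)
out = layer(torch.randn(8, 40))   # (8, 64)
```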
arXiv Detail & Related papers (2020-11-23T14:40:18Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose AutoSpeech, the first neural architecture search approach for speaker recognition tasks.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 back-bones, while enjoying lower model complexity.
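
Schematically, deriving a CNN by stacking one cell might look like the sketch below; the cell here is a plain convolutional stand-in, not the architecture actually found by the search:

```python
# Schematic of deriving a CNN by stacking one searched cell multiple
# times (the cell here is a stand-in, not the actual searched cell).
import torch
import torch.nn as nn

def make_cell(channels):
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    )

class StackedCellNet(nn.Module):
    def __init__(self, channels=32, num_cells=6, num_speakers=100):
        super().__init__()
        self.stem = nn.Conv2d(1, channels, 3, padding=1)
        self.cells = nn.Sequential(*[make_cell(channels) for _ in range(num_cells)])
        self.head = nn.Linear(channels, num_speakers)

    def forward(self, x):                     # x: (batch, 1, freq, time)
        h = self.cells(self.stem(x))
        return self.head(h.mean(dim=(2, 3)))  # pool, then classify speaker

net = StackedCellNet()
logits = net(torch.randn(2, 1, 64, 100))      # (2, 100)
```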
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.