Attention based on-device streaming speech recognition with large speech
corpus
- URL: http://arxiv.org/abs/2001.00577v1
- Date: Thu, 2 Jan 2020 04:24:44 GMT
- Title: Attention based on-device streaming speech recognition with large speech
corpus
- Authors: Kwangyoun Kim, Kyungmin Lee, Dhananjaya Gowda, Junmo Park, Sungsoo
Kim, Sichen Jin, Young-Yoon Lee, Jinsu Yeo, Daehyun Kim, Seokyeong Jung,
Jungin Lee, Myoungji Han, Chanwoo Kim
- Abstract summary: We present a new on-device automatic speech recognition (ASR) system based on monotonic chunk-wise attention (MoChA) models trained on a large (> 10K hours) corpus.
We attained a word recognition rate of around 90% for the general domain, mainly by using joint training with connectionist temporal classification (CTC) and cross entropy (CE) losses.
For on-demand adaptation, we fused the MoChA models with statistical n-gram models and achieved a relative improvement of 36% on average in word error rate (WER) for target domains, including the general domain.
- Score: 16.702653972113023
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a new on-device automatic speech recognition
(ASR) system based on monotonic chunk-wise attention (MoChA) models trained on
a large (> 10K hours) corpus. We attained a word recognition rate of around
90% for the general domain, mainly by using joint training with connectionist
temporal classification (CTC) and cross entropy (CE) losses, minimum word
error rate (MWER) training, layer-wise pre-training, and data augmentation
methods. In addition, we compressed our models to more than 3.4 times smaller
using an iterative hyper low-rank approximation (LRA) method while minimizing
the degradation in recognition accuracy. The memory footprint was further
reduced with 8-bit quantization, bringing the final model size below 39 MB.
For on-demand adaptation, we fused the MoChA models with statistical n-gram
models and achieved a relative improvement of 36% on average in word error
rate (WER) for target domains, including the general domain.
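The on-demand adaptation step fuses the end-to-end model with an n-gram language model. A common way to do this is shallow fusion, a log-linear interpolation of the two models' scores; the sketch below illustrates the idea with made-up token scores and an assumed interpolation weight, not the authors' exact fusion scheme.

```python
import math

def fused_score(asr_logprob, lm_logprob, lm_weight=0.3):
    """Log-linear interpolation used in shallow fusion."""
    return asr_logprob + lm_weight * lm_logprob

# Candidate next tokens with hypothetical log-probabilities from the
# end-to-end (MoChA-style) model and an in-domain n-gram LM. The acoustics
# slightly favor "wreck", but the LM steers toward the in-domain word.
candidates = {
    "recognition": (math.log(0.40), math.log(0.50)),
    "wreck":       (math.log(0.45), math.log(0.05)),
}

best = max(candidates, key=lambda w: fused_score(*candidates[w]))
```

In a real decoder this fused score is applied to every hypothesis extension during beam search, so the n-gram model biases recognition toward target-domain vocabulary without retraining the acoustic model.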
Related papers
- Optimization of DNN-based speaker verification model through efficient quantization technique [15.250677730668466]
Quantization of deep models offers a means to reduce both computational and memory expenses.
Our research proposes an optimization framework for the quantization of the speaker verification model.
arXiv Detail & Related papers (2024-07-12T05:03:10Z)
- Neural Language Model Pruning for Automatic Speech Recognition [4.10609794373612]
We study model pruning methods applied to Transformer-based neural network language models for automatic speech recognition.
We explore three aspects of the pruning framework, namely criterion, method and scheduler, analyzing their contribution in terms of accuracy and inference speed.
arXiv Detail & Related papers (2023-10-05T10:01:32Z)
- Efficient Speech Representation Learning with Low-Bit Quantization [32.75829498841329]
We apply and investigate recent quantization techniques on speech representation learning models.
With aggressive quantization to 1 bit, we achieved an 86.32% storage reduction (184.42 -> 25.23) and an 88% estimated runtime reduction (1.00 -> 0.12), at the cost of an increased word error rate (7.06 -> 15.96).
In comparison with DistilHuBERT, which also aims for model compression, the 2-bit configuration yielded slightly smaller storage (35.84 vs. 46.98), a better word error rate (12.68 vs. 13.37) and a more efficient estimated runtime (0.15 vs. 0.73).
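The reduction percentages quoted above follow directly from the raw before/after numbers; a quick arithmetic check (units as reported in the abstract):

```python
# Storage: 184.42 -> 25.23; estimated runtime: 1.00 -> 0.12.
storage_reduction = 1 - 25.23 / 184.42   # fraction of storage saved
runtime_reduction = 1 - 0.12 / 1.00      # fraction of runtime saved

print(f"{storage_reduction:.2%}")  # 86.32%
print(f"{runtime_reduction:.0%}")  # 88%
```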
arXiv Detail & Related papers (2022-12-14T06:09:08Z)
- Pushing the Limits of Asynchronous Graph-based Object Detection with Event Cameras [62.70541164894224]
We introduce several architecture choices which allow us to scale the depth and complexity of such models while maintaining low computation.
Our method runs 3.7 times faster than a dense graph neural network, taking only 8.4 ms per forward pass.
arXiv Detail & Related papers (2022-11-22T15:14:20Z)
- A Conformer Based Acoustic Model for Robust Automatic Speech Recognition [63.242128956046024]
The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation.
The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling.
The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus.
arXiv Detail & Related papers (2022-03-01T20:17:31Z)
- An Efficient Deep Learning Model for Automatic Modulation Recognition Based on Parameter Estimation and Transformation [3.3941243094128035]
This letter proposes an efficient DL-AMR model based on phase parameter estimation and transformation.
Our model is more competitive in training time and test time than the benchmark models with similar recognition accuracy.
arXiv Detail & Related papers (2021-10-11T03:28:28Z)
- DEEPF0: End-To-End Fundamental Frequency Estimation for Music and Speech Signals [11.939409227407769]
We propose a novel pitch estimation technique called DeepF0.
It leverages the available annotated data to directly learn from the raw audio in a data-driven manner.
arXiv Detail & Related papers (2021-02-11T23:11:22Z)
- Adaptive Quantization of Model Updates for Communication-Efficient Federated Learning [75.45968495410047]
Communication of model updates between client nodes and the central aggregating server is a major bottleneck in federated learning.
Gradient quantization is an effective way of reducing the number of bits required to communicate each model update.
We propose an adaptive quantization strategy called AdaFL that aims to achieve communication efficiency as well as a low error floor.
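As an illustration of the kind of gradient quantization that AdaFL builds on, here is a minimal sketch of unbiased stochastic uniform quantization at a configurable bit-width; the bit-widths and gradient values are illustrative assumptions, and AdaFL's adaptive scheduling is not reproduced.

```python
import random

def quantize_gradient(grad, bits):
    """Stochastically quantize a gradient vector to 2**bits uniform levels.

    Stochastic rounding makes the quantizer unbiased in expectation, the
    usual requirement for convergence of quantized SGD-style updates.
    """
    levels = 2 ** bits - 1
    g_max = max(abs(g) for g in grad)
    if g_max == 0.0:
        return list(grad)
    out = []
    for g in grad:
        x = (g / g_max + 1.0) / 2.0 * levels   # map [-g_max, g_max] -> [0, levels]
        low = int(x)
        frac = x - low
        q = low + (random.random() < frac)     # round up with probability frac
        out.append((q / levels * 2.0 - 1.0) * g_max)
    return out

grad = [0.8, -0.3, 0.05, -1.0]
coarse = quantize_gradient(grad, bits=2)   # only 4 levels per coordinate
fine = quantize_gradient(grad, bits=8)     # close to the original vector
```

Fewer bits mean fewer communicated bits per coordinate but a coarser, noisier update; an adaptive strategy like AdaFL trades these off over the course of training.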
arXiv Detail & Related papers (2021-02-08T19:14:21Z)
- Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z)
- Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations [67.18006078950337]
We use the global context information to enhance important channels and recalibrate salient time-frequency locations.
The proposed modules, together with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset.
arXiv Detail & Related papers (2020-09-02T01:07:29Z)
- Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
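The Straight-Through Estimator mentioned in the abstract above can be sketched on a one-parameter toy model: the forward pass uses the quantized weight, while the gradient is applied "straight through" to the full-precision weight as if the quantizer were the identity. The loss, quantizer step, and target value below are illustrative assumptions, not the paper's Quant-Noise setup.

```python
def fake_quant(w, step=0.1):
    """Coarse uniform quantizer applied in the forward pass."""
    return round(w / step) * step

def train_step(w, x, y, lr=0.1):
    w_q = fake_quant(w)            # forward pass sees the quantized weight
    pred = w_q * x
    grad = 2.0 * (pred - y) * x    # d/dw_q of the squared error (pred - y)**2
    # STE: apply the gradient to the full-precision weight, treating the
    # quantizer as identity in the backward pass.
    return w - lr * grad

w = 0.0
for _ in range(50):
    w = train_step(w, x=1.0, y=0.73)
# w settles near the quantization level closest to the target.
```

Keeping a full-precision copy of the weight lets small gradient steps accumulate across quantization levels, which is why this trick makes low-bit training workable at all.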
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.