ContextNet: Improving Convolutional Neural Networks for Automatic Speech
Recognition with Global Context
- URL: http://arxiv.org/abs/2005.03191v3
- Date: Sat, 16 May 2020 00:49:21 GMT
- Authors: Wei Han, Zhengdong Zhang, Yu Zhang, Jiahui Yu, Chung-Cheng Chiu, James
Qin, Anmol Gulati, Ruoming Pang, Yonghui Wu
- Abstract summary: We propose a novel CNN-RNN-transducer architecture, which we call ContextNet.
ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules.
We demonstrate that ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks (CNN) have shown promising results for
end-to-end speech recognition, albeit still behind other state-of-the-art
methods in performance. In this paper, we study how to bridge this gap and go
beyond with a novel CNN-RNN-transducer architecture, which we call ContextNet.
ContextNet features a fully convolutional encoder that incorporates global
context information into convolution layers by adding squeeze-and-excitation
modules. In addition, we propose a simple scaling method that scales the width
of ContextNet, achieving a good trade-off between computation and accuracy. We
demonstrate that on the widely used LibriSpeech benchmark, ContextNet achieves
a word error rate (WER) of 2.1%/4.6% without external language model (LM),
1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy
LibriSpeech test sets. This compares to the previous best published system of
2.0%/4.6% with LM and 3.9%/11.3% with 20M parameters. The superiority of the
proposed ContextNet model is also verified on a much larger internal dataset.
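The squeeze-and-excitation (SE) mechanism referenced in the abstract can be sketched in a few lines of plain Python. This is an illustrative toy, not the authors' implementation: the function name, dimensions, and weights are invented, and in ContextNet the SE module sits inside each convolutional block.

```python
import math

def squeeze_and_excite(features, w1, w2):
    """Gate a [time][channel] feature map with one global context vector.

    features: T frames, each a list of C channel activations.
    w1: C x H bottleneck weights (squeeze -> hidden, ReLU).
    w2: H x C weights (hidden -> per-channel sigmoid gates).
    """
    T, C = len(features), len(features[0])
    # Squeeze: average over the whole utterance -> one value per channel.
    context = [sum(frame[c] for frame in features) / T for c in range(C)]
    # Excitation: bottleneck + ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, sum(context[i] * w1[i][j] for i in range(C)))
              for j in range(len(w1[0]))]
    gates = [1.0 / (1.0 + math.exp(-sum(hidden[j] * w2[j][c]
                                        for j in range(len(hidden)))))
             for c in range(C)]
    # Scale: the same per-channel gates modulate every frame.
    return [[frame[c] * gates[c] for c in range(C)] for frame in features]

# Toy example: 2 frames, 2 channels, bottleneck of size 1.
feats = [[1.0, 2.0], [3.0, 4.0]]
out = squeeze_and_excite(feats, w1=[[0.1], [0.1]], w2=[[0.5, -0.5]])
```

Because the gates are computed from an average over all frames, every output frame is modulated by information from the whole utterance, which is the "global context" the paper adds to otherwise local convolutions.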
Related papers
- Exploring SSL Discrete Speech Features for Zipformer-based Contextual ASR [74.38242498079627]
Self-supervised learning (SSL) based discrete speech representations are highly compact and domain adaptable.
In this paper, SSL discrete speech features extracted from WavLM models are used as additional cross-utterance acoustic context features in Zipformer-Transducer ASR systems.
arXiv Detail & Related papers (2024-09-13T13:01:09Z)
- RedBit: An End-to-End Flexible Framework for Evaluating the Accuracy of Quantized CNNs [9.807687918954763]
Convolutional Neural Networks (CNNs) have become the standard class of deep neural network for image processing, classification and segmentation tasks.
RedBit is an open-source framework that provides a transparent, easy-to-use interface to evaluate the impact of different quantization algorithms on network accuracy.
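As a rough illustration of what quantizing a CNN's weights involves, a minimal min-max uniform quantizer might look like the following. This is a generic textbook scheme with an invented function name, not RedBit's actual API; RedBit evaluates many such algorithms.

```python
def quantize_uniform(weights, bits):
    """Quantize floats to 2**bits evenly spaced levels between min and max.

    A generic min-max uniform scheme for illustration only; real
    quantization frameworks support many variants of this idea.
    """
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    # Snap each weight to its nearest representable level.
    return [lo + round((w - lo) / step) * step for w in weights]

# 2-bit quantization leaves only 4 representable values.
q = quantize_uniform([-1.0, -0.3, 0.2, 1.0], bits=2)
```

The accuracy question such frameworks study is precisely how much the snapping error introduced here degrades a trained network's predictions at each bit width.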
arXiv Detail & Related papers (2023-01-15T21:27:35Z)
- Towards Accurate Binary Neural Networks via Modeling Contextual Dependencies [52.691032025163175]
Existing Binary Neural Networks (BNNs) operate mainly on local convolutions with a binarization function.
We present new designs of binary neural modules that outperform existing binary neural modules by a large margin.
arXiv Detail & Related papers (2022-09-03T11:51:04Z)
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
- Elastic-Link for Binarized Neural Network [9.83865304744923]
The "Elastic-Link" (EL) module enriches information flow within a BNN by adaptively adding real-valued input features to the subsequent convolutional output features.
EL produces a significant improvement on the challenging large-scale ImageNet dataset.
With the integration of ReActNet, it yields a new state-of-the-art result of 71.9% top-1 accuracy.
arXiv Detail & Related papers (2021-12-19T13:49:29Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- NeuralScale: Efficient Scaling of Neurons for Resource-Constrained Deep Neural Networks [16.518667634574026]
We search for the neuron (filter) configuration of a fixed network architecture that maximizes accuracy.
We parameterize the change of the neuron (filter) number of each layer with respect to the change in parameters, allowing us to efficiently scale an architecture across arbitrary sizes.
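Width scaling, whether ContextNet's single global multiplier or NeuralScale's searched per-layer configuration, amounts to choosing per-layer channel counts under a parameter budget. A toy sketch, with all channel plans and numbers invented for illustration:

```python
def scale_widths(channels, alpha):
    """Multiply every layer's channel count by a global width factor alpha.

    Hypothetical uniform scaling (in the spirit of ContextNet's width
    multiplier); NeuralScale instead searches a per-layer configuration.
    """
    return [max(1, round(c * alpha)) for c in channels]

def conv_params(channels, kernel=3, in_dim=80):
    """Rough weight count for a stack of 1-D convolutions (bias ignored)."""
    total, prev = 0, in_dim
    for c in channels:
        total += prev * c * kernel  # in_channels * out_channels * kernel
        prev = c
    return total

base = [256, 256, 512]            # made-up channel plan for a small encoder
small = scale_widths(base, 0.5)   # halve every layer's width
```

Since each layer's weight count is roughly proportional to the product of its input and output widths, halving every width cuts the parameter count by about a factor of four, which is the computation-accuracy knob both papers exploit.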
arXiv Detail & Related papers (2020-06-23T08:14:02Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in automatic speech recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/testother.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.