Conformer: Convolution-augmented Transformer for Speech Recognition
- URL: http://arxiv.org/abs/2005.08100v1
- Date: Sat, 16 May 2020 20:56:25 GMT
- Title: Conformer: Convolution-augmented Transformer for Speech Recognition
- Authors: Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang,
Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, Ruoming Pang
- Abstract summary: Recently, Transformer- and convolutional neural network (CNN)-based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer.
On the widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Transformer- and convolutional neural network (CNN)-based models have
shown promising results in Automatic Speech Recognition (ASR), outperforming
recurrent neural networks (RNNs). Transformer models are good at capturing
content-based global interactions, while CNNs exploit local features
effectively. In this work, we achieve the best of both worlds by studying how
to combine convolutional neural networks and transformers to model both the local and
global dependencies of an audio sequence in a parameter-efficient way. To this
end, we propose the convolution-augmented transformer for speech
recognition, named Conformer. Conformer significantly outperforms the previous
Transformer- and CNN-based models, achieving state-of-the-art accuracies. On the
widely used LibriSpeech benchmark, our model achieves a WER of 2.1%/4.3% without
using a language model and 1.9%/3.9% with an external language model on
test/test-other. We also observe competitive performance of 2.7%/6.3% with a
small model of only 10M parameters.
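To make the architecture concrete, below is a minimal PyTorch sketch of the Conformer block as described in the paper: two half-step feed-forward modules sandwiching multi-head self-attention and a convolution module (pointwise convolution with GLU, depthwise convolution, batch norm, Swish). The paper's relative positional encoding, dropout, and published layer sizes are omitted; the dimensions here are illustrative defaults, not the published configurations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvModule(nn.Module):
    """Conformer convolution module: pointwise conv + GLU, depthwise conv,
    batch norm, Swish, pointwise conv (dropout omitted)."""
    def __init__(self, dim: int, kernel_size: int = 31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, 1)   # doubled for the GLU gate
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pointwise2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)                  # -> (batch, dim, time)
        y = F.glu(self.pointwise1(y), dim=1)
        y = F.silu(self.bn(self.depthwise(y)))            # Swish == SiLU
        return self.pointwise2(y).transpose(1, 2)

class ConformerBlock(nn.Module):
    """FFN/2 -> MHSA -> conv module -> FFN/2, each with a residual,
    then a final LayerNorm (the 'macaron' sandwich from the paper)."""
    def __init__(self, dim: int = 256, heads: int = 4,
                 ff_mult: int = 4, kernel_size: int = 31):
        super().__init__()
        def ffn() -> nn.Sequential:
            return nn.Sequential(nn.LayerNorm(dim),
                                 nn.Linear(dim, ff_mult * dim), nn.SiLU(),
                                 nn.Linear(ff_mult * dim, dim))
        self.ff1, self.ff2 = ffn(), ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim, kernel_size)
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + 0.5 * self.ff1(x)                 # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = x + self.conv(x)                      # local modeling via convolution
        x = x + 0.5 * self.ff2(x)                 # second half-step feed-forward
        return self.final_norm(x)
```

In the paper, a stack of these blocks sits on top of a convolutional subsampling front end that first reduces the frame rate of the input spectrogram.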
Related papers
- Improving Transformer-based Networks With Locality For Automatic Speaker Verification (arXiv 2023-02-17)
Transformer-based architectures have been explored for speaker embedding extraction.
In this study, we enhance the Transformer with locality modeling in two directions.
We evaluate the proposed approaches on the VoxCeleb datasets and a large-scale Microsoft internal multilingual (MS-internal) dataset.
- Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation (arXiv 2022-11-09)
We propose a training procedure for efficient CNNs based on offline knowledge distillation (KD) from high-performing yet complex transformers.
We provide models at different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of 0.483 mAP on AudioSet.
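The entry above describes offline distillation from a transformer teacher into a CNN student. The sketch below uses the standard temperature-scaled distillation loss purely as an illustration; it is a generic formulation, not the paper's exact recipe (AudioSet tagging is multi-label, so the authors' loss will differ in detail).

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 2.0, alpha: float = 0.5) -> torch.Tensor:
    # Soft targets: student matches the teacher's temperature-softened outputs.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # rescale gradient magnitude
    # Hard targets: the usual supervised loss on ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```

"Offline" here means the teacher's logits can be precomputed once and stored, so only the small CNN runs during training.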
- Bayesian Neural Network Language Modeling for Speech Recognition (arXiv 2022-08-28)
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
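As a rough illustration of what Bayesian treatment of LM parameters can mean in practice, the sketch below gives a variational linear layer whose weights carry a learned Gaussian posterior sampled by reparameterization. The specific factorization, and where such layers sit inside an LSTM-RNN or Transformer LM, are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class BayesianLinear(nn.Module):
    """Linear layer with a Gaussian posterior over weights instead of a
    point estimate; each forward pass draws one weight sample."""
    def __init__(self, n_in: int, n_out: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(n_out, n_in))
        self.logvar = nn.Parameter(torch.full((n_out, n_in), -6.0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        std = torch.exp(0.5 * self.logvar)
        weight = self.mu + std * torch.randn_like(std)  # reparameterization trick
        # In training, a KL term between this posterior and a prior would be
        # added to the loss; it is omitted in this sketch.
        return x @ weight.t()
```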
- Efficient Training of Audio Transformers with Patchout (arXiv 2021-10-11)
We propose a novel method to optimize and regularize transformers on audio spectrograms.
The proposed models achieve new state-of-the-art performance on AudioSet and can be trained on a single consumer-grade GPU.
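A hedged sketch of the core Patchout operation: during training, a random fraction of the spectrogram patch tokens is dropped before the transformer, which both regularizes the model and shortens the attention sequence (hence the efficiency). The function name and drop rate below are illustrative assumptions.

```python
import torch

def patchout(tokens: torch.Tensor, drop_prob: float = 0.5) -> torch.Tensor:
    """Randomly keep a subset of patch tokens. tokens: (batch, patches, dim)."""
    batch, n, dim = tokens.shape
    keep = max(1, int(n * (1.0 - drop_prob)))
    # Random permutation per example; keep the first `keep` indices.
    idx = torch.rand(batch, n, device=tokens.device).argsort(dim=1)[:, :keep]
    return tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, dim))
```

Because self-attention cost is quadratic in sequence length, halving the token count roughly quarters the attention FLOPs during training.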
- A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP (arXiv 2021-08-30)
Convolutional neural networks (CNNs) are the dominant deep neural network (DNN) architecture for computer vision.
Transformer- and multi-layer perceptron (MLP)-based models, such as Vision Transformer and MLP-Mixer, have started to lead new trends.
In this paper, we conduct empirical studies on these DNN structures and try to understand their respective pros and cons.
- Container: Context Aggregation Network (arXiv 2021-06-02)
Recent findings show that a simple MLP-based solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, named CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
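The unifying idea can be sketched as mixing a dynamic, attention-style affinity matrix with a static, learned one under a learnable coefficient. Everything below, including the names, the softmax placement, and the scalar mix, is an assumption for illustration rather than the paper's actual code.

```python
import torch
import torch.nn as nn

class ContextAggregation(nn.Module):
    """Aggregate context with a learnable blend of dynamic (attention-like)
    and static (convolution-like) affinity matrices."""
    def __init__(self, dim: int, seq_len: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.static_affinity = nn.Parameter(torch.zeros(seq_len, seq_len))
        self.alpha = nn.Parameter(torch.tensor(0.5))   # mixing coefficient

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        dyn = torch.softmax(q @ k.transpose(-2, -1) / q.size(-1) ** 0.5, dim=-1)
        sta = torch.softmax(self.static_affinity, dim=-1)
        affinity = self.alpha * dyn + (1.0 - self.alpha) * sta
        return affinity @ v
```

With alpha fixed at 1 this reduces to plain self-attention; with alpha fixed at 0 it behaves like a learned, input-independent mixing akin to convolution or an MLP mixer.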
- TransfoRNN: Capturing the Sequential Information in Self-Attention Representations for Language Modeling (arXiv 2021-04-04)
We propose to cascade recurrent neural networks onto Transformers, referred to as the TransfoRNN model, to capture sequential information.
We found that TransfoRNN models consisting of only a shallow Transformer stack suffice to give comparable, if not better, performance.
- ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context (arXiv 2020-05-07)
We propose a novel CNN-RNN-transducer architecture, which we call ContextNet.
ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules.
We demonstrate that ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets.
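The squeeze-and-excitation idea ContextNet adds to its convolution layers can be sketched as follows: global-average-pool the time axis into a context vector, derive per-channel gates from it, and rescale the features. The reduction ratio and shapes below are illustrative, not ContextNet's exact configuration.

```python
import torch
import torch.nn as nn

class SqueezeExcite1d(nn.Module):
    """Squeeze-and-excitation over the time axis of a 1D conv feature map."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, channels, time)
        context = x.mean(dim=-1)                # squeeze: global context over time
        gate = self.fc(context).unsqueeze(-1)   # excite: per-channel weights
        return x * gate                         # rescale with global information
```

This is how a fully convolutional encoder, whose receptive field is otherwise local, gets access to utterance-level context.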
- End-to-End Multi-speaker Speech Recognition with Transformer (arXiv 2020-02-10)
We replace the RNN-based encoder-decoder in the speech recognition model with a Transformer architecture.
We also modify the self-attention component to be restricted to a segment rather than the whole sequence in order to reduce computation.
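A hedged sketch of the segment-restricted self-attention mentioned above: each position attends only within its own fixed-size segment, so attention cost no longer grows with the full sequence length. The mask convention (True = blocked) matches PyTorch's nn.MultiheadAttention; the segment size is an illustrative assumption.

```python
import torch

def segment_mask(seq_len: int, segment: int) -> torch.Tensor:
    """Boolean attention mask restricting each position to its own segment."""
    pos = torch.arange(seq_len)
    same = (pos.unsqueeze(0) // segment) == (pos.unsqueeze(1) // segment)
    return ~same   # True where attention is blocked

# Usage with nn.MultiheadAttention (batch_first=True):
#   attn(x, x, x, attn_mask=segment_mask(x.size(1), 64))
```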
This list is automatically generated from the titles and abstracts of the papers on this site.