Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End
Speech Recognition
- URL: http://arxiv.org/abs/2208.07657v1
- Date: Tue, 16 Aug 2022 10:40:15 GMT
- Title: Uconv-Conformer: High Reduction of Input Sequence Length for End-to-End
Speech Recognition
- Authors: Andrei Andrusenko, Rauf Nasretdinov, Aleksei Romanenko
- Abstract summary: The work proposes a new Uconv-Conformer architecture based on the standard Conformer model.
We use upsampling blocks similar to the U-Net architecture to ensure the correct CTC loss calculation and stabilize network training.
The Uconv-Conformer architecture is not only faster in training and inference but also achieves a lower WER than the baseline Conformer.
- Score: 3.3627327936627416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Optimization of modern ASR architectures is a high-priority task, since it
saves substantial computational resources during model training and inference.
This work proposes a new Uconv-Conformer architecture, based on the standard
Conformer model, that progressively reduces the input sequence length by a
factor of 16, which speeds up the intermediate layers. To
solve the convergence problem with such a significant reduction of the time
dimension, we use upsampling blocks similar to the U-Net architecture to ensure
the correct CTC loss calculation and stabilize network training. The
Uconv-Conformer architecture is not only faster in training and inference but
also achieves a lower WER than the baseline Conformer. Our best Uconv-Conformer
model reduced epoch training time by 40.3% and accelerated inference by 47.8%
on the CPU and 23.5% on the GPU. Relative WER on LibriSpeech test_clean and
test_other decreased by 7.3% and 9.2%, respectively.
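To make the core idea concrete, here is a minimal, hypothetical PyTorch-style sketch (module names, layer counts, and fusion choices are ours, not taken from the paper's code): strided convolutions shrink the time axis by 16x before the intermediate blocks, and U-Net-style transposed convolutions with skip connections partially restore the time resolution so the CTC loss has enough output frames. Plain Transformer layers stand in for the Conformer blocks.

```python
import torch
import torch.nn as nn


class DownBlock(nn.Module):
    """Halves the time dimension with a strided 1-D convolution."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size=3, stride=2, padding=1)
        self.act = nn.ReLU()

    def forward(self, x):                       # x: (batch, channels, time)
        return self.act(self.conv(x))


class UpBlock(nn.Module):
    """Doubles the time dimension and fuses a U-Net-style skip connection."""
    def __init__(self, dim):
        super().__init__()
        self.deconv = nn.ConvTranspose1d(dim, dim, kernel_size=2, stride=2)
        self.act = nn.ReLU()

    def forward(self, x, skip):
        x = self.act(self.deconv(x))
        x = x[..., : skip.size(-1)]             # crop to the skip length
        return x + skip                         # additive fusion (illustrative choice)


class UconvEncoderSketch(nn.Module):
    """Toy encoder: 16x time reduction, intermediate layers, partial upsampling for CTC."""
    def __init__(self, feat_dim=80, model_dim=256, vocab_size=1000, num_layers=4):
        super().__init__()
        dims = [feat_dim, model_dim, model_dim, model_dim, model_dim]
        self.down = nn.ModuleList(DownBlock(dims[i], dims[i + 1]) for i in range(4))
        # Stand-in for Conformer blocks: plain Transformer encoder layers.
        layer = nn.TransformerEncoderLayer(model_dim, nhead=4, batch_first=True)
        self.middle = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Upsample time/16 -> time/4 so CTC sees enough frames per output token.
        self.up = nn.ModuleList(UpBlock(model_dim) for _ in range(2))
        self.ctc_head = nn.Linear(model_dim, vocab_size)

    def forward(self, feats):                   # feats: (batch, time, feat_dim)
        x = feats.transpose(1, 2)               # -> (batch, feat_dim, time)
        skips = []
        for block in self.down:
            x = block(x)
            skips.append(x)                     # time/2, time/4, time/8, time/16
        y = self.middle(x.transpose(1, 2))      # heavy layers run at time/16
        y = y.transpose(1, 2)
        for block, skip in zip(self.up, [skips[2], skips[1]]):
            y = block(y, skip)                  # time/16 -> time/8 -> time/4
        return self.ctc_head(y.transpose(1, 2))  # (batch, time/4, vocab) CTC logits


logits = UconvEncoderSketch()(torch.randn(2, 320, 80))   # -> torch.Size([2, 80, 1000])
```

The actual Uconv-Conformer may differ in the number of downsampling/upsampling stages, the fusion operator, and where the losses attach; the sketch only illustrates why the intermediate layers become cheaper (they run on a 16x shorter sequence) and how upsampling restores frames for the CTC loss.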
Related papers
- Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks [10.229120811024162]
The resource requirements of deep neural networks (DNNs) pose significant challenges to their deployment on edge devices.
Common approaches to address this issue are pruning and mixed-precision quantization.
We propose a novel methodology to apply them jointly via a lightweight gradient-based search.
arXiv Detail & Related papers (2024-07-01T08:07:02Z) - Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z) - Efficient Neural Net Approaches in Metal Casting Defect Detection [0.0]
This research proposes a lightweight architecture that offers a good trade-off between accuracy and inference time.
Our results indicate that a custom model with 590K parameters and depth-wise separable convolutions outperformed pretrained architectures (a brief sketch of depth-wise separable convolutions appears after this list).
arXiv Detail & Related papers (2022-08-08T13:54:36Z) - SmoothNets: Optimizing CNN architecture design for differentially
private deep learning [69.10072367807095]
DP-SGD requires clipping and noising of per-sample gradients (a toy sketch of this mechanism appears after this list).
This introduces a reduction in model utility compared to non-private training.
We distilled a new model architecture termed SmoothNet, which is characterised by increased robustness to the challenges of DP-SGD training.
arXiv Detail & Related papers (2022-05-09T07:51:54Z) - Pruning In Time (PIT): A Lightweight Network Architecture Optimizer for
Temporal Convolutional Networks [20.943095081056857]
Temporal Convolutional Networks (TCNs) are promising Deep Learning models for time-series processing tasks.
We propose an automatic dilation optimizer, which treats the problem as weight pruning along the time axis and learns dilation factors together with the weights in a single training run (a loose illustrative sketch appears after this list).
arXiv Detail & Related papers (2022-03-28T14:03:16Z) - Optimization Planning for 3D ConvNets [123.43419144051703]
It is not trivial to optimally train 3D Convolutional Neural Networks (3D ConvNets) due to the high complexity and the many options of the training scheme.
We decompose the path into a series of training "states" and specify the hyperparameters, e.g., the learning rate and the length of input clips, in each state.
We perform dynamic programming over all the candidate states to plan the optimal permutation of states, i.e., the optimization path (an illustrative sketch appears after this list).
arXiv Detail & Related papers (2022-01-11T16:13:31Z) - Efficient conformer: Progressive downsampling and grouped attention for
automatic speech recognition [2.6346614942667235]
We study how to reduce the Conformer architecture complexity with a limited computing budget.
We introduce progressive downsampling to the Conformer encoder and propose a novel attention mechanism named grouped attention (an illustrative sketch appears after this list).
Within the same computing budget, the proposed architecture achieves better performances with faster training and decoding.
arXiv Detail & Related papers (2021-08-31T07:48:06Z) - An Information Theory-inspired Strategy for Automatic Network Pruning [88.51235160841377]
Deep convolutional neural networks usually have to be compressed before deployment on devices with resource constraints.
Most existing network pruning methods require laborious human efforts and prohibitive computation resources.
We propose an information theory-inspired strategy for automatic model compression.
arXiv Detail & Related papers (2021-08-19T07:03:22Z) - EfficientNetV2: Smaller Models and Faster Training [91.77432224225221]
This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models.
We use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency.
Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller.
arXiv Detail & Related papers (2021-04-01T07:08:36Z) - Hybrid In-memory Computing Architecture for the Training of Deep Neural
Networks [5.050213408539571]
We propose a hybrid in-memory computing architecture for the training of deep neural networks (DNNs) on hardware accelerators.
We show that HIC-based training yields an inference model about 50% smaller while achieving accuracy comparable to the baseline.
Our simulations indicate HIC-based training naturally ensures that the number of write-erase cycles seen by the devices is a small fraction of the endurance limit of PCM.
arXiv Detail & Related papers (2021-02-10T05:26:27Z) - FBNetV3: Joint Architecture-Recipe Search using Predictor Pretraining [65.39532971991778]
We present an accuracy predictor that scores architecture and training recipes jointly, guiding both sample selection and ranking.
We run fast evolutionary searches in just CPU minutes to generate architecture-recipe pairs for a variety of resource constraints.
FBNetV3 forms a family of state-of-the-art compact neural networks that outperform both automatically and manually designed competitors.
arXiv Detail & Related papers (2020-06-03T05:20:21Z)