Adding Connectionist Temporal Summarization into Conformer to Improve
Its Decoder Efficiency For Speech Recognition
- URL: http://arxiv.org/abs/2204.03889v1
- Date: Fri, 8 Apr 2022 07:24:00 GMT
- Title: Adding Connectionist Temporal Summarization into Conformer to Improve
Its Decoder Efficiency For Speech Recognition
- Authors: Nick J.C. Wang, Zongfeng Quan, Shaojun Wang, Jing Xiao
- Abstract summary: We propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames required for the attention decoder.
With a beam width of 4, the LibriSpeech decoding budget can be reduced by up to 20%.
The word error rate (WER) is reduced by 6% relative at the beam width of 1 and by 3% relative at the beam width of 4.
- Score: 22.61761934996406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Conformer model is an excellent architecture for speech recognition
modeling that effectively utilizes the hybrid losses of connectionist temporal
classification (CTC) and attention to train model parameters. To improve the
decoding efficiency of Conformer, we propose a novel connectionist temporal
summarization (CTS) method that reduces the number of acoustic frames the
attention decoder must consume from the sequence generated by the encoder,
thus reducing the number of decoder operations. However, to achieve such decoding
improvements, we
must fine-tune model parameters, as cross-attention observations are changed
and thus require corresponding refinements. Our final experiments show that,
with a beam width of 4, the LibriSpeech decoding budget can be reduced by up
to 20% and for FluentSpeech data it can be reduced by 11%, without losing ASR
accuracy. An improvement in accuracy is even found for the LibriSpeech
"test-other" set. The word error rate (WER) is reduced by 6\% relative at the
beam width of 1 and by 3% relative at the beam width of 4.
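The abstract does not specify how CTS decides which encoder frames to summarize away, so the sketch below is only a hypothetical illustration of CTC-guided frame reduction: it assumes frames whose CTC posterior is dominated by the blank label can be dropped before cross-attention. Function names, the threshold, and the drop rule are all assumptions, not the paper's algorithm; as the abstract notes, any such change to the cross-attention memory would require fine-tuning.
```python
# Minimal sketch of CTC-guided frame reduction in the spirit of CTS.
# Assumption (not from the paper): frames dominated by the CTC blank
# label carry little content and can be removed before cross-attention,
# shrinking the attention decoder's key/value memory.
import torch

def reduce_frames(encoder_out: torch.Tensor,
                  ctc_logits: torch.Tensor,
                  blank_id: int = 0,
                  blank_threshold: float = 0.95) -> torch.Tensor:
    """Keep only encoder frames whose CTC blank posterior is below a threshold.

    encoder_out: (T, D) acoustic sequence from the Conformer encoder.
    ctc_logits:  (T, V) per-frame CTC logits over the vocabulary.
    Returns a (T', D) tensor with T' <= T frames for the attention decoder.
    """
    blank_prob = ctc_logits.softmax(dim=-1)[:, blank_id]  # (T,)
    keep = blank_prob < blank_threshold                   # frames that carry content
    return encoder_out[keep]

# Toy usage: 100 encoder frames, 256-dim, vocabulary of 30 tokens.
enc = torch.randn(100, 256)
ctc = torch.randn(100, 30)
reduced = reduce_frames(enc, ctc)
print(enc.shape, "->", reduced.shape)  # fewer frames reach cross-attention
```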
Related papers
- Accelerating Codec-based Speech Synthesis with Multi-Token Prediction and Speculative Decoding [24.472393096460774]
We propose an enhanced inference method that allows for flexible trade-offs between speed and quality during inference without requiring additional training.
Our core idea is to predict multiple tokens per inference step of the AR module using multiple prediction heads.
In experiments, we demonstrate that the time required to predict each token is reduced by a factor of 4 to 5 compared to baseline models.
arXiv Detail & Related papers (2024-10-17T17:55:26Z)
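The multi-token idea above can be sketched minimally. The summary mentions multiple prediction heads on the AR module but not their architecture, so the k parallel linear heads below (and all names) are assumptions for illustration:
```python
# Hypothetical sketch of multi-token prediction: k linear heads each
# predict one of the next k tokens from the same decoder hidden state.
import torch
import torch.nn as nn

class MultiTokenHead(nn.Module):
    def __init__(self, d_model: int, vocab_size: int, k: int = 4):
        super().__init__()
        # One projection per lookahead position t+1 .. t+k.
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_size) for _ in range(k)])

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        """hidden: (B, D) last decoder state -> (B, k, V) logits for k tokens."""
        return torch.stack([head(hidden) for head in self.heads], dim=1)

# Toy usage: one AR step now proposes 4 tokens instead of 1; a verifier
# (e.g. a speculative-decoding check) can accept or reject the extra tokens.
head = MultiTokenHead(d_model=512, vocab_size=1024, k=4)
logits = head(torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 4, 1024])
```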
- A Principled Hierarchical Deep Learning Approach to Joint Image Compression and Classification [27.934109301041595]
This work proposes a three-step joint learning strategy to guide encoders to extract features that are compact, discriminative, and amenable to common augmentations/transformations.
Tests show that our proposed method achieves accuracy improvement of up to 1.5% on CIFAR-10 and 3% on CIFAR-100 over conventional E2E cross-entropy training.
arXiv Detail & Related papers (2023-10-30T15:52:18Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Learning Quantization in LDPC Decoders [14.37550972719183]
We propose a floating-point surrogate model that imitates quantization effects as additions of uniform noise.
A deep learning-based method is then applied to optimize the message bitwidths.
We report an error-rate performance within 0.2 dB of floating-point decoding at an average message quantization bitwidth of 3.1 bits.
arXiv Detail & Related papers (2022-08-10T07:07:54Z)
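The uniform-noise surrogate above can be shown generically: imitate a uniform quantizer of step size delta with additive U(-delta/2, delta/2) noise so the step size stays differentiable. How the paper ties delta to learned message bitwidths is not in the abstract, so that mapping is left out here:
```python
# Generic sketch of a uniform-noise surrogate for quantization.
import torch

def surrogate_quantize(msg: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Training-time stand-in: quantization noise modeled as uniform noise."""
    noise = (torch.rand_like(msg) - 0.5) * delta
    return msg + noise

def hard_quantize(msg: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
    """Inference-time uniform quantizer the surrogate imitates."""
    return torch.round(msg / delta) * delta

delta = torch.tensor(0.25, requires_grad=True)  # step size; in the paper this
                                                # would follow from a bitwidth
msgs = torch.randn(8)
print(surrogate_quantize(msgs, delta))  # differentiable w.r.t. delta
print(hard_quantize(msgs, delta))       # used at evaluation time
```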
- Pre-Training Transformer Decoder for End-to-End ASR Model with Unpaired Speech Data [145.95460945321253]
We introduce two pre-training tasks for the encoder-decoder network using acoustic units, i.e., pseudo codes.
The proposed Speech2C can relatively reduce the word error rate (WER) by 19.2% over the method without decoder pre-training.
arXiv Detail & Related papers (2022-03-31T15:33:56Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z)
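The linear complexity in the Performer entry above comes from kernelized attention. The sketch below uses a simple elu+1 feature map instead of Performer's actual FAVOR+ random features (an assumption for brevity), so it shows only the generic linear-attention pattern:
```python
# Kernelized linear attention: compute phi(Q) @ (phi(K)^T V) so cost is
# O(T) in sequence length, never materializing a T x T attention matrix.
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """q, k: (T, D), v: (T, Dv). Returns (T, Dv)."""
    q, k = F.elu(q) + 1, F.elu(k) + 1           # positive feature maps
    kv = k.transpose(0, 1) @ v                  # (D, Dv), one pass over time
    z = q @ k.sum(dim=0, keepdim=True).t()      # (T, 1) normalizer
    return (q @ kv) / (z + eps)

out = linear_attention(torch.randn(1000, 64), torch.randn(1000, 64),
                       torch.randn(1000, 64))
print(out.shape)  # torch.Size([1000, 64]); cost grows linearly with T
```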
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
- Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)
- Scaling Up Online Speech Recognition Using ConvNets [33.75588539732141]
We design an online end-to-end speech recognition system based on Time-Depth Separable (TDS) convolutions and Connectionist Temporal Classification (CTC).
We improve the core TDS architecture in order to limit the future context and hence reduce latency while maintaining accuracy.
The system has almost three times the throughput of a well-tuned hybrid ASR baseline while also having lower latency and a better word error rate.
arXiv Detail & Related papers (2020-01-27T12:55:02Z)
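The latency idea in the last entry, limiting future context, can be illustrated with a plain asymmetrically padded 1-D convolution. This is not the paper's TDS block, just the generic lookahead-bounding mechanism it relies on:
```python
# Sketch of bounding future context: pad asymmetrically so each output
# frame sees at most `lookahead` future frames instead of (kernel-1)/2.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LimitedLookaheadConv1d(nn.Module):
    def __init__(self, channels: int, kernel_size: int, lookahead: int):
        super().__init__()
        assert 0 <= lookahead < kernel_size
        self.left = kernel_size - 1 - lookahead  # past frames consumed
        self.right = lookahead                   # future frames consumed
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: (B, C, T) -> (B, C, T), seeing at most `lookahead` future frames."""
        x = F.pad(x, (self.left, self.right))
        return self.conv(x)

# Toy usage: kernel of 9 but only 2 frames of lookahead -> bounded latency.
layer = LimitedLookaheadConv1d(channels=80, kernel_size=9, lookahead=2)
print(layer(torch.randn(1, 80, 200)).shape)  # torch.Size([1, 80, 200])
```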