Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition
- URL: http://arxiv.org/abs/2310.14954v2
- Date: Sat, 28 Oct 2023 14:38:46 GMT
- Title: Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition
- Authors: Peng Fan, Changhao Shan, Sining Sun, Qing Yang, Jianwei Zhang
- Abstract summary: Conformer as a backbone network for end-to-end automatic speech recognition achieved state-of-the-art performance.
However, the Conformer-based model encounters an issue with the self-attention mechanism.
We introduce the key frame-based self-attention (KFSA) mechanism, a novel method to reduce the computation of the self-attention mechanism using key frames.
- Score: 9.803556181225193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Conformer as a backbone network for end-to-end automatic speech
recognition achieved state-of-the-art performance. The Conformer block
leverages a self-attention mechanism to capture global information, along with
a convolutional neural network to capture local information, resulting in
improved performance. However, the Conformer-based model encounters an issue
with the self-attention mechanism, as computational complexity grows
quadratically with the length of the input sequence. Inspired by previous
Connectionist Temporal Classification (CTC) guided blank skipping during
decoding, we introduce intermediate CTC outputs as guidance into the
downsampling procedure of the Conformer encoder. We define frames with
non-blank output as key frames. Specifically, we introduce the key frame-based
self-attention (KFSA) mechanism, a novel method to reduce the computation of
the self-attention mechanism using key frames. The structure of our proposed
approach comprises two encoders. Following the initial encoder, we introduce an
intermediate CTC loss to obtain frame-level labels, enabling us to separate
key frames from blank frames for KFSA. Furthermore, we introduce the
key frame-based downsampling (KFDS) mechanism to operate on high-dimensional
acoustic features directly and drop the frames corresponding to blank labels,
which results in new acoustic feature sequences as input to the second encoder.
The proposed method achieves comparable or higher performance than the
vanilla Conformer and other similar work such as the Efficient Conformer.
Meanwhile, it can discard more than 60% of useless frames during model
training and inference, which significantly accelerates inference. The code
for this work is available at
https://github.com/scufan1990/Key-Frame-Mechanism-For-Efficient-Conformer
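To make the mechanism concrete, below is a minimal sketch of KFDS-style key-frame selection, assuming a PyTorch two-encoder pipeline: frames whose intermediate-CTC greedy label is blank are dropped, and only the surviving key frames would be passed to the second encoder. The function name, blank index, and shapes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of key-frame selection as described in the abstract:
# after the first encoder, frames whose intermediate-CTC greedy label is
# blank are dropped; only the surviving key frames feed the second encoder.
import torch

BLANK_ID = 0  # assumed blank index; depends on the vocabulary layout

def select_key_frames(enc_out: torch.Tensor,
                      ctc_head: torch.nn.Linear) -> torch.Tensor:
    """enc_out: (T, D) first-encoder output for one utterance."""
    logits = ctc_head(enc_out)      # (T, V) intermediate CTC logits
    labels = logits.argmax(dim=-1)  # greedy per-frame labels
    key_mask = labels != BLANK_ID   # key frames = non-blank frames
    return enc_out[key_mask]        # (T', D) with T' <= T, fed to encoder 2

# toy usage: 100 frames of 256-dim features, 50-symbol vocabulary
enc_out = torch.randn(100, 256)
ctc_head = torch.nn.Linear(256, 50)
print(select_key_frames(enc_out, ctc_head).shape)
```

Because self-attention cost grows quadratically with sequence length, running the second encoder on the much shorter key-frame sequence is where the claimed savings come from.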
Related papers
- The Conformer Encoder May Reverse the Time Dimension [53.9351497436903]
We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames.
We propose several methods and ideas for how this time flipping can be avoided.
arXiv Detail & Related papers (2024-10-01T13:39:05Z)
- Skipformer: A Skip-and-Recover Strategy for Efficient Speech Recognition [7.963605445905696]
Conformer-based attention models have become the de facto backbone model for Automatic Speech Recognition tasks.
We propose a "Skip-and-Recover" Conformer architecture, named Skipformer, to squeeze sequence input length dynamically and inhomogeneously.
Our model reduces the input sequence length by a factor of 31 on the Aishell-1 corpus and 22 on Librispeech.
arXiv Detail & Related papers (2024-03-13T05:20:45Z)
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
A hybrid CTC/RNNT architecture uses a shared encoder with both CTC and RNNT decoders to boost accuracy and save computation.
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
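The activation-caching idea in the streaming entry above can be illustrated with a generic chunked-inference sketch; this is a hypothetical toy, not the paper's actual FastConformer cache design:

```python
# Generic activation caching for chunked streaming inference: each step
# attends over cached left-context activations plus the new chunk, so past
# frames are not recomputed. Shapes and sizes are illustrative.
import torch

class StreamingLayer(torch.nn.Module):
    def __init__(self, dim: int = 256, left_context: int = 64):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=4,
                                                batch_first=True)
        self.left_context = left_context

    def forward(self, chunk, cache):
        """chunk: (B, T_new, D); cache: (B, T_past, D) past activations."""
        context = torch.cat([cache, chunk], dim=1)   # past + new frames
        out, _ = self.attn(chunk, context, context)  # queries: new frames only
        new_cache = context[:, -self.left_context:]  # keep bounded left context
        return out, new_cache

# usage: carry the cache across successive chunks of a stream
layer = StreamingLayer()
cache = torch.zeros(1, 0, 256)
for _ in range(3):
    out, cache = layer(torch.randn(1, 16, 256), cache)
```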
- IBVC: Interpolation-driven B-frame Video Compression [68.18440522300536]
B-frame video compression aims to adopt bi-directional motion estimation and motion compensation (MEMC) coding for middle frame reconstruction.
Previous learned approaches often directly extend neural P-frame codecs to B-frame relying on bi-directional optical-flow estimation.
We propose a simple yet effective structure called Interpolation-B-frame Video Compression (IBVC) to address these issues.
arXiv Detail & Related papers (2023-09-25T02:45:51Z)
- Unimodal Aggregation for CTC-based Speech Recognition [7.6112706449833505]
A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token.
UMA learns better feature representations and shortens the sequence length, resulting in lower recognition error and computational complexity.
arXiv Detail & Related papers (2023-09-15T04:34:40Z)
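The sequence-shortening goal of the UMA entry above can be illustrated by a simpler stand-in: merging consecutive frames that share the same greedy per-frame label. This is a hypothetical sketch of frame aggregation in general, not UMA's actual learned unimodal-weight mechanism:

```python
# Merge each run of consecutive identical per-frame labels into a single
# averaged frame, shortening the sequence the way aggregation methods do.
import torch

def merge_same_label_frames(feats: torch.Tensor,
                            labels: torch.Tensor) -> torch.Tensor:
    """feats: (T, D); labels: (T,) greedy per-frame labels."""
    change = torch.ones_like(labels, dtype=torch.bool)
    change[1:] = labels[1:] != labels[:-1]           # run boundaries
    seg_id = torch.cumsum(change.long(), dim=0) - 1  # (T,) segment index
    n_seg = int(seg_id[-1]) + 1
    sums = torch.zeros(n_seg, feats.size(1))
    counts = torch.zeros(n_seg, 1)
    sums.index_add_(0, seg_id, feats)                # sum frames per segment
    counts.index_add_(0, seg_id, torch.ones(feats.size(0), 1))
    return sums / counts                             # mean frame per segment

feats = torch.randn(10, 4)
labels = torch.tensor([0, 0, 5, 5, 5, 0, 7, 7, 0, 0])
print(merge_same_label_frames(feats, labels).shape)  # torch.Size([5, 4])
```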
- Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task that can increase the frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z)
- Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition [46.69852287267763]
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems.
The proposed method significantly reduces recognition errors and emission latency simultaneously.
The best MoChA system shows performance comparable to that of an RNN-transducer (RNN-T).
arXiv Detail & Related papers (2021-02-28T08:17:38Z)
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus, respectively.
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
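Since the main paper builds its key-frame guidance on the intermediate CTC loss from the entry above, a minimal sketch of that regularizer may help: an auxiliary CTC loss on an intermediate encoder layer is interpolated with the final-layer loss. The function names are illustrative, and the mixing weight w is a hyperparameter (0.3 is a commonly reported choice):

```python
# Intermediate CTC loss regularization: an auxiliary CTC loss on an
# intermediate encoder layer is interpolated with the final-layer loss.
import torch
import torch.nn.functional as F

def inter_ctc_loss(inter_logits, final_logits, targets,
                   input_lens, target_lens, w: float = 0.3):
    """logits: (T, N, V) raw scores; targets: (N, S); lens: (N,)."""
    loss_inter = F.ctc_loss(inter_logits.log_softmax(-1), targets,
                            input_lens, target_lens)
    loss_final = F.ctc_loss(final_logits.log_softmax(-1), targets,
                            input_lens, target_lens)
    return (1 - w) * loss_final + w * loss_inter
```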
- End-to-End Learning for Video Frame Compression with Self-Attention [25.23586503813838]
We propose an end-to-end learned system for compressing video frames.
Our system learns deep embeddings of frames and encodes their difference in latent space.
In our experiments, we show that the proposed system achieves high compression rates and high objective visual quality.
arXiv Detail & Related papers (2020-04-20T12:11:08Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)