Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition
- URL: http://arxiv.org/abs/2310.14954v2
- Date: Sat, 28 Oct 2023 14:38:46 GMT
- Title: Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition
- Authors: Peng Fan, Changhao Shan, Sining Sun, Qing Yang, Jianwei Zhang
- Abstract summary: Conformer as a backbone network for end-to-end automatic speech recognition achieved state-of-the-art performance.
However, the Conformer-based model encounters an issue with the self-attention mechanism.
We introduce the key frame-based self-attention (KFSA) mechanism, a novel method to reduce the computation of the self-attention mechanism using key frames.
- Score: 9.803556181225193
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Conformer as a backbone network for end-to-end automatic speech
recognition achieved state-of-the-art performance. The Conformer block
leverages a self-attention mechanism to capture global information, along with
a convolutional neural network to capture local information, resulting in
improved performance. However, the Conformer-based model encounters an issue
with the self-attention mechanism, as computational complexity grows
quadratically with the length of the input sequence. Inspired by previous
Connectionist Temporal Classification (CTC) guided blank skipping during
decoding, we introduce intermediate CTC outputs as guidance into the
downsampling procedure of the Conformer encoder. We define frames with
non-blank output as key frames. Specifically, we introduce the key frame-based
self-attention (KFSA) mechanism, a novel method to reduce the computation of
the self-attention mechanism using key frames. The structure of our proposed
approach comprises two encoders. Following the initial encoder, we introduce an
intermediate CTC loss to obtain frame-level labels, enabling us to separate
key frames from blank frames for KFSA. Furthermore, we introduce the
key frame-based downsampling (KFDS) mechanism to operate on high-dimensional
acoustic features directly and drop the frames corresponding to blank labels,
which results in new acoustic feature sequences as input to the second encoder.
The proposed method achieves comparable or higher performance than the
vanilla Conformer and other similar work such as the Efficient Conformer.
Meanwhile, it can discard more than 60% of useless frames during model
training and inference, which significantly accelerates inference. The code
for this work is available at
https://github.com/scufan1990/Key-Frame-Mechanism-For-Efficient-Conformer
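To make the mechanism concrete, below is a minimal sketch of KFDS-style key-frame selection, assuming a PyTorch two-encoder pipeline: frames whose intermediate-CTC greedy label is blank are dropped, and only the surviving key frames would be passed to the second encoder. The function name, blank index, and shapes are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of key-frame selection as described in the abstract:
# after the first encoder, frames whose intermediate-CTC greedy label is
# blank are dropped; only the surviving key frames feed the second encoder.
import torch

BLANK_ID = 0  # assumed blank index; depends on the vocabulary layout

def select_key_frames(enc_out: torch.Tensor,
                      ctc_head: torch.nn.Linear) -> torch.Tensor:
    """enc_out: (T, D) first-encoder output for one utterance."""
    logits = ctc_head(enc_out)      # (T, V) intermediate CTC logits
    labels = logits.argmax(dim=-1)  # greedy per-frame labels
    key_mask = labels != BLANK_ID   # key frames = non-blank frames
    return enc_out[key_mask]        # (T', D) with T' <= T, fed to encoder 2

# toy usage: 100 frames of 256-dim features, 50-symbol vocabulary
enc_out = torch.randn(100, 256)
ctc_head = torch.nn.Linear(256, 50)
print(select_key_frames(enc_out, ctc_head).shape)
```

Because self-attention cost grows quadratically with sequence length, running the second encoder on the much shorter key-frame sequence is where the claimed savings come from.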
Related papers
- The Conformer Encoder May Reverse the Time Dimension [53.9351497436903]
We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames.
We propose several methods and ideas for how this time flipping can be avoided.
arXiv Detail & Related papers (2024-10-01T13:39:05Z)
- Skipformer: A Skip-and-Recover Strategy for Efficient Speech Recognition [7.963605445905696]
Conformer-based attention models have become the de facto backbone model for Automatic Speech Recognition tasks.
We propose a "Skip-and-Recover" Conformer architecture, named Skipformer, to squeeze sequence input length dynamically and inhomogeneously.
Our model reduces the input sequence length by a factor of 31 on the Aishell-1 corpus and 22 on Librispeech.
arXiv Detail & Related papers (2024-03-13T05:20:45Z)
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
A hybrid CTC/RNNT architecture uses a shared encoder with both CTC and RNNT decoders to boost accuracy and save computation.
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
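The activation-caching idea in the streaming entry above can be illustrated with a generic chunked-inference sketch; this is a hypothetical toy, not the paper's actual FastConformer cache design:

```python
# Generic activation caching for chunked streaming inference: each step
# attends over cached left-context activations plus the new chunk, so past
# frames are not recomputed. Shapes and sizes are illustrative.
import torch

class StreamingLayer(torch.nn.Module):
    def __init__(self, dim: int = 256, left_context: int = 64):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=4,
                                                batch_first=True)
        self.left_context = left_context

    def forward(self, chunk, cache):
        """chunk: (B, T_new, D); cache: (B, T_past, D) past activations."""
        context = torch.cat([cache, chunk], dim=1)   # past + new frames
        out, _ = self.attn(chunk, context, context)  # queries: new frames only
        new_cache = context[:, -self.left_context:]  # keep bounded left context
        return out, new_cache

# usage: carry the cache across successive chunks of a stream
layer = StreamingLayer()
cache = torch.zeros(1, 0, 256)
for _ in range(3):
    out, cache = layer(torch.randn(1, 16, 256), cache)
```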
- IBVC: Interpolation-driven B-frame Video Compression [68.18440522300536]
B-frame video compression aims to adopt bi-directional motion estimation and motion compensation (MEMC) coding for middle frame reconstruction.
Previous learned approaches often directly extend neural P-frame codecs to B-frame relying on bi-directional optical-flow estimation.
We propose a simple yet effective structure called Interpolation-B-frame Video Compression (IBVC) to address these issues.
arXiv Detail & Related papers (2023-09-25T02:45:51Z)
- Unimodal Aggregation for CTC-based Speech Recognition [7.6112706449833505]
A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token.
UMA learns better feature representations and shortens the sequence length, resulting in lower recognition error and computational complexity.
arXiv Detail & Related papers (2023-09-15T04:34:40Z)
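The sequence-shortening goal of the UMA entry above can be illustrated by a simpler stand-in: merging consecutive frames that share the same greedy per-frame label. This is a hypothetical sketch of frame aggregation in general, not UMA's actual learned unimodal-weight mechanism:

```python
# Merge each run of consecutive identical per-frame labels into a single
# averaged frame, shortening the sequence the way aggregation methods do.
import torch

def merge_same_label_frames(feats: torch.Tensor,
                            labels: torch.Tensor) -> torch.Tensor:
    """feats: (T, D); labels: (T,) greedy per-frame labels."""
    change = torch.ones_like(labels, dtype=torch.bool)
    change[1:] = labels[1:] != labels[:-1]           # run boundaries
    seg_id = torch.cumsum(change.long(), dim=0) - 1  # (T,) segment index
    n_seg = int(seg_id[-1]) + 1
    sums = torch.zeros(n_seg, feats.size(1))
    counts = torch.zeros(n_seg, 1)
    sums.index_add_(0, seg_id, feats)                # sum frames per segment
    counts.index_add_(0, seg_id, torch.ones(feats.size(0), 1))
    return sums / counts                             # mean frame per segment

feats = torch.randn(10, 4)
labels = torch.tensor([0, 0, 5, 5, 5, 0, 7, 7, 0, 0])
print(merge_same_label_frames(feats, labels).shape)  # torch.Size([5, 4])
```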
- Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task that can increase the frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z)
- Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition [46.69852287267763]
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems.
The proposed method significantly reduces recognition errors and emission latency simultaneously.
The best MoChA system shows performance comparable to that of an RNN-transducer (RNN-T).
arXiv Detail & Related papers (2021-02-28T08:17:38Z)
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus, respectively.
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
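Since the main paper builds its key-frame guidance on the intermediate CTC loss from the entry above, a minimal sketch of that regularizer may help: an auxiliary CTC loss on an intermediate encoder layer is interpolated with the final-layer loss. The function names are illustrative, and the mixing weight w is a hyperparameter (0.3 is a commonly reported choice):

```python
# Intermediate CTC loss regularization: an auxiliary CTC loss on an
# intermediate encoder layer is interpolated with the final-layer loss.
import torch
import torch.nn.functional as F

def inter_ctc_loss(inter_logits, final_logits, targets,
                   input_lens, target_lens, w: float = 0.3):
    """logits: (T, N, V) raw scores; targets: (N, S); lens: (N,)."""
    loss_inter = F.ctc_loss(inter_logits.log_softmax(-1), targets,
                            input_lens, target_lens)
    loss_final = F.ctc_loss(final_logits.log_softmax(-1), targets,
                            input_lens, target_lens)
    return (1 - w) * loss_final + w * loss_inter
```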
- End-to-End Learning for Video Frame Compression with Self-Attention [25.23586503813838]
We propose an end-to-end learned system for compressing video frames.
Our system learns deep embeddings of frames and encodes their difference in latent space.
In our experiments, we show that the proposed system achieves high compression rates and high objective visual quality.
arXiv Detail & Related papers (2020-04-20T12:11:08Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion during inference.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)