The Conformer Encoder May Reverse the Time Dimension
- URL: http://arxiv.org/abs/2410.00680v2
- Date: Wed, 15 Jan 2025 15:18:25 GMT
- Title: The Conformer Encoder May Reverse the Time Dimension
- Authors: Robin Schmitt, Albert Zeyer, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney
- Abstract summary: We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to connect the initial frames with all other informative frames.
We propose methods and ideas of how this flipping can be avoided and investigate a novel method to obtain label-frame-position alignments.
- Score: 53.9351497436903
- Abstract: We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention-based encoder-decoder (AED) models. Further investigation shows that the Conformer encoder reverses the sequence in the time dimension. We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames. Furthermore, we show that, at some point in training, the self-attention module of the Conformer starts dominating the output over the preceding feed-forward module, which then only allows the reversed information to pass through. We propose methods and ideas of how this flipping can be avoided and investigate a novel method to obtain label-frame-position alignments by using the gradients of the label log probabilities w.r.t. the encoder input frames.
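The gradient-based alignment idea from the abstract can be sketched numerically: treat the model as a black box, compute the gradient of each label's log-probability with respect to every encoder input frame, and align the label to the frame with the largest saliency. The sketch below is only an illustration of that idea, not the paper's implementation: it uses a hypothetical toy attention model with independent per-label logistic probabilities, and finite differences stand in for backpropagation through a real trained Conformer AED. All names are illustrative.

```python
import numpy as np

def toy_label_log_probs(frames, W):
    """Hypothetical stand-in for an encoder-decoder: each label attends
    over the frames (softmax over per-frame scores) and its attended score
    is squashed into an independent log-probability (toy simplification;
    a real model uses a softmax over the vocabulary)."""
    scores = frames @ W                                   # (T, L) per-frame, per-label scores
    attn = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # softmax over time
    logits = (attn * scores).sum(axis=0)                  # (L,)
    return -np.log1p(np.exp(-logits))                     # log sigmoid per label

def gradient_alignment(frames, W, labels, eps=1e-4):
    """Align each label to the input frame whose perturbation changes the
    label's log-probability the most (finite-difference saliency; a real
    implementation would use backprop to get these gradients)."""
    T, D = frames.shape
    alignment = []
    for lab in labels:
        saliency = np.zeros(T)
        for t in range(T):
            for d in range(D):
                bump = np.zeros_like(frames)
                bump[t, d] = eps
                lp_plus = toy_label_log_probs(frames + bump, W)[lab]
                lp_minus = toy_label_log_probs(frames - bump, W)[lab]
                saliency[t] += abs((lp_plus - lp_minus) / (2 * eps))
        alignment.append(int(saliency.argmax()))
    return alignment

# Toy input: frame 2 carries the evidence for label 0, frame 1 for label 1.
frames = np.zeros((4, 3))
frames[2, 0] = 3.0
frames[1, 1] = 3.0
W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])        # (D, L) read-out weights
print(gradient_alignment(frames, W, labels=[0, 1]))
```

In this toy setup the saliency correctly recovers the evidence frames; the interesting property noted in the abstract is that the same procedure still works when the encoder has internally reversed the time dimension, because the gradients trace the information flow back to the original input frames.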
Related papers
- Aligner-Encoders: Self-Attention Transformers Can Be Self-Transducers [14.91083492000769]
We show that the transformer-based encoder adopted in recent years is capable of performing the alignment internally during the forward pass.
This new phenomenon enables a simpler and more efficient model, the "Aligner-Encoder"
We conduct experiments demonstrating performance remarkably close to the state of the art.
arXiv Detail & Related papers (2025-02-06T22:09:52Z) - Lightweight Transducer Based on Frame-Level Criterion [14.518972562566642]
We propose a lightweight transducer model based on frame-level criterion, which uses the results of the CTC forced alignment algorithm to determine the label for each frame.
To address the problem of imbalanced classification caused by excessive blanks in the label, we decouple the blank and non-blank probabilities.
Experiments on AISHELL-1 demonstrate that this enables the lightweight transducer to achieve results similar to the standard transducer.
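The decoupling described in this blurb can be illustrated with a small sketch: instead of a single softmax over {blank} ∪ tokens, model the blank probability with its own sigmoid and distribute the remaining mass over the non-blank tokens. This is one plausible reading of "decouple the blank and non-blank probabilities"; the paper's actual parameterization may differ, and all names below are illustrative.

```python
import numpy as np

def decoupled_frame_distribution(blank_logit, token_logits):
    """Per-frame output distribution with blank modeled separately:
    a sigmoid gives P(blank); a softmax over non-blank tokens splits
    the remaining (1 - P(blank)) mass, so the full distribution still
    sums to one."""
    p_blank = 1.0 / (1.0 + np.exp(-blank_logit))
    e = np.exp(token_logits - token_logits.max())  # numerically stable softmax
    p_tokens = (1.0 - p_blank) * e / e.sum()
    return p_blank, p_tokens

# A frame dominated by blank: large blank logit, small token logits.
p_blank, p_tokens = decoupled_frame_distribution(2.0, np.array([0.5, 1.5, -0.3]))
print(p_blank, p_tokens, p_blank + p_tokens.sum())
```

Separating the blank head this way is a common remedy when blanks vastly outnumber non-blank frames, since the token softmax no longer has to compete with the blank class.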
arXiv Detail & Related papers (2024-09-05T02:24:18Z) - Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition [9.803556181225193]
The Conformer, as a backbone network for end-to-end automatic speech recognition, has achieved state-of-the-art performance.
However, the Conformer-based model encounters an issue with the self-attention mechanism.
We introduce a key frame-based self-attention (KFSA) mechanism, a novel method to reduce the computation of the self-attention mechanism using key frames.
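The computational saving of attending only to key frames can be sketched as ordinary scaled dot-product attention whose keys and values are restricted to a selected index set, shrinking the score matrix from (T, T) to (T, n_key). How the key frames are chosen is the paper's contribution and is not reproduced here; the sketch simply assumes the indices are given.

```python
import numpy as np

def key_frame_attention(Q, K, V, key_idx):
    """Scaled dot-product attention where every query attends only to
    the selected key frames instead of all T frames."""
    Kk, Vk = K[key_idx], V[key_idx]              # keep only key frames
    scores = Q @ Kk.T / np.sqrt(Q.shape[-1])     # (T, n_key) instead of (T, T)
    scores -= scores.max(axis=-1, keepdims=True) # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ Vk                             # (T, d) attended output

T, d = 6, 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((T, d))
K = rng.standard_normal((T, d))
V = rng.standard_normal((T, d))
out = key_frame_attention(Q, K, V, key_idx=[0, 3, 5])
print(out.shape)
```

With n_key ≪ T, the score computation drops from O(T²·d) to O(T·n_key·d), which is the efficiency gain the blurb refers to.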
arXiv Detail & Related papers (2023-10-23T13:55:49Z) - Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving [74.28510044056706]
Existing methods usually adopt the decoupled encoder-decoder paradigm.
In this work, we aim to alleviate the problem by two principles.
We first predict a coarse-grained future position and action based on the encoder features.
Then, conditioned on the position and action, the future scene is imagined to check the ramification if we drive accordingly.
arXiv Detail & Related papers (2023-05-10T15:22:02Z) - Learning to Drop Out: An Adversarial Approach to Training Sequence VAEs [16.968490007064872]
Applying variational autoencoders (VAEs) to sequential data offers a method for controlled sequence generation, manipulation, and structured representation learning.
We show theoretically that dropping out the decoder input removes the pointwise mutual information it provides, which is compensated for by utilizing the latent space.
Compared to uniform dropout on standard text benchmark datasets, our targeted approach increases both sequence performance and the information captured in the latent space.
arXiv Detail & Related papers (2022-09-26T11:21:19Z) - When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition [57.51793420986745]
We propose an unconventional network for handwritten mathematical expression recognition (HMER) named Counting-Aware Network (CAN).
We design a weakly-supervised counting module that can predict the number of each symbol class without the symbol-level position annotations.
Experiments on the benchmark datasets for HMER validate that both joint optimization and counting results are beneficial for correcting the prediction errors of encoder-decoder models.
arXiv Detail & Related papers (2022-07-23T08:39:32Z) - Decoder Fusion RNN: Context and Interaction Aware Decoders for Trajectory Prediction [53.473846742702854]
We propose a recurrent, attention-based approach for motion forecasting.
Decoder Fusion RNN (DF-RNN) is composed of a recurrent behavior encoder, an inter-agent multi-headed attention module, and a context-aware decoder.
We demonstrate the efficacy of our method by testing it on the Argoverse motion forecasting dataset and show its state-of-the-art performance on the public benchmark.
arXiv Detail & Related papers (2021-08-12T15:53:37Z) - Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z) - Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training sequence encoders.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.