The Conformer Encoder May Reverse the Time Dimension
- URL: http://arxiv.org/abs/2410.00680v1
- Date: Tue, 1 Oct 2024 13:39:05 GMT
- Title: The Conformer Encoder May Reverse the Time Dimension
- Authors: Robin Schmitt, Albert Zeyer, Mohammad Zeineldeen, Ralf Schlüter, Hermann Ney
- Abstract summary: We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames.
We propose several methods and ideas for how this flipping can be avoided.
- Score: 53.9351497436903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We sometimes observe monotonically decreasing cross-attention weights in our Conformer-based global attention-based encoder-decoder (AED) models. Further investigation shows that the Conformer encoder internally reverses the sequence in the time dimension. We analyze the initial behavior of the decoder cross-attention mechanism and find that it encourages the Conformer encoder self-attention to build a connection between the initial frames and all other informative frames. Furthermore, we show that, at some point in training, the self-attention module of the Conformer starts dominating the output over the preceding feed-forward module, which then only allows the reversed information to pass through. We propose several methods and ideas of how this flipping can be avoided. Additionally, we investigate a novel method to obtain label-frame-position alignments by using the gradients of the label log probabilities w.r.t. the encoder input frames.
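The abstract's last point, obtaining label-to-frame alignments from the gradients of the label log probabilities w.r.t. the encoder input frames, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the model interface (teacher-forced per-label log probabilities of shape (S, V)), the tensor shapes, and the per-frame argmax heuristic are assumptions made here for illustration only.

```python
# Hedged sketch of gradient-based label-frame alignment (illustrative assumptions,
# not the paper's code).
import torch


def gradient_based_alignment(model, features, targets):
    """For each output label, return the input frame whose gradient of the
    label's log probability has the largest magnitude.

    features: (T, F) acoustic input frames.
    targets:  (S,) ground-truth label indices.
    model:    assumed to return log probabilities of shape (1, S, V) under
              teacher forcing when given (1, T, F) features and (1, S) labels.
    """
    features = features.clone().detach().requires_grad_(True)

    # Teacher-forced forward pass; this interface is an assumption.
    log_probs = model(features.unsqueeze(0), targets.unsqueeze(0))  # (1, S, V)

    alignment = []
    for s, label in enumerate(targets):
        # Reset accumulated input gradients between labels.
        features.grad = None
        # Gradient of this label's log probability w.r.t. every input frame.
        log_probs[0, s, label].backward(retain_graph=True)
        saliency = features.grad.abs().sum(dim=-1)   # (T,) per-frame relevance
        alignment.append(int(saliency.argmax()))      # frame most responsible for label s
    return alignment
```

Summing absolute gradients over the feature dimension is only one possible per-frame relevance measure; variants such as gradient-times-input would fit the same scheme.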
Related papers
- Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition [9.803556181225193]
Conformer as a backbone network for end-to-end automatic speech recognition achieved state-of-the-art performance.
However, the Conformer-based model encounters an issue with the self-attention mechanism.
We introduce key frame-based self-attention (KFSA) mechanism, a novel method to reduce the computation of the self-attention mechanism using key frames.
arXiv Detail & Related papers (2023-10-23T13:55:49Z)
- Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition [42.04873382667665]
We study a streamable attention-based encoder-decoder model in which either the decoder, or both the encoder and decoder, operate on pre-defined, fixed-size windows called chunks.
A special end-of-chunk symbol advances from one chunk to the next chunk, effectively replacing the conventional end-of-sequence symbol.
We find that our model maintains competitive performance compared to the non-streamable variant and generalizes very well to long-form speech.
arXiv Detail & Related papers (2023-09-15T14:36:24Z)
- Think Twice before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving [74.28510044056706]
Existing methods usually adopt the decoupled encoder-decoder paradigm.
In this work, we aim to alleviate the problem by two principles.
We first predict a coarse-grained future position and action based on the encoder features.
Then, conditioned on the position and action, the future scene is imagined to check the ramification if we drive accordingly.
arXiv Detail & Related papers (2023-05-10T15:22:02Z)
- When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition [57.51793420986745]
We propose an unconventional network for handwritten mathematical expression recognition (HMER) named Counting-Aware Network (CAN).
We design a weakly-supervised counting module that can predict the number of each symbol class without the symbol-level position annotations.
Experiments on the benchmark datasets for HMER validate that both joint optimization and counting results are beneficial for correcting the prediction errors of encoder-decoder models.
arXiv Detail & Related papers (2022-07-23T08:39:32Z)
- Sparsity and Sentence Structure in Encoder-Decoder Attention of Summarization Systems [38.672160430296536]
Transformer models have achieved state-of-the-art results in a wide range of NLP tasks including summarization.
Previous work has focused on one important bottleneck, the quadratic self-attention mechanism in the encoder.
This work focuses on the transformer's encoder-decoder attention mechanism.
arXiv Detail & Related papers (2021-09-08T19:32:42Z)
- Decoder Fusion RNN: Context and Interaction Aware Decoders for Trajectory Prediction [53.473846742702854]
We propose a recurrent, attention-based approach for motion forecasting.
Decoder Fusion RNN (DF-RNN) is composed of a recurrent behavior encoder, an inter-agent multi-headed attention module, and a context-aware decoder.
We demonstrate the efficacy of our method by testing it on the Argoverse motion forecasting dataset and show its state-of-the-art performance on the public benchmark.
arXiv Detail & Related papers (2021-08-12T15:53:37Z)
- Autoencoding Variational Autoencoder [56.05008520271406]
We study the implications of this behaviour on the learned representations and also the consequences of fixing it by introducing a notion of self consistency.
We show that encoders trained with our self-consistency approach lead to representations that are robust (insensitive) to perturbations in the input introduced by adversarial attacks.
arXiv Detail & Related papers (2020-12-07T14:16:14Z)
- Cross-Thought for Sentence Encoder Pre-training [89.32270059777025]
Cross-Thought is a novel approach to pre-training sequence encoders.
We train a Transformer-based sequence encoder over a large set of short sequences.
Experiments on question answering and textual entailment tasks demonstrate that our pre-trained encoder can outperform state-of-the-art encoders.
arXiv Detail & Related papers (2020-10-07T21:02:41Z)