Harnessing Attention Mechanisms: Efficient Sequence Reduction using
Attention-based Autoencoders
- URL: http://arxiv.org/abs/2310.14837v1
- Date: Mon, 23 Oct 2023 11:57:44 GMT
- Title: Harnessing Attention Mechanisms: Efficient Sequence Reduction using
Attention-based Autoencoders
- Authors: Daniel Biermann, Fabrizio Palumbo, Morten Goodwin, Ole-Christoffer
Granmo
- Abstract summary: We introduce a novel attention-based method that allows for the direct manipulation of sequence lengths.
We show that the autoencoder retains all the significant information when reducing the original sequence to half its original size.
- Score: 14.25761027376296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many machine learning models use the manipulation of dimensions as a driving
force to enable models to identify and learn important features in data. In the
case of sequential data this manipulation usually happens on the token
dimension level. Despite the fact that many tasks require a change in sequence
length itself, the step of sequence length reduction usually happens out of
necessity and in a single step. As far as we are aware, no model uses the
sequence length reduction step as an additional opportunity to tune the model's
performance. In fact, sequence length manipulation as a whole seems to be an
overlooked direction. In this study we introduce a novel attention-based method
that allows for the direct manipulation of sequence lengths. To explore the
method's capabilities, we employ it in an autoencoder model. The autoencoder
reduces the input sequence to a smaller sequence in latent space. It then aims
to reproduce the original sequence from this reduced form. In this setting, we
explore the method's reduction performance for different input and latent
sequence lengths. We are able to show that the autoencoder retains all the
significant information when reducing the original sequence to half its
original size. When reducing down to as low as a quarter of its original size,
the autoencoder is still able to reproduce the original sequence with an
accuracy of around 90%.
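Although no code listing is given here, the core mechanism, shrinking (or expanding) a sequence by letting a fixed number of learned query vectors cross-attend over the input tokens, can be sketched roughly as follows. This is a minimal PyTorch illustration of that general idea only; the module names, the single attention layer per resizer, and the token-embedding setup are our assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class AttentionSequenceResizer(nn.Module):
    """Map a sequence of length L_in to a sequence of length out_len by letting
    out_len learned query vectors cross-attend over the input tokens."""

    def __init__(self, d_model: int, out_len: int, n_heads: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(out_len, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L_in, d_model) -> (batch, out_len, d_model)
        q = self.queries.unsqueeze(0).expand(x.size(0), -1, -1)
        out, _ = self.attn(q, x, x)  # each learned query summarizes the whole input
        return out


class SequenceReductionAutoencoder(nn.Module):
    """Compress a token sequence to a shorter latent sequence, then reconstruct it."""

    def __init__(self, vocab_size: int, d_model: int, in_len: int, latent_len: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = AttentionSequenceResizer(d_model, latent_len)  # L_in -> latent_len
        self.decoder = AttentionSequenceResizer(d_model, in_len)      # latent_len -> L_in
        self.out_proj = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        x = self.embed(tokens)       # (batch, L_in, d_model)
        z = self.encoder(x)          # shorter latent sequence
        x_hat = self.decoder(z)      # expanded back to the original length
        return self.out_proj(x_hat)  # per-position logits over the vocabulary


# Reconstruction objective (token-level cross-entropy), e.g.:
# logits = model(tokens)
# loss = nn.functional.cross_entropy(logits.transpose(1, 2), tokens)
```

In the setting described above, the latent length would be set to half or a quarter of the input length, matching the reduction ratios the abstract reports.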
Related papers
- CItruS: Chunked Instruction-aware State Eviction for Long Sequence Modeling [52.404072802235234]
We introduce Chunked Instruction-aware State Eviction (CItruS), a novel modeling technique that integrates the attention preferences useful for a downstream task into the eviction process of hidden states.
Our training-free method exhibits superior performance on long sequence comprehension and retrieval tasks over several strong baselines under the same memory budget.
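As a loose illustration of score-based state eviction in general (not CItruS's chunked, instruction-aware procedure), one can keep only the cached key-value pairs that received the most attention from a chosen set of queries; the sketch below uses made-up names and a plain top-k rule.

```python
import torch


def evict_kv_cache(keys, values, attn_weights, budget):
    """Keep the `budget` cached positions that received the most attention.
    keys, values: (seq_len, d); attn_weights: (num_queries, seq_len) attention
    weights from, e.g., instruction-related queries. Generic top-k eviction sketch."""
    importance = attn_weights.sum(dim=0)  # total attention mass per cached position
    keep = torch.topk(importance, k=min(budget, keys.size(0))).indices.sort().values
    return keys[keep], values[keep]
```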
arXiv Detail & Related papers (2024-06-17T18:34:58Z) - Breaking the Attention Bottleneck [0.0]
This paper develops a generative function as an attention or activation replacement.
It retains the auto-regressive character by comparing each token with the previous one.
The concept of attention replacement is distributed under the AGPL v3 license at https://gitlab.com/Bachstelze/causal_generation.
arXiv Detail & Related papers (2024-06-16T12:06:58Z)
- Parallel Decoding via Hidden Transfer for Lossless Large Language Model Acceleration [54.897493351694195]
We propose a novel parallel decoding approach, namely "hidden transfer", which decodes multiple successive tokens simultaneously in a single forward pass.
In terms of acceleration metrics, we outperform all the single-model acceleration techniques, including Medusa and Self-Speculative decoding.
arXiv Detail & Related papers (2024-04-18T09:17:06Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a fixed-length codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to compensate for the lack of long-range dependency modeling.
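For intuition, the vector-quantization step itself, replacing each hidden state with its nearest entry in a fixed-size codebook, can be sketched as below; this is a generic VQ lookup under our own naming, not LongVQ's structured-memory design.

```python
import torch


def vector_quantize(hidden, codebook):
    """hidden: (seq_len, d); codebook: (num_codes, d).
    Returns the quantized states and the chosen code indices. The summary is
    drawn from a fixed set of num_codes vectors regardless of sequence length."""
    dists = torch.cdist(hidden, codebook)  # (seq_len, num_codes) pairwise distances
    codes = dists.argmin(dim=-1)           # nearest codeword per position
    return codebook[codes], codes
```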
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Are We Using Autoencoders in a Wrong Way? [3.110260251019273]
Autoencoders are used for dimensionality reduction, anomaly detection and feature extraction.
We revisited the standard training of the undercomplete autoencoder, modifying the shape of the latent space.
We also explored the behaviour of the latent space in the case of reconstruction of a random sample from the whole dataset.
arXiv Detail & Related papers (2023-09-04T11:22:43Z)
- SequenceMatch: Imitation Learning for Autoregressive Sequence Modelling with Backtracking [60.109453252858806]
A maximum-likelihood (MLE) objective does not match a downstream use-case of autoregressively generating high-quality sequences.
We formulate sequence generation as an imitation learning (IL) problem.
This allows us to minimize a variety of divergences between the distribution of sequences generated by an autoregressive model and sequences from a dataset.
Our resulting method, SequenceMatch, can be implemented without adversarial training or architectural changes.
arXiv Detail & Related papers (2023-06-08T17:59:58Z)
- Toeplitz Neural Network for Sequence Modeling [46.04964190407727]
We show that a Toeplitz matrix-vector product trick can reduce the space-time complexity of sequence modeling to log-linear.
A lightweight sub-network called relative position encoder is proposed to generate relative position coefficients with a fixed budget of parameters.
Despite being trained on 512-token sequences, our model can extrapolate input sequence length up to 14K tokens in inference with consistent performance.
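The log-linear complexity rests on a standard trick: a Toeplitz matrix-vector product can be computed with FFTs by embedding the Toeplitz matrix in a circulant one. A sketch of that trick alone (not the paper's model) follows.

```python
import torch


def toeplitz_matvec(c, r, x):
    """Compute T @ x in O(n log n), where T is the n x n Toeplitz matrix with
    first column c and first row r (c[0] == r[0]), via a 2n circulant embedding."""
    n = x.shape[0]
    # First column of the circulant matrix that contains T in its top-left block.
    circ = torch.cat([c, torch.zeros(1, dtype=c.dtype), torch.flip(r[1:], dims=[0])])
    x_pad = torch.cat([x, torch.zeros(n, dtype=x.dtype)])
    y = torch.fft.ifft(torch.fft.fft(circ) * torch.fft.fft(x_pad))
    return y[:n].real
```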
arXiv Detail & Related papers (2023-05-08T14:49:01Z)
- DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
We present an efficient yet effective attention mechanism, namely the Dynamic Bilinear Low-Rank Attention (DBA).
DBA compresses the sequence length by input-sensitive dynamic projection matrices and achieves linear time and space complexity.
Experiments over tasks with diverse sequence length conditions show that DBA achieves state-of-the-art performance.
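As a rough stand-in for the idea of compressing the sequence length inside attention, the sketch below projects keys and values from length L down to a small rank r with weights computed from the input, keeping the cost linear in L. It is only illustrative: DBA's actual dynamic bilinear projections differ, and the layer names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankSeqAttention(nn.Module):
    """Attention whose keys/values are compressed along the sequence dimension
    to a small rank r by input-dependent mixing weights (illustrative sketch)."""

    def __init__(self, d_model: int, rank: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.mix = nn.Linear(d_model, rank)  # per-token weights for the compression

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        p = F.softmax(self.mix(x), dim=1)             # (batch, L, r), input-dependent
        k_small = torch.einsum("blr,bld->brd", p, k)  # (batch, r, d_model)
        v_small = torch.einsum("blr,bld->brd", p, v)  # (batch, r, d_model)
        scores = q @ k_small.transpose(1, 2) / k.size(-1) ** 0.5  # (batch, L, r)
        return F.softmax(scores, dim=-1) @ v_small    # (batch, L, d_model)
```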
arXiv Detail & Related papers (2022-11-24T03:06:36Z)
- Staircase Attention for Recurrent Processing of Sequences [34.53670631387504]
Staircase attention operates across the sequence (in time), recurrently processing the input by adding another step of processing.
Due to this recurrence, it can solve tasks involving tracking that conventional Transformers cannot.
It is shown to provide improved modeling power for the same size model (number of parameters) compared to self-attentive Transformers on large language modeling and dialogue tasks, yielding significant perplexity gains.
arXiv Detail & Related papers (2021-06-08T12:19:31Z)
- Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing [112.2208052057002]
We propose Funnel-Transformer, which gradually compresses the sequence of hidden states into a shorter one.
With comparable or fewer FLOPs, Funnel-Transformer outperforms the standard Transformer on a wide variety of sequence-level prediction tasks.
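The compression step can be pictured as pooling adjacent hidden states to halve the sequence length before the next block, as in the minimal sketch below; this is our simplification of the idea, not the official Funnel-Transformer implementation.

```python
import torch
import torch.nn as nn


class FunnelStage(nn.Module):
    """Halve the sequence length by mean-pooling adjacent hidden states, then
    refine the shorter sequence with a Transformer encoder block (sketch only)."""

    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.block = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, L, d_model) -> (batch, ceil(L / 2), d_model)
        pooled = nn.functional.avg_pool1d(
            h.transpose(1, 2), kernel_size=2, stride=2, ceil_mode=True
        ).transpose(1, 2)
        return self.block(pooled)
```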
arXiv Detail & Related papers (2020-06-05T05:16:23Z)
- Sequence-to-Sequence Imputation of Missing Sensor Data [1.9036571490366496]
We develop a sequence-to-sequence model for recovering missing sensor data.
A forward RNN encodes the data observed before the missing sequence and a backward RNN encodes the data observed after the missing sequence.
A decoder combines the outputs of the two encoders in a novel way to predict the missing data.
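A minimal sketch of that encoder-decoder layout, assuming GRU encoders and a simple concatenation of the two context summaries (both assumptions of ours, not the paper's exact decoding scheme), might look like this.

```python
import torch
import torch.nn as nn


class BidirectionalImputer(nn.Module):
    """One GRU encodes the readings before the gap, another encodes the readings
    after it (reversed), and a decoder GRU predicts the missing readings from the
    concatenated summaries. Names and design details are illustrative."""

    def __init__(self, n_features: int, hidden: int):
        super().__init__()
        self.fwd_enc = nn.GRU(n_features, hidden, batch_first=True)
        self.bwd_enc = nn.GRU(n_features, hidden, batch_first=True)
        self.decoder = nn.GRU(n_features, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, n_features)

    def forward(self, before, after, gap_len):
        # before: (batch, T_before, n_features); after: (batch, T_after, n_features)
        _, h_fwd = self.fwd_enc(before)                       # summary of the left context
        _, h_bwd = self.bwd_enc(torch.flip(after, dims=[1]))  # summary of the right context
        h = torch.cat([h_fwd, h_bwd], dim=-1)                 # (1, batch, 2 * hidden)
        # Feed the last observed reading at every missing step (simplest choice).
        dec_in = before[:, -1:, :].expand(-1, gap_len, -1)
        dec_out, _ = self.decoder(dec_in, h)
        return self.out(dec_out)                              # (batch, gap_len, n_features)
```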
arXiv Detail & Related papers (2020-02-25T09:51:20Z)