SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive
Connection
- URL: http://arxiv.org/abs/2003.09833v3
- Date: Tue, 29 Sep 2020 08:01:23 GMT
- Title: SAC: Accelerating and Structuring Self-Attention via Sparse Adaptive
Connection
- Authors: Xiaoya Li, Yuxian Meng, Mingxin Zhou, Qinghong Han, Fei Wu and Jiwei
Li
- Abstract summary: We present a method for accelerating and structuring self-attention: Sparse Adaptive Connection (SAC).
In SAC, we regard the input sequence as a graph and attention operations are performed between linked nodes.
We show that SAC is competitive with state-of-the-art models while significantly reducing memory cost.
- Score: 51.376723069962
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While the self-attention mechanism is widely used across a variety
of tasks, it has the unfortunate property of a quadratic cost with respect to
the input length, which makes it difficult to handle long inputs. In this
paper, we present a method for accelerating and structuring self-attention:
Sparse Adaptive Connection (SAC). In SAC, we regard the input sequence as a
graph, and attention operations are performed between linked nodes. In contrast
with previous self-attention models with pre-defined structures (edges), the
model learns to construct attention edges to improve task-specific performance.
In this way, the model is able to select the most salient nodes and reduce the
quadratic complexity regardless of the sequence length. Based on SAC, we show
that previous variants of self-attention models are its special cases. Through
extensive experiments on neural machine translation, language modeling, graph
representation learning and image classification, we demonstrate SAC is
competitive with state-of-the-art models while significantly reducing memory
cost.
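A minimal, hedged sketch of the idea described in the abstract: each query attends only to a small set of linked nodes, so the attention map and aggregation involve n x num_links entries rather than n x n. This is not the authors' implementation; SAC learns which edges to construct, whereas the top-k dot-product edge selector below is an illustrative stand-in (and it still scores all pairs, which a learned edge constructor would avoid).

```python
import torch
import torch.nn.functional as F

def sparse_edge_attention(q, k, v, num_links: int):
    """q, k, v: (batch, seq_len, dim). Each query attends only to `num_links` linked nodes."""
    b, n, d = q.shape
    # NOTE: the full n x n score matrix is computed here only to pick edges;
    # SAC itself avoids this by learning the edges rather than scoring all pairs.
    scores = torch.matmul(q, k.transpose(-1, -2)) / d ** 0.5      # (b, n, n)
    topk_scores, topk_idx = scores.topk(num_links, dim=-1)        # (b, n, num_links)
    attn = F.softmax(topk_scores, dim=-1)                         # softmax over linked nodes only
    idx = topk_idx.unsqueeze(-1).expand(b, n, num_links, d)       # (b, n, num_links, d)
    v_linked = v.unsqueeze(1).expand(b, n, n, d).gather(2, idx)   # values of the linked nodes
    return (attn.unsqueeze(-1) * v_linked).sum(dim=2)             # (b, n, d)

# Toy usage: 16-token sequences, 8 links per token instead of all 16.
q = k = v = torch.randn(2, 16, 32)
print(sparse_edge_attention(q, k, v, num_links=8).shape)  # torch.Size([2, 16, 32])
```

With num_links fixed, the softmax and value aggregation touch only n * num_links key-value pairs per sequence, which is where the memory saving claimed in the abstract comes from.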
Related papers
- ELASTIC: Efficient Linear Attention for Sequential Interest Compression [5.689306819772134]
State-of-the-art sequential recommendation models heavily rely on the transformer's attention mechanism.
We propose ELASTIC, an Efficient Linear Attention for SequenTial Interest Compression.
We conduct extensive experiments on various public datasets and compare it with several strong sequential recommenders.
arXiv Detail & Related papers (2024-08-18T06:41:46Z)
- LongVQ: Long Sequence Modeling with Vector Quantization on Structured Memory [63.41820940103348]
The self-attention mechanism's computational cost limits its practicality for long sequences.
We propose a new method called LongVQ to compress the global abstraction into a length-fixed codebook.
LongVQ effectively maintains dynamic global and local patterns, which helps to compensate for missing long-range dependencies.
arXiv Detail & Related papers (2024-04-17T08:26:34Z)
- Retrieval-Enhanced Contrastive Vision-Text Models [61.783728119255365]
We propose to equip vision-text models with the ability to refine their embedding with cross-modal retrieved information from a memory at inference time.
Remarkably, we show that this can be done with a light-weight, single-layer, fusion transformer on top of a frozen CLIP.
Our experiments validate that our retrieval-enhanced contrastive (RECO) training improves CLIP performance substantially on several challenging fine-grained tasks.
arXiv Detail & Related papers (2023-06-12T15:52:02Z)
- A Unified View of Long-Sequence Models towards Modeling Million-Scale Dependencies [0.0]
We compare existing solutions to long-sequence modeling in terms of their pure mathematical formulation.
We then demonstrate that long context length does yield better performance, albeit application-dependent.
Inspired by emerging sparse models of huge capacity, we propose a machine learning system for handling million-scale dependencies.
arXiv Detail & Related papers (2023-02-13T09:47:31Z)
- DAE-Former: Dual Attention-guided Efficient Transformer for Medical Image Segmentation [3.9548535445908928]
We propose DAE-Former, a novel method that seeks to provide an alternative perspective by efficiently designing the self-attention mechanism.
Our method outperforms state-of-the-art methods on multi-organ cardiac and skin lesion segmentation datasets without requiring pre-training weights.
arXiv Detail & Related papers (2022-12-27T14:39:39Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- Graph Conditioned Sparse-Attention for Improved Source Code Understanding [0.0]
We propose conditioning a source code snippet on its graph modality by using the graph adjacency matrix as an attention mask for a sparse self-attention mechanism (a minimal sketch of this masking idea appears after this list).
Our model reaches state-of-the-art results in BLEU, METEOR, and ROUGE-L metrics for the code summarization task and near state-of-the-art accuracy in the variable misuse task.
arXiv Detail & Related papers (2021-12-01T17:21:55Z)
- ABC: Attention with Bounded-memory Control [67.40631793251997]
We show that several efficient attention methods can be subsumed into one abstraction, attention with bounded-memory control (ABC).
ABC reveals new, unexplored possibilities. First, it connects several efficient attention variants that would otherwise seem apart.
Last, we present a new instance of ABC, which draws inspiration from existing ABC approaches, but replaces their memory-organizing functions with a learned, contextualized one.
arXiv Detail & Related papers (2021-10-06T03:53:25Z)
- A Trainable Optimal Transport Embedding for Feature Aggregation and its Relationship to Attention [96.77554122595578]
We introduce a parametrized representation of fixed size, which embeds and then aggregates elements from a given input set according to the optimal transport plan between the set and a trainable reference.
Our approach scales to large datasets and allows end-to-end training of the reference, while also providing a simple unsupervised learning mechanism with small computational cost.
arXiv Detail & Related papers (2020-06-22T08:35:58Z)
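For the Graph Conditioned Sparse-Attention entry above, here is a minimal sketch of using a graph adjacency matrix as an attention mask over standard scaled dot-product attention. It is not the authors' implementation; the random graph and tensor shapes below are illustrative assumptions, whereas the real model derives the graph from the source code.

```python
import torch
import torch.nn.functional as F

def graph_masked_attention(q, k, v, adjacency):
    """q, k, v: (batch, n, dim); adjacency: (batch, n, n) boolean edge mask."""
    d = q.size(-1)
    scores = torch.matmul(q, k.transpose(-1, -2)) / d ** 0.5
    scores = scores.masked_fill(~adjacency, float("-inf"))  # block attention between unlinked tokens
    return torch.matmul(F.softmax(scores, dim=-1), v)

# Toy usage: a random graph with self-loops so every row keeps at least one edge.
n, dim = 6, 16
adj = (torch.rand(1, n, n) > 0.5) | torch.eye(n, dtype=torch.bool).unsqueeze(0)
q = k = v = torch.randn(1, n, dim)
print(graph_masked_attention(q, k, v, adj).shape)  # torch.Size([1, 6, 16])
```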
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.