Related papers: Attamba: Attending To Multi-Token States

URL: http://arxiv.org/abs/2411.17685v1
Date: Tue, 26 Nov 2024 18:52:06 GMT
Title: Attamba: Attending To Multi-Token States
Authors: Yash Akhauri, Safeen Huda, Mohamed S. Abdelfattah
Abstract summary: We introduce Attamba, a novel architecture that uses state-space models to compress chunks of tokens. We find that replacing key and value projections in a transformer with SSMs can improve model quality and enable flexible token chunking. Attamba can perform attention on chunked-sequences of variable length, enabling a smooth transition between quadratic and linear scaling.
Score: 6.5676809841642125
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: When predicting the next token in a sequence, vanilla transformers compute attention over all previous tokens, resulting in quadratic scaling of compute with sequence length. State-space models compress the entire sequence of tokens into a fixed-dimensional representation to improve efficiency, while other architectures achieve sub-quadratic complexity via low-rank projections or sparse attention patterns over the sequence. In this paper, we introduce Attamba, a novel architecture that uses state-space models to compress chunks of tokens and applies attention on these compressed key-value representations. We find that replacing key and value projections in a transformer with SSMs can improve model quality and enable flexible token chunking, resulting in 24% improved perplexity compared with a transformer of similar KV-Cache and attention footprint, and ~4 times smaller KV-Cache and attention FLOPs for a 5% perplexity trade-off. Attamba can perform attention on chunked sequences of variable length, enabling a smooth transition between quadratic and linear scaling and offering adaptable efficiency gains.
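The chunking idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `chunked_kv_attention`, the scalar `decay` recurrence (a stand-in for a learned SSM), and the choice to carry state across chunk boundaries without resetting it are all assumptions made for illustration. It only shows how keeping one compressed state per chunk shrinks the number of key-value entries that a query attends over.

```python
import math


def chunked_kv_attention(q, keys, values, chunk_size, decay=0.9):
    """Attend one query over chunk-compressed keys and values.

    q: list of d floats, the query for the newest token.
    keys, values: lists of T vectors (each a list of d floats).
    chunk_size: tokens summarized per retained state (hypothetical knob).
    decay: hypothetical scalar recurrence weight, standing in for an SSM.
    Returns (output vector, number of retained key-value entries).
    """
    d = len(q)
    hk = [0.0] * d  # running key state
    hv = [0.0] * d  # running value state
    kept_k, kept_v = [], []
    last = len(keys) - 1
    for t, (k_t, v_t) in enumerate(zip(keys, values)):
        # Linear recurrence: the state mixes in every token seen so far.
        hk = [decay * h + (1.0 - decay) * x for h, x in zip(hk, k_t)]
        hv = [decay * h + (1.0 - decay) * x for h, x in zip(hv, v_t)]
        # Retain the state only at chunk boundaries and at the final token.
        if (t + 1) % chunk_size == 0 or t == last:
            kept_k.append(hk)
            kept_v.append(hv)
    # Standard scaled dot-product attention over the retained states.
    scores = [sum(a * b for a, b in zip(k_c, q)) / math.sqrt(d) for k_c in kept_k]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [sum(w * v_c[i] for w, v_c in zip(weights, kept_v)) for i in range(d)]
    return out, len(kept_k)
```

In this sketch, `chunk_size=1` retains one entry per token, so attention cost grows with sequence length as in a vanilla transformer, while `chunk_size` equal to the sequence length retains a single state, as in a pure state-space model. Intermediate values trade cache size against how much each retained state must summarize.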

Related papers

OmniSAT: Compact Action Token, Faster Auto Regression [70.70037017501357]
We introduce an Omni Swift Action Tokenizer, which learns a compact, transferable action representation. The resulting discrete tokenization shortens the training sequence by 6.8× and lowers the target entropy.
arXiv Detail & Related papers (2025-10-08T03:55:24Z)
SCOUT: Toward Sub-Quadratic Attention via Segment Compression for Optimized Utility in Transformers [15.142822497807236]
We propose SCOUT, a hybrid architecture that compresses tokens locally within fixed-size segments and applies attention only over these compressed representations. SCOUT retains much of the expressivity of full attention while substantially reducing the computational and memory cost. We analyze SCOUT's computational and memory efficiency and evaluate it empirically on long-context language modeling and reasoning tasks.
arXiv Detail & Related papers (2025-08-31T17:08:33Z)
Efficient Pretraining Length Scaling [21.4715211093876]
We present the Parallel Hidden Decoding Transformer (PHD-Transformer), a novel framework that enables efficient length scaling during pre-training. PHD-Transformer achieves this through an innovative KV cache management strategy that distinguishes between original tokens and hidden decoding tokens.
arXiv Detail & Related papers (2025-04-21T09:41:26Z)
More Tokens, Lower Precision: Towards the Optimal Token-Precision Trade-off in KV Cache Compression [71.42818367729573]
In large language models (LLMs), the memory usage of KV cache has become a critical bottleneck during inference. The mainstream KV compression methods, including KV pruning and KV quantization, primarily focus on either token or precision dimension separately. In this paper, we comprehensively investigate the token-precision trade-off in KV cache compression.
arXiv Detail & Related papers (2024-12-17T09:20:31Z)
3D representation in 512-Byte: Variational tokenizer is the key for autoregressive 3D generation [4.221298212125194]
Variational Tokenizer (VAT) transforms unordered 3D data into compact latent tokens with an implicit hierarchy. VAT enables scalable and efficient 3D generation, outperforming existing methods in quality, efficiency, and generalization.
arXiv Detail & Related papers (2024-12-03T06:31:25Z)
LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs). Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages; and (2) KV cache compression at test time. We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining. Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
arXiv Detail & Related papers (2024-10-04T03:10:53Z)
Efficient Time Series Processing for Transformers and State-Space Models through Token Merging [44.27818172708914]
Token merging has been shown to considerably improve the throughput of vision transformer architectures. We introduce local merging, a domain-specific token merging algorithm that selectively combines tokens within a local neighborhood. On the recently proposed Chronos foundation model, we achieve accelerations of up to 5400% with only minor accuracy degradations.
arXiv Detail & Related papers (2024-05-28T08:28:18Z)
Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions. Mask2Former uses 50% of its compute only on the transformer encoder. This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer. We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference [2.8241099113277666]
"Keyformer" is an innovative inference-time approach to mitigate the challenges associated with KV cache size and memory bandwidth utilization. We evaluate Keyformer's performance across three foundational models: GPT-J, Cerebras-GPT, and MPT.
arXiv Detail & Related papers (2024-03-14T02:42:42Z)
ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention. Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count. The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
We set transformers in this work and incorporate them into a hierarchical framework for shape classification and part and scene segmentation. We also compute efficient and dynamic global cross attentions by leveraging sampling and grouping at each iteration. The proposed hierarchical model achieves state-of-the-art shape classification in mean accuracy and yields results on par with the previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
PSViT: Better Vision Transformer via Token Pooling and Attention Sharing [114.8051035856023]
We propose PSViT, a ViT with token Pooling and attention Sharing, to reduce redundancy. Experimental results show that the proposed scheme can achieve up to 6.6% accuracy improvement in ImageNet classification.
arXiv Detail & Related papers (2021-08-07T11:30:54Z)
Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity. We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention. An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.