Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers
- URL: http://arxiv.org/abs/2502.01770v1
- Date: Mon, 03 Feb 2025 19:24:01 GMT
- Title: Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers
- Authors: Mark Horton, Tergel Molom-Ochir, Peter Liu, Bhavna Gopal, Chiyue Wei, Cong Guo, Brady Taylor, Deliang Fan, Shan X. Wang, Hai Li, Yiran Chen
- Abstract summary: We introduce Hamming Attention Distillation (HAD), a framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains.
We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention.
- Score: 18.469378618426294
- License:
- Abstract: Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences.
Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference.
We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD achieves just $\mathbf{1.78}\%$ performance losses on GLUE compared to $9.08\%$ in state-of-the-art binarization work, and $\mathbf{2.5}\%$ performance losses on ImageNet compared to $12.14\%$, all while targeting custom hardware with a $\mathbf{79}\%$ area reduction and $\mathbf{87}\%$ power reduction compared to its standard attention counterpart.
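As a rough illustration of the mechanism described in the abstract, the following is a minimal NumPy sketch of sign-binarized, Hamming-distance-based attention with top-k sparsification. The `keep_ratio`, the tensor shapes, and the particular top-k pruning rule are illustrative assumptions, not the paper's exact recipe, and the distillation procedure itself is not covered.

```python
import numpy as np

def hamming_attention(Q, K, V, keep_ratio=0.1):
    """Single-head attention with sign-binarized queries/keys and top-k pruning (sketch)."""
    d = Q.shape[-1]
    Qb = np.where(Q >= 0, 1.0, -1.0)   # binarize queries to {-1, +1}
    Kb = np.where(K >= 0, 1.0, -1.0)   # binarize keys to {-1, +1}

    # For {-1, +1} vectors, dot(q, k) = d - 2 * Hamming(q, k), so attention scores
    # can be derived from Hamming distances (XOR + popcount on bit-packed vectors
    # in hardware) instead of floating-point multiply-accumulate.
    hamming = (d - Qb @ Kb.T) / 2
    scores = (d - 2.0 * hamming) / np.sqrt(d)

    # Sparsify the attention matrix: keep only the top-k scores per query
    # (an assumed stand-in for the paper's low-impact-activation pruning).
    k = max(1, int(keep_ratio * K.shape[0]))
    thresh = np.partition(scores, -k, axis=-1)[:, -k][:, None]
    scores = np.where(scores >= thresh, scores, -np.inf)

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: 8 queries attending over 1024 keys/values of dimension 64.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(8, 64)), rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
print(hamming_attention(Q, K, V).shape)  # (8, 64)
```

Because the dot product of two {-1, +1} vectors is fully determined by their Hamming distance, hardware can replace the multiply-accumulate datapath with XOR plus popcount on bit-packed keys and queries, which is where the reported area and power reductions come from.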
Related papers
- Gated Slot Attention for Efficient Linear-Time Sequence Modeling [59.019501274074564]
Gated Slot Attention (GSA) enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA).
arXiv Detail & Related papers (2024-09-11T09:49:50Z)
- SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization [36.84275777364218]
This paper investigates the computational bottleneck modules of efficient transformers, i.e., normalization layers and attention modules.
LayerNorm is commonly used in transformer architectures but is not computationally friendly due to the statistics calculation required during inference.
We propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm during training.
arXiv Detail & Related papers (2024-05-19T15:22:25Z)
- Scene Adaptive Sparse Transformer for Event-based Object Detection [40.04162039970849]
We propose a Scene Adaptive Sparse Transformer (SAST) for event-based object detection.
SAST enables window-token co-sparsification, significantly enhancing fault tolerance and reducing computational overhead.
It outperforms all other dense and sparse networks in both performance and efficiency on two large-scale event-based object detection datasets.
arXiv Detail & Related papers (2024-04-02T12:15:25Z)
- Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions [101.08706223326928]
Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers.
In this paper, we seek to enable $\mathcal{O}(1)$ compute and memory cost per token in any pre-trained long convolution architecture.
arXiv Detail & Related papers (2023-10-28T18:40:03Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- EcoFormer: Energy-Saving Attention with Linear Complexity [40.002608785252164]
The Transformer is a transformative framework for modeling sequential data.
We propose a new binarization paradigm customized to high-dimensional softmax attention.
We show that EcoFormer consistently achieves comparable performance with standard attention.
arXiv Detail & Related papers (2022-09-19T13:28:32Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention [5.495006023171481]
Transformer models have revolutionized Natural Language Processing (NLP) and also show promising performance on Computer Vision (CV) tasks.
We propose Energon, an algorithm-architecture co-design approach that accelerates various transformers using dynamic sparse attention.
We demonstrate that Energon achieves $161\times$ and $8.4\times$ geo-mean speedup and up to $10^4\times$ and $10^3\times$ energy reduction.
arXiv Detail & Related papers (2021-10-18T13:42:43Z)
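For contrast with HAD's binarized scoring, here is a generic, heavily simplified Python illustration of the dynamic sparse attention idea in the Energon summary above: a cheap low-precision pass ranks the keys, and full-precision attention is computed only over the survivors. The `quantize` helper, the bit width, and `keep_ratio` are illustrative assumptions, not Energon's actual filtering pipeline or hardware design.

```python
import numpy as np

def quantize(x, bits=4):
    """Crude symmetric quantization used only for the cheap filtering pass (an assumption)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-8
    return np.round(x / scale) * scale

def dynamic_sparse_attention(Q, K, V, keep_ratio=0.25, bits=4):
    d = Q.shape[-1]
    # Stage 1: approximate scores with quantized operands to rank the keys cheaply.
    approx = quantize(Q, bits) @ quantize(K, bits).T / np.sqrt(d)
    k = max(1, int(keep_ratio * K.shape[0]))
    keep = np.argsort(approx, axis=-1)[:, -k:]            # top-k keys per query

    # Stage 2: full-precision attention restricted to the selected keys.
    out = np.empty_like(Q)
    for i in range(Q.shape[0]):
        idx = keep[i]
        s = Q[i] @ K[idx].T / np.sqrt(d)
        w = np.exp(s - s.max()); w /= w.sum()
        out[i] = w @ V[idx]
    return out

# Example: 4 queries attending over 256 keys/values of dimension 64.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 64)), rng.normal(size=(256, 64)), rng.normal(size=(256, 64))
print(dynamic_sparse_attention(Q, K, V).shape)  # (4, 64)
```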