Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers
- URL: http://arxiv.org/abs/2502.01770v1
- Date: Mon, 03 Feb 2025 19:24:01 GMT
- Title: Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers
- Authors: Mark Horton, Tergel Molom-Ochir, Peter Liu, Bhavna Gopal, Chiyue Wei, Cong Guo, Brady Taylor, Deliang Fan, Shan X. Wang, Hai Li, Yiran Chen
- Abstract summary: We introduce Hamming Attention Distillation (HAD), a framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains.
We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention.
- Score: 18.469378618426294
- License:
- Abstract: Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into {-1, +1} vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences.
Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference.
We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD achieves just $\mathbf{1.78}\%$ performance losses on GLUE compared to $9.08\%$ in state-of-the-art binarization work, and $\mathbf{2.5}\%$ performance losses on ImageNet compared to $12.14\%$, all while targeting custom hardware with a $\mathbf{79}\%$ area reduction and $\mathbf{87}\%$ power reduction compared to its standard attention counterpart.
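As a rough illustration of the mechanism described in the abstract, the following is a minimal NumPy sketch of sign-binarized, Hamming-distance-based attention with top-k sparsification. The `keep_ratio`, the tensor shapes, and the particular top-k pruning rule are illustrative assumptions, not the paper's exact recipe, and the distillation procedure itself is not covered.

```python
import numpy as np

def hamming_attention(Q, K, V, keep_ratio=0.1):
    """Single-head attention with sign-binarized queries/keys and top-k pruning (sketch)."""
    d = Q.shape[-1]
    Qb = np.where(Q >= 0, 1.0, -1.0)   # binarize queries to {-1, +1}
    Kb = np.where(K >= 0, 1.0, -1.0)   # binarize keys to {-1, +1}

    # For {-1, +1} vectors, dot(q, k) = d - 2 * Hamming(q, k), so attention scores
    # can be derived from Hamming distances (XOR + popcount on bit-packed vectors
    # in hardware) instead of floating-point multiply-accumulate.
    hamming = (d - Qb @ Kb.T) / 2
    scores = (d - 2.0 * hamming) / np.sqrt(d)

    # Sparsify the attention matrix: keep only the top-k scores per query
    # (an assumed stand-in for the paper's low-impact-activation pruning).
    k = max(1, int(keep_ratio * K.shape[0]))
    thresh = np.partition(scores, -k, axis=-1)[:, -k][:, None]
    scores = np.where(scores >= thresh, scores, -np.inf)

    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

# Example: 8 queries attending over 1024 keys/values of dimension 64.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(8, 64)), rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
print(hamming_attention(Q, K, V).shape)  # (8, 64)
```

Because the dot product of two {-1, +1} vectors is fully determined by their Hamming distance, hardware can replace the multiply-accumulate datapath with XOR plus popcount on bit-packed keys and queries, which is where the reported area and power reductions come from.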
Related papers
- Gated Slot Attention for Efficient Linear-Time Sequence Modeling [59.019501274074564]
Gated Slot Attention (GSA) enhances Attention with Bounded-memory-Control (ABC) by incorporating a gating mechanism inspired by Gated Linear Attention (GLA).
arXiv Detail & Related papers (2024-09-11T09:49:50Z)
- SLAB: Efficient Transformers with Simplified Linear Attention and Progressive Re-parameterized Batch Normalization [36.84275777364218]
This paper investigates the computational bottleneck modules of efficient transformers, i.e., normalization layers and attention modules.
LayerNorm is commonly used in transformer architectures but is not computationally friendly due to the statistics calculation required during inference.
We propose a novel method named PRepBN to progressively replace LayerNorm with re-parameterized BatchNorm during training.
arXiv Detail & Related papers (2024-05-19T15:22:25Z)
- Scene Adaptive Sparse Transformer for Event-based Object Detection [40.04162039970849]
We propose a Scene Adaptive Sparse Transformer (SAST) for event-based object detection.
SAST enables window-token co-sparsification, significantly enhancing fault tolerance and reducing computational overhead.
It outperforms all other dense and sparse networks in both performance and efficiency on two large-scale event-based object detection datasets.
arXiv Detail & Related papers (2024-04-02T12:15:25Z)
- Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions [101.08706223326928]
Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers.
In this paper, we seek to enable $\mathcal{O}(1)$ compute and memory cost per token in any pre-trained long convolution architecture.
arXiv Detail & Related papers (2023-10-28T18:40:03Z)
- CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z)
- Skip-Attention: Improving Vision Transformers by Paying Less Attention [55.47058516775423]
Vision transformers (ViTs) use expensive self-attention operations in every layer.
We propose SkipAt, a method to reuse self-attention from preceding layers to approximate attention at one or more subsequent layers.
We show the effectiveness of our method in image classification and self-supervised learning on ImageNet-1K, semantic segmentation on ADE20K, image denoising on SIDD, and video denoising on DAVIS.
arXiv Detail & Related papers (2023-01-05T18:59:52Z)
- HEAT: Hardware-Efficient Automatic Tensor Decomposition for Transformer Compression [69.36555801766762]
We propose a hardware-aware tensor decomposition framework, dubbed HEAT, that enables efficient exploration of the exponential space of possible decompositions.
We experimentally show that our hardware-aware factorized BERT variants reduce the energy-delay product by 5.7x with less than 1.1% accuracy loss.
arXiv Detail & Related papers (2022-11-30T05:31:45Z)
- EcoFormer: Energy-Saving Attention with Linear Complexity [40.002608785252164]
The Transformer is a transformative framework for modeling sequential data.
We propose a new binarization paradigm customized to high-dimensional softmax attention.
We show that EcoFormer consistently achieves comparable performance with standard attention.
arXiv Detail & Related papers (2022-09-19T13:28:32Z)
- Mesa: A Memory-saving Training Framework for Transformers [58.78933015299703]
We present Mesa, a memory-saving training framework for Transformers.
Mesa uses exact activations during the forward pass while storing a low-precision version of activations to reduce memory consumption during training.
Experiments on ImageNet, CIFAR-100 and ADE20K demonstrate that Mesa can halve the memory footprint during training.
arXiv Detail & Related papers (2021-11-22T11:23:01Z)
- Energon: Towards Efficient Acceleration of Transformers Using Dynamic Sparse Attention [5.495006023171481]
Transformer models have revolutionized Natural Language Processing (NLP) and also show promising performance on Computer Vision (CV) tasks.
We propose Energon, an algorithm-architecture co-design approach that accelerates various transformers using dynamic sparse attention.
We demonstrate that Energon achieves $161\times$ and $8.4\times$ geo-mean speedup and up to $10^4\times$ and $10^3\times$ energy reduction.
arXiv Detail & Related papers (2021-10-18T13:42:43Z)
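For contrast with HAD's binarized scoring, here is a generic, heavily simplified Python illustration of the dynamic sparse attention idea in the Energon summary above: a cheap low-precision pass ranks the keys, and full-precision attention is computed only over the survivors. The `quantize` helper, the bit width, and `keep_ratio` are illustrative assumptions, not Energon's actual filtering pipeline or hardware design.

```python
import numpy as np

def quantize(x, bits=4):
    """Crude symmetric quantization used only for the cheap filtering pass (an assumption)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1) + 1e-8
    return np.round(x / scale) * scale

def dynamic_sparse_attention(Q, K, V, keep_ratio=0.25, bits=4):
    d = Q.shape[-1]
    # Stage 1: approximate scores with quantized operands to rank the keys cheaply.
    approx = quantize(Q, bits) @ quantize(K, bits).T / np.sqrt(d)
    k = max(1, int(keep_ratio * K.shape[0]))
    keep = np.argsort(approx, axis=-1)[:, -k:]            # top-k keys per query

    # Stage 2: full-precision attention restricted to the selected keys.
    out = np.empty_like(Q)
    for i in range(Q.shape[0]):
        idx = keep[i]
        s = Q[i] @ K[idx].T / np.sqrt(d)
        w = np.exp(s - s.max()); w /= w.sum()
        out[i] = w @ V[idx]
    return out

# Example: 4 queries attending over 256 keys/values of dimension 64.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 64)), rng.normal(size=(256, 64)), rng.normal(size=(256, 64))
print(dynamic_sparse_attention(Q, K, V).shape)  # (4, 64)
```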