SMYRF: Efficient Attention using Asymmetric Clustering
- URL: http://arxiv.org/abs/2010.05315v1
- Date: Sun, 11 Oct 2020 18:49:17 GMT
- Title: SMYRF: Efficient Attention using Asymmetric Clustering
- Authors: Giannis Daras, Nikita Kitaev, Augustus Odena, Alexandros G. Dimakis
- Abstract summary: We propose a novel type of balanced clustering algorithm to approximate attention.
SMYRF can be used as a drop-in replacement for dense attention layers without any retraining.
We show that SMYRF can be used interchangeably with dense attention before and after training.
- Score: 103.47647577048782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel type of balanced clustering algorithm to approximate
attention. Attention complexity is reduced from $O(N^2)$ to $O(N \log N)$,
where $N$ is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive
Hashing (LSH) in a novel way by defining new Asymmetric transformations and an
adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF
is that it can be used as a drop-in replacement for dense attention layers
without any retraining. In contrast, prior fast attention methods impose
constraints (e.g., queries and keys sharing the same vector representations) and
require re-training from scratch. We apply our method to pre-trained
state-of-the-art Natural Language Processing and Computer Vision models and we
report significant memory and speed benefits. Notably, SMYRF-BERT outperforms
(slightly) BERT on GLUE, while using $50\%$ less memory. We also show that
SMYRF can be used interchangeably with dense attention before and after
training. Finally, we use SMYRF to train GANs with attention in high
resolutions. Using a single TPU, we were able to scale attention to 128x128=16k
and 256x256=65k tokens on BigGAN on CelebA-HQ.
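The core mechanism, hashing queries and keys with an asymmetric LSH and attending only within balanced buckets, can be illustrated with a short sketch. The code below is not the paper's implementation: the norm-equalizing padding, the random-hyperplane hash, the fixed equal-size splits, and all names (`asymmetric_transform`, `balanced_lsh_attention`) are simplified stand-ins for the asymmetric transformations and adaptive balancing scheme described in the abstract.
```python
# Illustrative sketch of balanced LSH-clustered attention in the spirit of SMYRF.
import numpy as np

def asymmetric_transform(q, k):
    """Pad queries/keys so inner products behave like distances, a standard
    trick from asymmetric LSH for maximum inner product search (simplified)."""
    q_norm = np.linalg.norm(q, axis=-1, keepdims=True)
    k_norm = np.linalg.norm(k, axis=-1, keepdims=True)
    max_q, max_k = q_norm.max(), k_norm.max()
    q_ext = np.concatenate([q, np.sqrt(max_q**2 - q_norm**2 + 1e-9),
                            np.zeros_like(q_norm)], axis=-1)
    k_ext = np.concatenate([k, np.zeros_like(k_norm),
                            np.sqrt(max_k**2 - k_norm**2 + 1e-9)], axis=-1)
    return q_ext, k_ext

def balanced_lsh_attention(q, k, v, n_clusters=8, seed=0):
    """Hash transformed queries/keys with shared random hyperplanes, sort by
    hash code, split into equal-size clusters, and attend only within each
    cluster: roughly O(N * N / n_clusters) instead of O(N^2)."""
    q_ext, k_ext = asymmetric_transform(q, k)
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(q_ext.shape[-1], 32))
    q_code = (q_ext @ planes > 0) @ (2 ** np.arange(32))
    k_code = (k_ext @ planes > 0) @ (2 ** np.arange(32))
    # Sorting + equal splits gives balanced clusters; the paper instead uses
    # an adaptive scheme to balance buckets.
    q_order, k_order = np.argsort(q_code), np.argsort(k_code)
    out = np.zeros_like(v)
    for q_idx, k_idx in zip(np.array_split(q_order, n_clusters),
                            np.array_split(k_order, n_clusters)):
        scores = q[q_idx] @ k[k_idx].T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[q_idx] = weights @ v[k_idx]
    return out
```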
Related papers
- Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification [53.727688136434345]
Graph Neural Networks (GNNs) have shown superior performance in node classification.
We present Fast Graph Sharpness-Aware Minimization (FGSAM) that integrates the rapid training of Multi-Layer Perceptrons with the superior performance of GNNs.
Our proposed algorithm outperforms the standard SAM with lower computational costs in FSNC tasks.
arXiv Detail & Related papers (2024-10-22T09:33:29Z) - MagicPIG: LSH Sampling for Efficient LLM Generation [41.75038064509643]
We show that TopK attention itself suffers from quality degradation in certain downstream tasks because attention is not always as sparse as expected.
We propose MagicPIG, a heterogeneous system based on Locality Sensitive Hashing (LSH)
MagicPIG significantly reduces the workload of attention while preserving high accuracy for diverse tasks.
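As a rough illustration of how LSH can restrict which keys a query attends to: MagicPIG's actual contribution is a heterogeneous CPU/GPU sampling system, so the SimHash collision filter and all names below are our own simplification rather than the paper's method.
```python
# Hedged sketch: attend only over keys that collide with the query in some hash table.
import numpy as np

def lsh_sample_attention(q, K, V, n_tables=4, n_bits=8, seed=0):
    """q: (d,), K: (N, d), V: (N, dv). Gather keys colliding with q in at least
    one SimHash table, then run softmax attention on that subset only."""
    rng = np.random.default_rng(seed)
    sampled = set()
    for _ in range(n_tables):
        planes = rng.normal(size=(K.shape[1], n_bits))
        k_codes = (K @ planes > 0) @ (2 ** np.arange(n_bits))
        q_code = (q @ planes > 0) @ (2 ** np.arange(n_bits))
        sampled.update(np.flatnonzero(k_codes == q_code))
    idx = np.array(sorted(sampled))
    if idx.size == 0:                      # fall back to full attention if nothing collides
        idx = np.arange(K.shape[0])
    scores = K[idx] @ q / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V[idx]
```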
arXiv Detail & Related papers (2024-10-21T16:44:51Z) - A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.
HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.
We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z) - Towards Zero Memory Footprint Spiking Neural Network Training [7.4331790419913455]
Spiking Neural Networks (SNNs) process information using discrete-time events known as spikes rather than continuous values.
In this paper, we introduce an innovative framework characterized by a remarkably low memory footprint.
Our design is able to achieve a $\mathbf{58.65\times}$ reduction in memory usage compared to the current SNN node.
arXiv Detail & Related papers (2023-08-16T19:49:24Z) - SKI to go Faster: Accelerating Toeplitz Neural Networks via Asymmetric
Kernels [69.47358238222586]
Toeplitz Neural Networks (TNNs) are a recent sequence model with impressive results.
We aim to reduce the $O(n)$ computational complexity and the $O(n)$ calls to the relative positional encoder (RPE) multi-layer perceptron (MLP) and decay bias.
For bidirectional models, this motivates a sparse plus low-rank Toeplitz matrix decomposition.
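A toy example of the sparse plus low-rank idea for a Toeplitz operator: a narrow band captures the near-diagonal interactions and a low-rank factorization captures the rest, so applying the operator costs roughly O(n*(band+rank)) instead of O(n^2). The synthetic decay kernel and the truncated SVD below are illustrative stand-ins for the paper's SKI-style construction.
```python
# Hedged sketch of a sparse + low-rank approximation of a Toeplitz mixing matrix.
import numpy as np
from scipy.linalg import toeplitz

n, band, rank = 256, 16, 8
t = np.exp(-0.05 * np.arange(n))          # toy decay kernel generating the Toeplitz matrix
T = toeplitz(t)                           # full O(n^2) operator we want to avoid applying

# Sparse part: keep a narrow band around the diagonal (dense here for simplicity;
# stored sparsely, applying it costs O(n * band)).
dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
S = np.where(dist <= band, T, 0.0)

# Low-rank part: best rank-r approximation of the remainder, kept in factored form
# so that applying it costs O(n * rank).
U, s, Vt = np.linalg.svd(T - S)
Ur, sr, Vr = U[:, :rank], s[:rank], Vt[:rank]

x = np.random.randn(n)
approx = S @ x + Ur @ (sr * (Vr @ x))
print(np.linalg.norm(T @ x - approx) / np.linalg.norm(T @ x))  # relative error of the approximation
```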
arXiv Detail & Related papers (2023-05-15T21:25:35Z) - FsaNet: Frequency Self-attention for Semantic Segmentation [5.495952636982018]
We propose a new self-attention mechanism with highly reduced computational complexity, up to a linear rate.
Through an ablation study, we show that low-frequency self-attention can achieve performance very close to, or better than, full-frequency self-attention.
We show that frequency self-attention requires $87.29\% \sim 90.04\%$ less memory, $96.13\% \sim 98.07\%$ fewer FLOPs, and $97.56\% \sim 98.18\%$ less run time.
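A minimal sketch of where these savings come from: attention is computed only over the lowest DCT coefficients of a feature map, so its cost depends on the number of retained frequencies rather than on H*W. The crop size, the similarity-weighted mixing, and the function name below are simplifications, not FsaNet's actual formulation.
```python
# Hedged sketch of low-frequency self-attention over DCT coefficients.
import numpy as np
from scipy.fft import dctn, idctn

def low_freq_attention(x, keep=8):
    """x: (C, H, W) feature map; attend over keep*keep low-frequency tokens."""
    C, H, W = x.shape
    freq = dctn(x, axes=(1, 2), norm="ortho")
    low = freq[:, :keep, :keep].reshape(C, -1)           # tokens = low-frequency coefficients
    scores = low.T @ low / np.sqrt(C)                    # (keep^2, keep^2) instead of (HW, HW)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out_low = (low @ weights).reshape(C, keep, keep)
    freq_out = np.zeros_like(freq)
    freq_out[:, :keep, :keep] = out_low                  # high frequencies left untouched (zeroed here)
    return idctn(freq_out, axes=(1, 2), norm="ortho")
```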
arXiv Detail & Related papers (2022-11-28T17:49:46Z) - Training Your Sparse Neural Network Better with Any Mask [106.134361318518]
Pruning large neural networks to create high-quality, independently trainable sparse masks is desirable.
In this paper we demonstrate an alternative opportunity: one can customize the sparse training techniques to deviate from the default dense network training protocols.
Our new sparse training recipe is generally applicable to improving training from scratch with various sparse masks.
arXiv Detail & Related papers (2022-06-26T00:37:33Z) - Adaptive Filters and Aggregator Fusion for Efficient Graph Convolutions [11.769185588579488]
We present state-of-the-art performance with lower memory consumption and latency, along with characteristics suited to accelerator implementation.
Our proposal uses memory proportional to the number of vertices in the graph, in contrast to competing methods which require memory proportional to the number of edges.
We propose aggregator fusion, a technique that lets GNNs significantly boost their representational power with only a 19% increase in latency over standard sparse matrix multiplication.
arXiv Detail & Related papers (2021-04-03T20:54:36Z) - You Only Spike Once: Improving Energy-Efficient Neuromorphic Inference
to ANN-Level Accuracy [51.861168222799186]
Spiking Neural Networks (SNNs) are a type of neuromorphic, or brain-inspired network.
SNNs are sparse, accessing very few weights, and typically only use addition operations instead of the more power-intensive multiply-and-accumulate operations.
In this work, we aim to overcome the limitations of TTFS-encoded neuromorphic systems.
arXiv Detail & Related papers (2020-06-03T15:55:53Z) - Hashing-based Non-Maximum Suppression for Crowded Object Detection [63.761451382081844]
We propose an algorithm, named hashing-based non-maximum suppression (HNMS) to efficiently suppress the non-maximum boxes for object detection.
For two-stage detectors, we replace NMS in region proposal network with HNMS, and observe significant speed-up with comparable accuracy.
Experiments are conducted on CARPK, SKU-110K, CrowdHuman datasets to demonstrate the efficiency and effectiveness of HNMS.
arXiv Detail & Related papers (2020-05-22T23:45:59Z)
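A minimal sketch of the hashing idea behind HNMS: similar boxes are mapped to the same hash cell and only the top-scoring box per cell is kept, avoiding pairwise IoU comparisons. The quantization of box center and log-size, the cell width, and the function name are illustrative assumptions; the paper's cell construction differs.
```python
# Hedged sketch of hashing-based non-maximum suppression.
import numpy as np

def hashed_nms(boxes, scores, cell=0.1):
    """boxes: (N, 4) as [x1, y1, x2, y2] with positive width/height;
    keep the index of the top-scoring box in each hash cell."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    # Quantize center and log-size so near-duplicate boxes collide in one cell.
    keys = np.floor(np.stack([cx / cell, cy / cell,
                              np.log(w) / cell, np.log(h) / cell]).T).astype(int)
    best = {}
    for i, key in enumerate(map(tuple, keys)):
        if key not in best or scores[i] > scores[best[key]]:
            best[key] = i
    return sorted(best.values())
```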