SMYRF: Efficient Attention using Asymmetric Clustering
- URL: http://arxiv.org/abs/2010.05315v1
- Date: Sun, 11 Oct 2020 18:49:17 GMT
- Title: SMYRF: Efficient Attention using Asymmetric Clustering
- Authors: Giannis Daras, Nikita Kitaev, Augustus Odena, Alexandros G. Dimakis
- Abstract summary: We propose a novel type of balanced clustering algorithm to approximate attention.
SMYRF can be used as a drop-in replacement for dense attention layers without any retraining.
We show that SMYRF can be used interchangeably with dense attention before and after training.
- Score: 103.47647577048782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel type of balanced clustering algorithm to approximate
attention. Attention complexity is reduced from $O(N^2)$ to $O(N \log N)$,
where $N$ is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive
Hashing (LSH) in a novel way by defining new Asymmetric transformations and an
adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF
is that it can be used as a drop-in replacement for dense attention layers
without any retraining. In contrast, prior fast attention methods impose
constraints (e.g., queries and keys sharing the same vector representations) and
require re-training from scratch. We apply our method to pre-trained
state-of-the-art Natural Language Processing and Computer Vision models and we
report significant memory and speed benefits. Notably, SMYRF-BERT outperforms
(slightly) BERT on GLUE, while using $50\%$ less memory. We also show that
SMYRF can be used interchangeably with dense attention before and after
training. Finally, we use SMYRF to train GANs with attention in high
resolutions. Using a single TPU, we were able to scale attention to 128x128=16k
and 256x256=65k tokens on BigGAN on CelebA-HQ.
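The core mechanism, hashing queries and keys with an asymmetric LSH and attending only within balanced buckets, can be illustrated with a short sketch. The code below is not the paper's implementation: the norm-equalizing padding, the random-hyperplane hash, the fixed equal-size splits, and all names (`asymmetric_transform`, `balanced_lsh_attention`) are simplified stand-ins for the asymmetric transformations and adaptive balancing scheme described in the abstract.
```python
# Illustrative sketch of balanced LSH-clustered attention in the spirit of SMYRF.
import numpy as np

def asymmetric_transform(q, k):
    """Pad queries/keys so inner products behave like distances, a standard
    trick from asymmetric LSH for maximum inner product search (simplified)."""
    q_norm = np.linalg.norm(q, axis=-1, keepdims=True)
    k_norm = np.linalg.norm(k, axis=-1, keepdims=True)
    max_q, max_k = q_norm.max(), k_norm.max()
    q_ext = np.concatenate([q, np.sqrt(max_q**2 - q_norm**2 + 1e-9),
                            np.zeros_like(q_norm)], axis=-1)
    k_ext = np.concatenate([k, np.zeros_like(k_norm),
                            np.sqrt(max_k**2 - k_norm**2 + 1e-9)], axis=-1)
    return q_ext, k_ext

def balanced_lsh_attention(q, k, v, n_clusters=8, seed=0):
    """Hash transformed queries/keys with shared random hyperplanes, sort by
    hash code, split into equal-size clusters, and attend only within each
    cluster: roughly O(N * N / n_clusters) instead of O(N^2)."""
    q_ext, k_ext = asymmetric_transform(q, k)
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(q_ext.shape[-1], 32))
    q_code = (q_ext @ planes > 0) @ (2 ** np.arange(32))
    k_code = (k_ext @ planes > 0) @ (2 ** np.arange(32))
    # Sorting + equal splits gives balanced clusters; the paper instead uses
    # an adaptive scheme to balance buckets.
    q_order, k_order = np.argsort(q_code), np.argsort(k_code)
    out = np.zeros_like(v)
    for q_idx, k_idx in zip(np.array_split(q_order, n_clusters),
                            np.array_split(k_order, n_clusters)):
        scores = q[q_idx] @ k[k_idx].T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        out[q_idx] = weights @ v[k_idx]
    return out
```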
Related papers
- Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification [53.727688136434345]
Graph Neural Networks (GNNs) have shown superior performance in node classification.
We present Fast Graph Sharpness-Aware Minimization (FGSAM) that integrates the rapid training of Multi-Layer Perceptrons with the superior performance of GNNs.
Our proposed algorithm outperforms the standard SAM with lower computational costs in FSNC tasks.
arXiv Detail & Related papers (2024-10-22T09:33:29Z) - MagicPIG: LSH Sampling for Efficient LLM Generation [41.75038064509643]
We show that TopK attention itself suffers from quality degradation in certain downstream tasks because attention is not always as sparse as expected.
We propose MagicPIG, a heterogeneous system based on Locality Sensitive Hashing (LSH)
MagicPIG significantly reduces the workload of attention while preserving high accuracy for diverse tasks.
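As a rough illustration of how LSH can restrict which keys a query attends to: MagicPIG's actual contribution is a heterogeneous CPU/GPU sampling system, so the SimHash collision filter and all names below are our own simplification rather than the paper's method.
```python
# Hedged sketch: attend only over keys that collide with the query in some hash table.
import numpy as np

def lsh_sample_attention(q, K, V, n_tables=4, n_bits=8, seed=0):
    """q: (d,), K: (N, d), V: (N, dv). Gather keys colliding with q in at least
    one SimHash table, then run softmax attention on that subset only."""
    rng = np.random.default_rng(seed)
    sampled = set()
    for _ in range(n_tables):
        planes = rng.normal(size=(K.shape[1], n_bits))
        k_codes = (K @ planes > 0) @ (2 ** np.arange(n_bits))
        q_code = (q @ planes > 0) @ (2 ** np.arange(n_bits))
        sampled.update(np.flatnonzero(k_codes == q_code))
    idx = np.array(sorted(sampled))
    if idx.size == 0:                      # fall back to full attention if nothing collides
        idx = np.arange(K.shape[0])
    scores = K[idx] @ q / np.sqrt(K.shape[1])
    w = np.exp(scores - scores.max())
    return (w / w.sum()) @ V[idx]
```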
arXiv Detail & Related papers (2024-10-21T16:44:51Z) - A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.
HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.
We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z) - Towards Zero Memory Footprint Spiking Neural Network Training [7.4331790419913455]
Spiking Neural Networks (SNNs) process information using discrete-time events known as spikes rather than continuous values.
In this paper, we introduce an innovative framework characterized by a remarkably low memory footprint.
Our design is able to achieve a $\mathbf{58.65\times}$ reduction in memory usage compared to the current SNN node.
arXiv Detail & Related papers (2023-08-16T19:49:24Z) - SKI to go Faster: Accelerating Toeplitz Neural Networks via Asymmetric
Kernels [69.47358238222586]
Toeplitz Neural Networks (TNNs) are a recent sequence model with impressive results.
We aim to reduce the $O(n)$ computational complexity and the $O(n)$ calls to the relative positional encoder (RPE) multi-layer perceptron (MLP) and decay bias.
For bidirectional models, this motivates a sparse plus low-rank Toeplitz matrix decomposition.
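A toy example of the sparse plus low-rank idea for a Toeplitz operator: a narrow band captures the near-diagonal interactions and a low-rank factorization captures the rest, so applying the operator costs roughly O(n*(band+rank)) instead of O(n^2). The synthetic decay kernel and the truncated SVD below are illustrative stand-ins for the paper's SKI-style construction.
```python
# Hedged sketch of a sparse + low-rank approximation of a Toeplitz mixing matrix.
import numpy as np
from scipy.linalg import toeplitz

n, band, rank = 256, 16, 8
t = np.exp(-0.05 * np.arange(n))          # toy decay kernel generating the Toeplitz matrix
T = toeplitz(t)                           # full O(n^2) operator we want to avoid applying

# Sparse part: keep a narrow band around the diagonal (dense here for simplicity;
# stored sparsely, applying it costs O(n * band)).
dist = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
S = np.where(dist <= band, T, 0.0)

# Low-rank part: best rank-r approximation of the remainder, kept in factored form
# so that applying it costs O(n * rank).
U, s, Vt = np.linalg.svd(T - S)
Ur, sr, Vr = U[:, :rank], s[:rank], Vt[:rank]

x = np.random.randn(n)
approx = S @ x + Ur @ (sr * (Vr @ x))
print(np.linalg.norm(T @ x - approx) / np.linalg.norm(T @ x))  # relative error of the approximation
```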
arXiv Detail & Related papers (2023-05-15T21:25:35Z) - FsaNet: Frequency Self-attention for Semantic Segmentation [5.495952636982018]
We propose a new self-attention mechanism with highly reduced computational complexity, up to a linear rate.
Through an ablation study, we show that low-frequency self-attention can achieve performance very close to, or better than, full-frequency self-attention.
We show that frequency self-attention requires $87.29\% \sim 90.04\%$ less memory, $96.13\% \sim 98.07\%$ fewer FLOPs, and $97.56\% \sim 98.18\%$ less run time.
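A minimal sketch of where these savings come from: attention is computed only over the lowest DCT coefficients of a feature map, so its cost depends on the number of retained frequencies rather than on H*W. The crop size, the similarity-weighted mixing, and the function name below are simplifications, not FsaNet's actual formulation.
```python
# Hedged sketch of low-frequency self-attention over DCT coefficients.
import numpy as np
from scipy.fft import dctn, idctn

def low_freq_attention(x, keep=8):
    """x: (C, H, W) feature map; attend over keep*keep low-frequency tokens."""
    C, H, W = x.shape
    freq = dctn(x, axes=(1, 2), norm="ortho")
    low = freq[:, :keep, :keep].reshape(C, -1)           # tokens = low-frequency coefficients
    scores = low.T @ low / np.sqrt(C)                    # (keep^2, keep^2) instead of (HW, HW)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    out_low = (low @ weights).reshape(C, keep, keep)
    freq_out = np.zeros_like(freq)
    freq_out[:, :keep, :keep] = out_low                  # high frequencies left untouched (zeroed here)
    return idctn(freq_out, axes=(1, 2), norm="ortho")
```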
arXiv Detail & Related papers (2022-11-28T17:49:46Z) - Training Your Sparse Neural Network Better with Any Mask [106.134361318518]
Pruning large neural networks to create high-quality, independently trainable sparse masks is desirable.
In this paper we demonstrate an alternative opportunity: one can customize the sparse training techniques to deviate from the default dense network training protocols.
Our new sparse training recipe is generally applicable to improving training from scratch with various sparse masks.
arXiv Detail & Related papers (2022-06-26T00:37:33Z) - Adaptive Filters and Aggregator Fusion for Efficient Graph Convolutions [11.769185588579488]
We present state-of-the-art performance with lower memory consumption and latency, along with characteristics suited to accelerator implementation.
Our proposal uses memory proportional to the number of vertices in the graph, in contrast to competing methods which require memory proportional to the number of edges.
We propose aggregator fusion, a technique that lets GNNs significantly boost their representational power with only a 19% increase in latency over standard sparse matrix multiplication.
arXiv Detail & Related papers (2021-04-03T20:54:36Z) - You Only Spike Once: Improving Energy-Efficient Neuromorphic Inference
to ANN-Level Accuracy [51.861168222799186]
Spiking Neural Networks (SNNs) are a type of neuromorphic, or brain-inspired network.
SNNs are sparse, accessing very few weights, and typically only use addition operations instead of the more power-intensive multiply-and-accumulate operations.
In this work, we aim to overcome the limitations of TTFS-encoded neuromorphic systems.
arXiv Detail & Related papers (2020-06-03T15:55:53Z) - Hashing-based Non-Maximum Suppression for Crowded Object Detection [63.761451382081844]
We propose an algorithm, named hashing-based non-maximum suppression (HNMS) to efficiently suppress the non-maximum boxes for object detection.
For two-stage detectors, we replace NMS in region proposal network with HNMS, and observe significant speed-up with comparable accuracy.
Experiments are conducted on CARPK, SKU-110K, CrowdHuman datasets to demonstrate the efficiency and effectiveness of HNMS.
arXiv Detail & Related papers (2020-05-22T23:45:59Z)
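A minimal sketch of the hashing idea behind HNMS: similar boxes are mapped to the same hash cell and only the top-scoring box per cell is kept, avoiding pairwise IoU comparisons. The quantization of box center and log-size, the cell width, and the function name are illustrative assumptions; the paper's cell construction differs.
```python
# Hedged sketch of hashing-based non-maximum suppression.
import numpy as np

def hashed_nms(boxes, scores, cell=0.1):
    """boxes: (N, 4) as [x1, y1, x2, y2] with positive width/height;
    keep the index of the top-scoring box in each hash cell."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    w = boxes[:, 2] - boxes[:, 0]
    h = boxes[:, 3] - boxes[:, 1]
    # Quantize center and log-size so near-duplicate boxes collide in one cell.
    keys = np.floor(np.stack([cx / cell, cy / cell,
                              np.log(w) / cell, np.log(h) / cell]).T).astype(int)
    best = {}
    for i, key in enumerate(map(tuple, keys)):
        if key not in best or scores[i] > scores[best[key]]:
            best[key] = i
    return sorted(best.values())
```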