SMYRF: Efficient Attention using Asymmetric Clustering
- URL: http://arxiv.org/abs/2010.05315v1
- Date: Sun, 11 Oct 2020 18:49:17 GMT
- Title: SMYRF: Efficient Attention using Asymmetric Clustering
- Authors: Giannis Daras, Nikita Kitaev, Augustus Odena, Alexandros G. Dimakis
- Abstract summary: We propose a novel type of balanced clustering algorithm to approximate attention.
SMYRF can be used as a drop-in replacement for dense attention layers without any retraining.
We show that SMYRF can be used interchangeably with dense attention before and after training.
- Score: 103.47647577048782
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel type of balanced clustering algorithm to approximate
attention. Attention complexity is reduced from $O(N^2)$ to $O(N \log N)$,
where $N$ is the sequence length. Our algorithm, SMYRF, uses Locality Sensitive
Hashing (LSH) in a novel way by defining new Asymmetric transformations and an
adaptive scheme that produces balanced clusters. The biggest advantage of SMYRF
is that it can be used as a drop-in replacement for dense attention layers
without any retraining. In contrast, prior fast attention methods impose
constraints (e.g., queries and keys share the same vector representations) and
require re-training from scratch. We apply our method to pre-trained
state-of-the-art Natural Language Processing and Computer Vision models and we
report significant memory and speed benefits. Notably, SMYRF-BERT outperforms
(slightly) BERT on GLUE, while using $50\%$ less memory. We also show that
SMYRF can be used interchangeably with dense attention before and after
training. Finally, we use SMYRF to train GANs with attention in high
resolutions. Using a single TPU, we were able to scale attention to 128x128=16k
and 256x256=65k tokens on BigGAN on CelebA-HQ.
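The clustered-attention idea in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the single random projection, the ALSH-style constant-norm padding, and the sort-and-chunk balancing scheme are simplifying assumptions (SMYRF uses multiple hash rounds and its own asymmetric transformations).

```python
import numpy as np

def dense_attention(Q, K, V):
    # Standard softmax attention: O(N^2) in the sequence length N.
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(-1, keepdims=True))
    return (P / P.sum(-1, keepdims=True)) @ V

def smyrf_attention(Q, K, V, cluster_size, seed=0):
    N, d = Q.shape
    # Pad keys to constant norm (ALSH-style; an assumption standing in
    # for SMYRF's asymmetric transformations), so LSH on the padded
    # vectors approximates nearest-key search under the inner product.
    mk = np.linalg.norm(K, axis=-1).max()
    pad = np.sqrt(np.maximum(mk**2 - (K**2).sum(-1, keepdims=True), 0.0))
    Ka = np.concatenate([K, pad], axis=-1)
    Qa = np.concatenate([Q, np.zeros((N, 1))], axis=-1)
    # One shared random projection; sorting by it and chunking into
    # equal-size groups yields balanced clusters by construction.
    r = np.random.default_rng(seed).standard_normal(d + 1)
    q_order = np.argsort(Qa @ r)
    k_order = np.argsort(Ka @ r)
    out = np.empty_like(V)
    for s in range(0, N, cluster_size):
        qi = q_order[s:s + cluster_size]
        ki = k_order[s:s + cluster_size]
        # Each query attends only within its cluster: O(N * cluster_size)
        # total work instead of O(N^2).
        out[qi] = dense_attention(Q[qi], K[ki], V[ki])
    return out
```

When `cluster_size` covers the whole sequence there is a single cluster and the result matches dense attention exactly; shrinking the clusters trades accuracy for memory and speed, which is the knob the paper exploits.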
Related papers
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- Reinforcement Learning with Fast and Forgetful Memory [10.087126455388276]
We introduce Fast and Forgetful Memory, an algorithm-agnostic memory model designed specifically for Reinforcement Learning (RL)
Our approach constrains the model search space via strong structural priors inspired by computational psychology.
Fast and Forgetful Memory exhibits training speeds two orders of magnitude faster than recurrent neural networks (RNNs)
arXiv Detail & Related papers (2023-10-06T09:56:26Z)
- Towards Zero Memory Footprint Spiking Neural Network Training [7.4331790419913455]
Spiking Neural Networks (SNNs) process information using discrete-time events known as spikes rather than continuous values.
In this paper, we introduce an innovative framework characterized by a remarkably low memory footprint.
Our design is able to achieve a $\mathbf{58.65\times}$ reduction in memory usage compared to the current SNN node.
arXiv Detail & Related papers (2023-08-16T19:49:24Z)
- SKI to go Faster: Accelerating Toeplitz Neural Networks via Asymmetric Kernels [69.47358238222586]
Toeplitz Neural Networks (TNNs) are a recent sequence model with impressive results.
We aim to reduce the O(n) computational complexity and the O(n) relative positional encoder (RPE) multi-layer perceptron (MLP) and decay-bias calls.
For bidirectional models, this motivates a sparse plus low-rank Toeplitz matrix decomposition.
arXiv Detail & Related papers (2023-05-15T21:25:35Z)
- FsaNet: Frequency Self-attention for Semantic Segmentation [5.495952636982018]
We propose a new self-attention mechanism with highly reduced computational complexity, up to a linear rate.
By ablation study, we show that low frequency self-attention can achieve very close or better performance relative to full frequency.
We show that frequency self-attention requires $87.29\% \sim 90.04\%$ less memory, $96.13\% \sim 98.07\%$ fewer FLOPs, and $97.56\% \sim 98.18\%$ less run time.
arXiv Detail & Related papers (2022-11-28T17:49:46Z)
- Training Your Sparse Neural Network Better with Any Mask [106.134361318518]
Pruning large neural networks to create high-quality, independently trainable sparse masks is desirable.
In this paper we demonstrate an alternative opportunity: one can customize the sparse training techniques to deviate from the default dense network training protocols.
Our new sparse training recipe is generally applicable to improving training from scratch with various sparse masks.
arXiv Detail & Related papers (2022-06-26T00:37:33Z)
- Training Energy-Efficient Deep Spiking Neural Networks with Single-Spike Hybrid Input Encoding [5.725845886457027]
Spiking Neural Networks (SNNs) provide higher computational efficiency in event driven neuromorphic hardware.
SNNs suffer from high inference latency, resulting from inefficient input encoding and training techniques.
This paper presents a training framework for low-latency energy-efficient SNNs.
arXiv Detail & Related papers (2021-07-26T06:16:40Z)
- Adaptive Filters and Aggregator Fusion for Efficient Graph Convolutions [11.769185588579488]
We present state-of-the-art performance with lower memory consumption and latency, along with characteristics suited to accelerator implementation.
Our proposal uses memory proportional to the number of vertices in the graph, in contrast to competing methods which require memory proportional to the number of edges.
We propose aggregator fusion, a technique to enable GNNs to significantly boost their representational power, with only a small increase in latency of 19% over standard sparse matrix multiplication.
arXiv Detail & Related papers (2021-04-03T20:54:36Z)
- You Only Spike Once: Improving Energy-Efficient Neuromorphic Inference to ANN-Level Accuracy [51.861168222799186]
Spiking Neural Networks (SNNs) are a type of neuromorphic, or brain-inspired network.
SNNs are sparse, accessing very few weights, and typically only use addition operations instead of the more power-intensive multiply-and-accumulate operations.
In this work, we aim to overcome the limitations of TTFS-encoded neuromorphic systems.
arXiv Detail & Related papers (2020-06-03T15:55:53Z)
- Hashing-based Non-Maximum Suppression for Crowded Object Detection [63.761451382081844]
We propose an algorithm, named hashing-based non-maximum suppression (HNMS) to efficiently suppress the non-maximum boxes for object detection.
For two-stage detectors, we replace NMS in region proposal network with HNMS, and observe significant speed-up with comparable accuracy.
Experiments are conducted on CARPK, SKU-110K, CrowdHuman datasets to demonstrate the efficiency and effectiveness of HNMS.
arXiv Detail & Related papers (2020-05-22T23:45:59Z)
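The HNMS entry above replaces pairwise IoU suppression with a hash lookup. A rough sketch of that idea follows; the bin widths, the center/log-size hash, and the single hash round are all assumptions for illustration, and the paper's actual construction differs.

```python
import numpy as np

def hashing_nms(boxes, scores, center_bin=0.5, size_bin=0.4):
    """HNMS-style suppression sketch: quantize each box's center
    (relative to its size) and log-size into a discrete cell, then keep
    only the highest-scoring box per cell.  Near-duplicate boxes land in
    the same cell, so this runs in O(N) instead of NMS's O(N^2) IoU tests.

    boxes:  (N, 4) array of [x1, y1, x2, y2]; scores: (N,) array.
    Returns the kept indices, sorted."""
    x1, y1, x2, y2 = boxes.T
    w, h = x2 - x1, y2 - y1
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Hash code: size-normalized center bins + log-size bins (assumed form).
    cells = np.stack([np.floor(cx / (w * center_bin)),
                      np.floor(cy / (h * center_bin)),
                      np.floor(np.log(w) / size_bin),
                      np.floor(np.log(h) / size_bin)], axis=1).astype(int)
    best = {}
    for i, key in enumerate(map(tuple, cells)):
        if key not in best or scores[i] > scores[best[key]]:
            best[key] = i   # keep only the top-scoring box in this cell
    return sorted(best.values())
```

Two heavily overlapping boxes hash to the same cell and only the higher-scoring one survives, while a distant box occupies its own cell and is kept.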
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.