Recasting Self-Attention with Holographic Reduced Representations
- URL: http://arxiv.org/abs/2305.19534v1
- Date: Wed, 31 May 2023 03:42:38 GMT
- Title: Recasting Self-Attention with Holographic Reduced Representations
- Authors: Mohammad Mahmudul Alam, Edward Raff, Stella Biderman, Tim Oates, James
Holt
- Abstract summary: Motivated by problems in malware detection, we re-cast self-attention using the neuro-symbolic approach of Holographic Reduced Representations (HRR).
We obtain several benefits including $\mathcal{O}(T H \log H)$ time complexity, $\mathcal{O}(T H)$ space complexity, and convergence in $10\times$ fewer epochs.
Our Hrrformer achieves near state-of-the-art accuracy on LRA benchmarks, and we are able to learn with just a single layer.
- Score: 31.89878931813593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, self-attention has become the dominant paradigm for sequence
modeling in a variety of domains. However, in domains with very long sequence
lengths the $\mathcal{O}(T^2)$ memory and $\mathcal{O}(T^2 H)$ compute costs
can make using transformers infeasible. Motivated by problems in malware
detection, where sequence lengths of $T \geq 100,000$ are a roadblock to deep
learning, we re-cast self-attention using the neuro-symbolic approach of
Holographic Reduced Representations (HRR). In doing so we perform the same
high-level strategy of the standard self-attention: a set of queries matching
against a set of keys, and returning a weighted response of the values for each
key. Implemented as a ``Hrrformer'' we obtain several benefits including
$\mathcal{O}(T H \log H)$ time complexity, $\mathcal{O}(T H)$ space complexity,
and convergence in $10\times$ fewer epochs. Nevertheless, the Hrrformer
achieves near state-of-the-art accuracy on LRA benchmarks and we are able to
learn with just a single layer. Combined, these benefits make our Hrrformer the
first viable Transformer for such long malware classification sequences and up
to $280\times$ faster to train on the Long Range Arena benchmark. Code is
available at
\url{https://github.com/NeuromorphicComputationResearchProgram/Hrrformer}
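For intuition, here is a minimal NumPy sketch of the HRR operations the abstract refers to: binding via FFT-based circular convolution, an approximate inverse for unbinding, and a query/key/value retrieval that superposes all key-value bindings into one trace and unbinds each query's response from it. This is an illustrative assumption about the mechanism, not the authors' Hrrformer attention (see the linked repository for that); the function names `hrr_bind`, `hrr_inverse`, and `hrr_attention` are hypothetical.

```python
import numpy as np

def hrr_bind(a, b):
    # HRR binding: circular convolution, computed via FFT in O(H log H).
    return np.fft.irfft(np.fft.rfft(a) * np.fft.rfft(b), n=a.shape[-1])

def hrr_inverse(a):
    # Approximate HRR inverse (involution): keep element 0, reverse the rest.
    return np.concatenate([a[..., :1], a[..., :0:-1]], axis=-1)

def hrr_attention(Q, K, V):
    # Q, K, V: (T, H) arrays for one head.
    # Superpose all T key-value bindings into a single H-dimensional trace.
    memory = hrr_bind(K, V).sum(axis=0)                # O(T H log H) time
    # Unbind each query's (approximate, noisy) response from the trace.
    return hrr_bind(hrr_inverse(Q), memory[None, :])   # shape (T, H)

# Toy usage: 8 tokens, a 64-dimensional head.
T, H = 8, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, H)) / np.sqrt(H) for _ in range(3))
out = hrr_attention(Q, K, V)
```

Because the $T$ key-value bindings collapse into a single superposition vector, no $T \times T$ attention matrix is ever materialized, which is where the $\mathcal{O}(T H)$ space and $\mathcal{O}(T H \log H)$ time claims come from.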
Related papers
- A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.
HiP reduces the time complexity of the attention mechanism to $O(T \log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.
We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
arXiv Detail & Related papers (2024-06-14T08:32:45Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
Polynomials [50.90125395570797]
We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $\mathbb{R}^d$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/\epsilon)^{O(k)}$, where $\epsilon > 0$ is the target accuracy.
arXiv Detail & Related papers (2023-07-24T14:37:22Z) - Replicability in Reinforcement Learning [46.89386344741442]
We focus on the fundamental setting of discounted MDPs with access to a generative model.
Inspired by Impagliazzo et al. [2022], we say that an RL algorithm is replicable if, with high probability, it outputs the exact same policy after two executions.
arXiv Detail & Related papers (2023-05-31T05:16:23Z) - Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(\log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
arXiv Detail & Related papers (2022-10-19T17:45:48Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computationally efficient algorithm that achieves a near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ over $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - On Submodular Contextual Bandits [92.45432756301231]
We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy.
arXiv Detail & Related papers (2021-12-03T21:42:33Z) - Model Selection with Near Optimal Rates for Reinforcement Learning with
General Model Classes [27.361399036211694]
We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem.
In the model selection framework, instead of $\mathcal{P}^*$, we are given $M$ nested families of transition kernels.
We show that \texttt{ARL-GEN} obtains a regret of $\tilde{\mathcal{O}}(d_{\mathcal{E}}^* H^2 + \sqrt{d_{\mathcal{E}}^* \mathbb{M}^* H^2 T})$.
arXiv Detail & Related papers (2021-07-13T05:00:38Z) - Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and
Costs [45.87981728307819]
The ability to compare and align related datasets living in heterogeneous spaces plays an increasingly important role in machine learning.
The Gromov-Wasserstein (GW) formalism can help tackle this problem.
arXiv Detail & Related papers (2021-06-02T12:50:56Z) - Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
We prove that HSDMPG can attain an $\mathcal{O}\big(1/\sqrt{tn}\big)$ generalization bound, which is on the order of the excess error of a learning model.
arXiv Detail & Related papers (2020-09-18T02:18:44Z)