Recasting Self-Attention with Holographic Reduced Representations
        - URL: http://arxiv.org/abs/2305.19534v1
- Date: Wed, 31 May 2023 03:42:38 GMT
- Title: Recasting Self-Attention with Holographic Reduced Representations
- Authors: Mohammad Mahmudul Alam, Edward Raff, Stella Biderman, Tim Oates, James
  Holt
- Abstract summary: Motivated by problems in malware detection, we re-cast self-attention using the neuro-symbolic approach of Holographic Reduced Representations (HRR)
We obtain several benefits including $mathcalO(T H log H)$ time complexity, $mathcalO(T H)$ space complexity, and convergence in $10times$ fewer epochs.
Our Hrrformer achieves near state-of-the-art accuracy on LRA benchmarks and we are able to learn with just a single layer.
- Score: 31.89878931813593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   In recent years, self-attention has become the dominant paradigm for sequence
modeling in a variety of domains. However, in domains with very long sequence
lengths the $\mathcal{O}(T^2)$ memory and $\mathcal{O}(T^2 H)$ compute costs
can make using transformers infeasible. Motivated by problems in malware
detection, where sequence lengths of $T \geq 100,000$ are a roadblock to deep
learning, we re-cast self-attention using the neuro-symbolic approach of
Holographic Reduced Representations (HRR). In doing so we perform the same
high-level strategy of the standard self-attention: a set of queries matching
against a set of keys, and returning a weighted response of the values for each
key. Implemented as a ``Hrrformer'' we obtain several benefits including
$\mathcal{O}(T H \log H)$ time complexity, $\mathcal{O}(T H)$ space complexity,
and convergence in $10\times$ fewer epochs. Nevertheless, the Hrrformer
achieves near state-of-the-art accuracy on LRA benchmarks and we are able to
learn with just a single layer. Combined, these benefits make our Hrrformer the
first viable Transformer for such long malware classification sequences and up
to $280\times$ faster to train on the Long Range Arena benchmark. Code is
available at
\url{https://github.com/NeuromorphicComputationResearchProgram/Hrrformer}
 
      
        Related papers
        - ZETA: Leveraging Z-order Curves for Efficient Top-k Attention [22.90397380324185]
 We propose ZETA to enable parallel querying of past tokens for entire sequences.
ZETA matches the performance of standard attention on the synthetic textscMulti-Query Associative Recall task.
 arXiv  Detail & Related papers  (2025-01-24T15:33:05Z)
- Bridging the Gap Between Approximation and Learning via Optimal   Approximation by ReLU MLPs of Maximal Regularity [8.28720658988688]
 We show that any $(L,alpha)$-H"older function can be approximated to a uniform $mathcalO(dnd/alpha)$, of width $mathcalO(dnd/alpha)$, depth $mathcalO(log(d))$, with $mathcalO(dnd/alpha)$ and $mathcalO(dnd/alpha)$, with $mathcal
 arXiv  Detail & Related papers  (2024-09-18T22:05:07Z)
- A Training-free Sub-quadratic Cost Transformer Model Serving Framework   With Hierarchically Pruned Attention [43.211427581302715]
 We propose Hierarchically Pruned Attention (HiP) to increase context length in large language models.
HiP reduces the time complexity of the attention mechanism to $O(T log T)$ and the space complexity to $O(T)$, where $T$ is the sequence length.
We show that HiP significantly reduces both prefill and decoding latencies, as well as memory usage, while maintaining high-quality generation with minimal degradation.
 arXiv  Detail & Related papers  (2024-06-14T08:32:45Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement   Learning in Continuous-Space MDPs [56.237917407785545]
 We consider the problem of learning an $varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
 arXiv  Detail & Related papers  (2024-05-10T09:58:47Z)
- Efficiently Learning One-Hidden-Layer ReLU Networks via Schur
  Polynomials [50.90125395570797]
 We study the problem of PAC learning a linear combination of $k$ ReLU activations under the standard Gaussian distribution on $mathbbRd$ with respect to the square loss.
Our main result is an efficient algorithm for this learning task with sample and computational complexity $(dk/epsilon)O(k)$, whereepsilon>0$ is the target accuracy.
 arXiv  Detail & Related papers  (2023-07-24T14:37:22Z)
- Replicability in Reinforcement Learning [46.89386344741442]
 We focus on the fundamental setting of discounted MDPs with access to a generative model.
Inspired by Impagliazzo et al. [2022], we say that an RL algorithm is replicable if, with high probability, it outputs the exact same policy after two executions.
 arXiv  Detail & Related papers  (2023-05-31T05:16:23Z)
- Transformers Learn Shortcuts to Automata [52.015990420075944]
 We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
 arXiv  Detail & Related papers  (2022-10-19T17:45:48Z)
- Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
 We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with constraint on the number of batches.
We design a computational efficient algorithm to achieve near-optimal regret of $tildeO(sqrtSAH3Kln (1/delta))$tildeO(cdot) hides logarithmic terms of $(S,A,H,K)$ in $K$ episodes.
Our technical contribution are two-fold: 1) a near-optimal design scheme to explore
 arXiv  Detail & Related papers  (2022-10-15T09:22:22Z)
- On Submodular Contextual Bandits [92.45432756301231]
 We consider the problem of contextual bandits where actions are subsets of a ground set and mean rewards are modeled by an unknown monotone submodular function.
We show that our algorithm efficiently randomizes around local optima of estimated functions according to the Inverse Gap Weighting strategy.
 arXiv  Detail & Related papers  (2021-12-03T21:42:33Z)
- Model Selection with Near Optimal Rates for Reinforcement Learning with
  General Model Classes [27.361399036211694]
 We address the problem of model selection for the finite horizon episodic Reinforcement Learning (RL) problem.
In the model selection framework, instead of $mathcalP*$, we are given $M$ nested families of transition kernels.
We show that textttARL-GEN obtains a regret of $TildemathcalO(d_mathcalE*H2+sqrtd_mathcalE* mathbbM* H2 T)$
 arXiv  Detail & Related papers  (2021-07-13T05:00:38Z)
- Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and
  Costs [45.87981728307819]
 The ability to compare and align related datasets living in heterogeneous spaces plays an increasingly important role in machine learning.
The Gromov-Wasserstein (GW) formalism can help tackle this problem.
 arXiv  Detail & Related papers  (2021-06-02T12:50:56Z)
- Hybrid Stochastic-Deterministic Minibatch Proximal Gradient:
  Less-Than-Single-Pass Optimization with Nearly Optimal Generalization [83.80460802169999]
 We show that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model.
For loss factors, we prove that HSDMPG can attain an $mathcalObig (1/sttnbig)$ which is at the order of excess error on a learning model.
 arXiv  Detail & Related papers  (2020-09-18T02:18:44Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of
  Sparse Transformers [71.31712741938837]
 We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
 arXiv  Detail & Related papers  (2020-06-08T18:30:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     
           This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.