Simplifying and Understanding State Space Models with Diagonal Linear RNNs
- URL: http://arxiv.org/abs/2212.00768v3
- Date: Tue, 14 Nov 2023 16:52:48 GMT
- Title: Simplifying and Understanding State Space Models with Diagonal Linear RNNs
- Authors: Ankit Gupta, Harsh Mehta, Jonathan Berant
- Abstract summary: This work disposes of the discretization step and proposes a model based on vanilla Diagonal Linear RNNs ($\mathrm{DLR}$).
We empirically show that, despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as previously-proposed SSMs.
We also characterize the expressivity of SSMs and attention-based models via a suite of $13$ synthetic sequence-to-sequence tasks.
- Score: 56.33053691749856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequence models based on linear state spaces (SSMs) have recently emerged as
a promising choice of architecture for modeling long range dependencies across
various modalities. However, they invariably rely on discretization of a
continuous state space, which complicates their presentation and understanding.
In this work, we dispose of the discretization step, and propose a model based
on vanilla Diagonal Linear RNNs ($\mathrm{DLR}$). We empirically show that,
despite being conceptually much simpler, $\mathrm{DLR}$ is as performant as
previously-proposed SSMs on a variety of tasks and benchmarks including Long
Range Arena and raw speech classification. Moreover, we characterize the
expressivity of SSMs (including $\mathrm{DLR}$) and attention-based models via
a suite of $13$ synthetic sequence-to-sequence tasks involving interactions
over tens of thousands of tokens, ranging from simple operations, such as
shifting an input sequence, to detecting co-dependent visual features over long
spatial ranges in flattened images. We find that while SSMs report near-perfect
performance on tasks that can be modeled via $\textit{few}$ convolutional
kernels, they struggle on tasks requiring $\textit{many}$ such kernels and
especially when the desired sequence manipulation is
$\textit{context-dependent}$. Despite these limitations, $\mathrm{DLR}$ reaches
high performance on two higher-order reasoning tasks $\mathrm{ListOpsSubTrees}$
and $\mathrm{PathfinderSegmentation}\text{-}\mathrm{256}$ with input lengths
$8K$ and $65K$ respectively, and gives encouraging performance on
$\mathrm{PathfinderSegmentation}\text{-}\mathrm{512}$ with input length $262K$
for which attention is not a viable choice.
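As a rough illustration of the model class, the sketch below implements a single-channel diagonal linear RNN of the kind the abstract describes: a learned complex diagonal recurrence applied directly to the input, with no continuous-time discretization step, followed by a check that the layer is equivalent to a convolution. The parameter names, initialization, and single-input/single-output setup are assumptions made for this example, not the authors' exact parameterization.

```python
# A minimal NumPy sketch of a single-channel diagonal linear RNN, in the spirit
# of the DLR layer described in the abstract.  Names, shapes, and initialization
# are illustrative assumptions, not the paper's exact parameterization.
import numpy as np

def dlr_layer(u, lam, w):
    """Run the recurrence x_k = lam * x_{k-1} + u_k, y_k = Re(<w, x_k>)."""
    x = np.zeros(lam.shape[0], dtype=np.complex128)   # diagonal (complex) state
    y = np.empty(u.shape[0])
    for k, u_k in enumerate(u):
        x = lam * x + u_k          # elementwise update; no discretization step
        y[k] = (w @ x).real        # project the complex state to a real output
    return y

rng = np.random.default_rng(0)
lam = np.exp(-0.01 + 1j * rng.uniform(0.0, np.pi, 16))   # 16 stable complex modes
w = rng.standard_normal(16) + 1j * rng.standard_normal(16)
u = rng.standard_normal(1024)

# Because the recurrence is linear, the same layer is a causal convolution with
# kernel K_k = Re(sum_n w_n * lam_n**k).
L = u.shape[0]
K = (np.vander(lam, L, increasing=True).T @ w).real      # (L,) kernel
y_conv = np.convolve(u, K)[:L]                           # truncate to causal length L
print(np.allclose(dlr_layer(u, lam, w), y_conv))         # -> True
```

This convolution view is what the abstract's distinction rests on: a task solvable by a few fixed kernels maps naturally onto a few such modes, whereas a context-dependent sequence manipulation cannot be expressed by any single fixed kernel.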
Related papers
- Learning sum of diverse features: computational hardness and efficient gradient-based training for ridge combinations [40.77319247558742]
We study the computational complexity of learning a target function $f_*:\mathbb{R}^d\to\mathbb{R}$ with additive structure.
We prove that a large subset of $f_*$ can be efficiently learned by gradient training of a two-layer neural network.
arXiv Detail & Related papers (2024-06-17T17:59:17Z)
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z)
- Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z)
- Reward-Mixing MDPs with a Few Latent Contexts are Learnable [75.17357040707347]
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs).
Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model.
arXiv Detail & Related papers (2022-10-05T22:52:00Z)
- Learning a Latent Simplex in Input-Sparsity Time [58.30321592603066]
We consider the problem of learning a latent $k$-vertex simplex $K\subset\mathbb{R}^{d\times n}$, given access to $A\in\mathbb{R}^{d\times n}$.
We show that the dependence on $k$ in the running time is unnecessary given a natural assumption about the mass of the top $k$ singular values of $A$.
arXiv Detail & Related papers (2021-05-17T16:40:48Z)
- Convergence of Sparse Variational Inference in Gaussian Processes Regression [29.636483122130027]
We show that a method with an overall computational cost of $\mathcal{O}(N(\log N)^{2D}(\log\log N)^2)$ can be used to perform inference.
arXiv Detail & Related papers (2020-08-01T19:23:34Z)
- $Q$-learning with Logarithmic Regret [60.24952657636464]
We prove that an optimistic $Q$-learning algorithm enjoys a $\mathcal{O}\left(\frac{SA\cdot \mathrm{poly}(H)}{\Delta_{\min}}\log(SAT)\right)$ cumulative regret bound, where $S$ is the number of states, $A$ is the number of actions, $H$ is the planning horizon, $T$ is the total number of steps, and $\Delta_{\min}$ is the minimum sub-optimality gap.
arXiv Detail & Related papers (2020-06-16T13:01:33Z)