How Well Can a Long Sequence Model Model Long Sequences? Comparing Architectural Inductive Biases on Long-Context Abilities
- URL: http://arxiv.org/abs/2407.08112v1
- Date: Thu, 11 Jul 2024 01:08:39 GMT
- Title: How Well Can a Long Sequence Model Model Long Sequences? Comparing Architectural Inductive Biases on Long-Context Abilities
- Authors: Jerry Huang
- Abstract summary: Recent advances in system engineering have enabled the scaling up of models that are purported to support extended context lengths.
We show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed.
- Score: 0.6798775532273751
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens numerous downstream use cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, both in system engineering as well as model design, have enabled the scaling up of models that are purported to support extended context lengths. In particular, the state-space and linear recurrent neural network families of models can hypothetically extend to infinite sequence length. However, is this too good to be true? We conduct an evaluation to show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases have inconsistent extrapolation capabilities, highlighting the need to further study such paradigms and investigate why long-context models seemingly fail to behave as one might expect.
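The practical gap the abstract describes comes down to an evaluation protocol: probe a model at sequence lengths well beyond those seen in training and check whether task accuracy holds up. Below is a minimal, hypothetical sketch of such a harness built around a synthetic key-value recall probe; the task, the `make_recall_example` and `evaluate` helpers, and the `model_predict` wrapper are illustrative assumptions, not the paper's actual benchmark.

```python
# Hypothetical sketch: probing length extrapolation with a synthetic
# key-value recall task (not the paper's exact benchmark).
import random

def make_recall_example(context_len, vocab=tuple("abcdefgh"), seed=None):
    """Build a context of key/value pairs plus one query key.

    Returns (context_tokens, query_key, answer_value).
    """
    rng = random.Random(seed)
    keys = [f"k{i}" for i in range(context_len)]
    values = [rng.choice(vocab) for _ in range(context_len)]
    context = [tok for k, v in zip(keys, values) for tok in (k, v)]
    probe = rng.randrange(context_len)
    return context, keys[probe], values[probe]

def evaluate(model_predict, lengths, trials=100):
    """Accuracy of `model_predict(context, query)` at each context length."""
    results = {}
    for n in lengths:
        correct = 0
        for t in range(trials):
            ctx, query, answer = make_recall_example(n, seed=t)
            correct += (model_predict(ctx, query) == answer)
        results[n] = correct / trials
    return results

if __name__ == "__main__":
    # `model_predict` is a stand-in for any long-sequence model wrapper;
    # here it is a trivial baseline that always guesses the same value.
    baseline = lambda ctx, query: "a"
    # Lengths past the (hypothetical) training context expose extrapolation gaps.
    print(evaluate(baseline, lengths=[128, 1024, 8192]))
```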
Related papers
- Understanding and Improving Length Generalization in Recurrent Models [16.642157805072042]
Recurrent models can process arbitrarily long sequences, but their performance sometimes drops considerably beyond their training context lengths. We show that models fail to length-generalize when, during training, they are exposed to only a limited subset of the distribution of all attainable states. We investigate simple training interventions that aim to increase the coverage of the states that the model is trained on.
arXiv Detail & Related papers (2025-07-03T16:45:50Z) - Sequential-Parallel Duality in Prefix Scannable Models [68.39855814099997]
Recent developments have given rise to various models, such as Gated Linear Attention (GLA) and Mamba. This raises a natural question: can we characterize the full class of neural sequence models that support near-constant-time parallel evaluation and linear-time, constant-space sequential inference? (A toy sketch of this sequential/parallel duality appears after this list.)
arXiv Detail & Related papers (2025-06-12T17:32:02Z) - StateSpaceDiffuser: Bringing Long Context to Diffusion World Models [53.05314852577144]
We introduce StateSpaceDiffuser, which enables a diffusion model to perform long-context tasks by integrating features from a state-space model. This design restores long-term memory while preserving the high-fidelity synthesis of diffusion models. Experiments show that StateSpaceDiffuser significantly outperforms a strong diffusion-only baseline.
arXiv Detail & Related papers (2025-05-28T11:27:54Z) - MANTA: Diffusion Mamba for Efficient and Effective Stochastic Long-Term Dense Anticipation [17.4088244981231]
Long-term dense action anticipation is challenging since it requires predicting actions and their durations several minutes into the future.
We propose a novel MANTA (MAmba for ANTicipation) network to enable effective long-term temporal modelling.
Our approach achieves state-of-the-art results on three datasets - Breakfast, 50Salads, and Assembly101.
arXiv Detail & Related papers (2025-01-15T14:46:44Z) - Oscillatory State-Space Models [61.923849241099184]
We propose Linear Oscillatory State-Space models (LinOSS) for efficiently learning on long sequences.
A stable discretization, integrated over time using fast associative parallel scans, yields the proposed state-space model.
We show that LinOSS is universal, i.e., it can approximate any continuous and causal operator mapping between time-varying functions.
arXiv Detail & Related papers (2024-10-04T22:00:13Z) - State space models, emergence, and ergodicity: How many parameters are needed for stable predictions? [28.65576793023554]
We show that tasks exhibiting substantial long-range correlation require a certain critical number of parameters.
We also investigate the role of the learner's parametrization and consider a simple version of a linear dynamical system with hidden state.
arXiv Detail & Related papers (2024-09-20T11:39:37Z) - On the Resurgence of Recurrent Models for Long Sequences -- Survey and Research Opportunities in the Transformer Era [59.279784235147254]
This survey is aimed at providing an overview of these trends framed under the unifying umbrella of Recurrence.
It emphasizes novel research opportunities that become prominent when abandoning the idea of processing long sequences.
arXiv Detail & Related papers (2024-02-12T23:55:55Z) - Exposing Attention Glitches with Flip-Flop Language Modeling [55.0688535574859]
This work identifies and analyzes the phenomenon of attention glitches in large language models.
We introduce flip-flop language modeling (FFLM), a family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models.
We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques.
arXiv Detail & Related papers (2023-06-01T17:44:35Z) - Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings [61.04460792203266]
We introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to bridge the logical gaps within sequential data.
Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks.
arXiv Detail & Related papers (2023-05-03T17:58:29Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - Simple Local Attentions Remain Competitive for Long-Context Tasks [32.785459927278616]
Many NLP tasks require processing long contexts beyond the length limit of pretrained models.
In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed.
For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks.
arXiv Detail & Related papers (2021-12-14T07:37:58Z) - TimeSHAP: Explaining Recurrent Models through Sequence Perturbations [3.1498833540989413]
Recurrent neural networks are a standard building block in numerous machine learning domains.
The complex decision-making in these models is seen as a black-box, creating a tension between accuracy and interpretability.
In this work, we contribute to filling these gaps by presenting TimeSHAP, a model-agnostic recurrent explainer.
arXiv Detail & Related papers (2020-11-30T19:48:57Z) - Causal Expectation-Maximisation [70.45873402967297]
We show that causal inference is NP-hard even in models characterised by polytree-shaped graphs.
We introduce the causal EM algorithm to reconstruct the uncertainty about the latent variables from data about categorical manifest variables.
We argue that there appears to be an unnoticed limitation to the trending idea that counterfactual bounds can often be computed without knowledge of the structural equations.
arXiv Detail & Related papers (2020-11-04T10:25:13Z) - Neural Additive Vector Autoregression Models for Causal Discovery in Time Series [1.160208922584163]
We propose a neural approach to causal structure learning that can discover nonlinear relationships.
We train deep neural networks that extract the (additive) Granger causal influences from the time evolution in time series.
The method achieves state-of-the-art results on various benchmark data sets for causal discovery.
arXiv Detail & Related papers (2020-10-19T12:44:25Z)
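Several of the entries above (Sequential-Parallel Duality in Prefix Scannable Models, Oscillatory State-Space Models, and the Mamba/GLA line of work) rest on the same mechanism: a linear recurrence h_t = a_t * h_{t-1} + b_t can be evaluated either step by step in constant space or as an associative prefix scan with logarithmic parallel depth. The sketch below is a scalar toy in plain numpy, not any of those models' actual parameterizations, and the function names are illustrative; it checks that the two evaluation modes produce identical outputs.

```python
# Toy illustration (not any specific published model): the linear recurrence
#   h_t = a_t * h_{t-1} + b_t
# computed two ways, sequentially and via an associative prefix scan.
import numpy as np

def sequential(a, b, h0=0.0):
    """Reference O(T) loop: constant-space sequential inference."""
    h, out = h0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return np.array(out)

def parallel_scan(a, b, h0=0.0):
    """Same recurrence as an inclusive scan over affine maps h -> a*h + b.

    Composing affine maps is associative, so prefixes can be combined in
    O(log T) steps (a Hillis-Steele scan, emulated here with vectorized
    numpy updates; on parallel hardware each step runs concurrently).
    """
    A = np.asarray(a, dtype=float).copy()
    B = np.asarray(b, dtype=float).copy()
    T, shift = len(A), 1
    while shift < T:
        # Compose each prefix with the prefix ending `shift` steps earlier:
        # (A_t, B_t) after (A_s, B_s)  ->  (A_t*A_s, A_t*B_s + B_t)
        B[shift:] = A[shift:] * B[:-shift] + B[shift:]
        A[shift:] = A[shift:] * A[:-shift]
        shift *= 2
    return A * h0 + B  # apply each composed prefix map to the initial state

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.uniform(0.5, 1.0, size=1024)
    b = rng.standard_normal(1024)
    assert np.allclose(sequential(a, b), parallel_scan(a, b))
    print("sequential loop and prefix scan agree")
```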