How Well Can a Long Sequence Model Model Long Sequences? Comparing Architechtural Inductive Biases on Long-Context Abilities
- URL: http://arxiv.org/abs/2407.08112v1
- Date: Thu, 11 Jul 2024 01:08:39 GMT
- Title: How Well Can a Long Sequence Model Model Long Sequences? Comparing Architechtural Inductive Biases on Long-Context Abilities
- Authors: Jerry Huang,
- Abstract summary: Recent advances in system engineering have enabled the scaling up of model that are purported to support extended context length.
We show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed.
- Score: 0.6798775532273751
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens numerous down-stream use-cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, both in system engineering as well as model design, have enabled the scaling up of model that are purported to support extended context length. In particular, the state-space and linear recurrent neural network families of models hypothetically can entend to infinite sequence lenth. However, is this too good to be true? We conduct an evaluation to show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases have inconsistent extrapolation capabilities, highlighting the need to further study such paradigms and investigate why long-context models seemingly fail to behave as one might expect.
Related papers
- On the Resurgence of Recurrent Models for Long Sequences -- Survey and
Research Opportunities in the Transformer Era [59.279784235147254]
This survey is aimed at providing an overview of these trends framed under the unifying umbrella of Recurrence.
It emphasizes novel research opportunities that become prominent when abandoning the idea of processing long sequences.
arXiv Detail & Related papers (2024-02-12T23:55:55Z) - No Length Left Behind: Enhancing Knowledge Tracing for Modeling
Sequences of Excessive or Insufficient Lengths [3.2687390531088414]
Knowledge tracing aims to predict students' responses to practices based on their historical question-answering behaviors.
As sequences get longer, computational costs will increase exponentially.
We propose a model called Sequence-Flexible Knowledge Tracing (SFKT)
arXiv Detail & Related papers (2023-08-07T11:30:58Z) - Exposing Attention Glitches with Flip-Flop Language Modeling [55.0688535574859]
This work identifies and analyzes the phenomenon of attention glitches in large language models.
We introduce flip-flop language modeling (FFLM), a family of synthetic benchmarks designed to probe the extrapolative behavior of neural language models.
We find that Transformer FFLMs suffer from a long tail of sporadic reasoning errors, some of which we can eliminate using various regularization techniques.
arXiv Detail & Related papers (2023-06-01T17:44:35Z) - Visual Chain of Thought: Bridging Logical Gaps with Multimodal
Infillings [61.04460792203266]
We introduce VCoT, a novel method that leverages chain-of-thought prompting with vision-language grounding to bridge the logical gaps within sequential data.
Our method uses visual guidance to generate synthetic multimodal infillings that add consistent and novel information to reduce the logical gaps for downstream tasks.
arXiv Detail & Related papers (2023-05-03T17:58:29Z) - Learning Low Dimensional State Spaces with Overparameterized Recurrent
Neural Nets [57.06026574261203]
We provide theoretical evidence for learning low-dimensional state spaces, which can also model long-term memory.
Experiments corroborate our theory, demonstrating extrapolation via learning low-dimensional state spaces with both linear and non-linear RNNs.
arXiv Detail & Related papers (2022-10-25T14:45:15Z) - Simple Local Attentions Remain Competitive for Long-Context Tasks [32.785459927278616]
Many NLP tasks require processing long contexts beyond the length limit of pretrained models.
In order to scale these models to longer text sequences, many efficient long-range attention variants have been proposed.
For each attention variant, we pretrain large-size models using the same long-doc corpus and then finetune these models for real-world long-context tasks.
arXiv Detail & Related papers (2021-12-14T07:37:58Z) - Closed-form Continuous-Depth Models [99.40335716948101]
Continuous-depth neural models rely on advanced numerical differential equation solvers.
We present a new family of models, termed Closed-form Continuous-depth (CfC) networks, that are simple to describe and at least one order of magnitude faster.
arXiv Detail & Related papers (2021-06-25T22:08:51Z) - TimeSHAP: Explaining Recurrent Models through Sequence Perturbations [3.1498833540989413]
Recurrent neural networks are a standard building block in numerous machine learning domains.
The complex decision-making in these models is seen as a black-box, creating a tension between accuracy and interpretability.
In this work, we contribute to filling these gaps by presenting TimeSHAP, a model-agnostic recurrent explainer.
arXiv Detail & Related papers (2020-11-30T19:48:57Z) - Causal Expectation-Maximisation [70.45873402967297]
We show that causal inference is NP-hard even in models characterised by polytree-shaped graphs.
We introduce the causal EM algorithm to reconstruct the uncertainty about the latent variables from data about categorical manifest variables.
We argue that there appears to be an unnoticed limitation to the trending idea that counterfactual bounds can often be computed without knowledge of the structural equations.
arXiv Detail & Related papers (2020-11-04T10:25:13Z) - Neural Additive Vector Autoregression Models for Causal Discovery in
Time Series [1.160208922584163]
We propose a neural approach to causal structure learning that can discover nonlinear relationships.
We train deep neural networks that extract the (additive) Granger causal influences from the time evolution in time series.
The method achieves state-of-the-art results on various benchmark data sets for causal discovery.
arXiv Detail & Related papers (2020-10-19T12:44:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.