Test-time regression: a unifying framework for designing sequence models with associative memory
- URL: http://arxiv.org/abs/2501.12352v1
- Date: Tue, 21 Jan 2025 18:32:31 GMT
- Title: Test-time regression: a unifying framework for designing sequence models with associative memory
- Authors: Ke Alexander Wang, Jiaxin Shi, Emily B. Fox
- Abstract summary: We show that effective sequence models must be able to perform associative recall. Our key insight is that memorizing input tokens through an associative memory is equivalent to performing regression at test-time. We show numerous recent architectures -- including linear attention models, their gated variants, state-space models, online learners, and softmax attention -- emerge naturally as specific approaches to test-time regression.
- Score: 24.915262407519876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sequences provide a remarkably general way to represent and process information. This powerful abstraction has placed sequence modeling at the center of modern deep learning applications, inspiring numerous architectures from transformers to recurrent networks. While this fragmented development has yielded powerful models, it has left us without a unified framework to understand their fundamental similarities and explain their effectiveness. We present a unifying framework motivated by an empirical observation: effective sequence models must be able to perform associative recall. Our key insight is that memorizing input tokens through an associative memory is equivalent to performing regression at test-time. This regression-memory correspondence provides a framework for deriving sequence models that can perform associative recall, offering a systematic lens to understand seemingly ad-hoc architectural choices. We show numerous recent architectures -- including linear attention models, their gated variants, state-space models, online learners, and softmax attention -- emerge naturally as specific approaches to test-time regression. Each architecture corresponds to three design choices: the relative importance of each association, the regressor function class, and the optimization algorithm. This connection leads to new understanding: we provide theoretical justification for QKNorm in softmax attention, and we motivate higher-order generalizations of softmax attention. Beyond unification, our work unlocks decades of rich statistical tools that can guide future development of more powerful yet principled sequence models.
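To make the regression-memory correspondence concrete, here is a minimal numerical sketch (not code from the paper): a linear associative memory fitted to key-value pairs by weighted ridge regression, compared with the outer-product memory of unnormalized linear attention, which can be read as a single gradient step on the same squared-error objective from a zero initialization. The weights, ridge term, and dimensions are illustrative assumptions.

```python
import numpy as np

def regression_memory(K, V, weights=None, ridge=1e-6):
    """Fit a linear associative memory M with M @ k_i ~= v_i by solving a
    weighted ridge-regression problem over the key-value pairs seen so far."""
    n, d = K.shape
    w = np.ones(n) if weights is None else np.asarray(weights)
    gram = K.T @ (w[:, None] * K) + ridge * np.eye(d)
    return V.T @ (w[:, None] * K) @ np.linalg.inv(gram)

def linear_attention_memory(K, V):
    """Unnormalized linear attention memory M = sum_i v_i k_i^T: one gradient
    step on the same squared-error objective, starting from M = 0."""
    return V.T @ K

rng = np.random.default_rng(0)
n, d = 6, 8                              # fewer associations than dimensions
K = rng.normal(size=(n, d))              # keys of the tokens seen so far
V = rng.normal(size=(n, d))              # values to be recalled
q = K[2]                                 # query with a stored key

M_reg = regression_memory(K, V)
M_att = linear_attention_memory(K, V)
print(np.linalg.norm(M_reg @ q - V[2]))  # near-exact recall
print(np.linalg.norm(M_att @ q - V[2]))  # cruder recall (keys not orthonormal)
```

In the paper's framing, each architecture fixes three design choices: the weight given to each association, the regressor function class, and the optimization algorithm; the sketch above simply picks the most elementary option for each.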
Related papers
- It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization [26.3595298111209]
We reconceptualize neural architectures as associative memory modules that learn a mapping of keys and values using an internal objective, referred to as attentional bias.
We present three novel sequence models (Moneta, Yaad, and Memora) that go beyond the power of existing linear RNNs while maintaining a fast parallelizable training process.
For example, certain instances of the Miras framework achieve exceptional performance on tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, even outperforming Transformers and other modern linear recurrent models.
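As a rough illustration of the internal-objective idea (a hypothetical sketch, not the Moneta/Yaad/Memora architectures), one can picture an associative memory whose state takes one gradient step per token on a squared-error "attentional bias" objective, with a retention factor; all constants below are illustrative.

```python
import numpy as np

def attentional_bias_memory(keys, values, lr=0.1, decay=0.95):
    """Toy recurrence: at each step the memory matrix W takes one gradient step
    on an internal squared-error objective ||W k_t - v_t||^2 (a stand-in for an
    attentional bias), after applying a retention/decay factor."""
    d = keys.shape[1]
    W = np.zeros((d, d))
    for k, v in zip(keys, values):
        W = decay * W                                  # forget a little
        W = W - lr * 2.0 * np.outer(W @ k - v, k)      # memorize the new pair
    return W

rng = np.random.default_rng(1)
keys = rng.normal(size=(32, 4))
values = rng.normal(size=(32, 4))
W = attentional_bias_memory(keys, values)
print(np.linalg.norm(W @ keys[-1] - values[-1]))   # residual for the newest pair
print(np.linalg.norm(W @ keys[0] - values[0]))     # older pairs decay away
```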
arXiv Detail & Related papers (2025-04-17T17:59:33Z) - A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations [1.6385815610837167]
We focus on structural repeats, patterns that arise from the repetition of hierarchical relations within sequential data.
We develop a candidate computational model of how humans detect and understand such structural repeats.
arXiv Detail & Related papers (2025-04-14T10:08:28Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data.
Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
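For context, the in-context linear regression task studied there can be set up as below; this is a sketch of the task and an ordinary least-squares baseline, not the paper's trained attention model, and the names and sizes are illustrative.

```python
import numpy as np

def sample_icl_regression_prompt(n_context=16, d=5, noise=0.1, rng=None):
    """One in-context linear regression prompt: context pairs (x_i, y_i) drawn
    from a random weight vector w, plus a query x_q whose target is <w, x_q>."""
    if rng is None:
        rng = np.random.default_rng()
    w = rng.normal(size=d)
    X = rng.normal(size=(n_context, d))
    y = X @ w + noise * rng.normal(size=n_context)
    x_q = rng.normal(size=d)
    return X, y, x_q, x_q @ w

# Ordinary least-squares baseline that an in-context learner is compared against.
X, y, x_q, y_true = sample_icl_regression_prompt(rng=np.random.default_rng(2))
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(abs(x_q @ w_hat - y_true))   # small error when the context is informative
```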
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - Interactive Symbolic Regression through Offline Reinforcement Learning: A Co-Design Framework [11.804368618793273]
Symbolic Regression holds great potential for uncovering underlying mathematical and physical relationships from observed data.
Current state-of-the-art approaches typically do not consider the integration of domain experts' prior knowledge.
We propose the Symbolic Q-network (Sym-Q), an advanced interactive framework for large-scale symbolic regression.
arXiv Detail & Related papers (2025-02-05T06:26:49Z) - ViSymRe: Vision-guided Multimodal Symbolic Regression [12.486013697763228]
We propose a vision-guided multimodal symbolic regression model called ViSymRe. It integrates visual, symbolic, and numeric modalities to enhance symbolic regression. It emphasizes the simplicity and structural rationality of the equations rather than merely numerical fitting.
arXiv Detail & Related papers (2024-12-15T10:05:31Z) - Enhanced Transformer architecture for in-context learning of dynamical systems [0.3749861135832073]
In this paper, we enhance the original meta-modeling framework through three key innovations.
The efficacy of these modifications is demonstrated through a numerical example focusing on the Wiener-Hammerstein system class.
arXiv Detail & Related papers (2024-10-04T10:05:15Z) - State-Space Modeling in Long Sequence Processing: A Survey on Recurrence in the Transformer Era [59.279784235147254]
This survey provides an in-depth summary of the latest approaches that are based on recurrent models for sequential data processing.
The emerging picture suggests that there is room for novel approaches built on learning algorithms that depart from standard Backpropagation Through Time.
arXiv Detail & Related papers (2024-06-13T12:51:22Z) - Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks [50.29356570858905]
We introduce the Dynamical Systems Framework (DSF), which allows a principled investigation of all these architectures in a common representation. We provide principled comparisons between softmax attention and other model classes, discussing the theoretical conditions under which softmax attention can be approximated. This shows the DSF's potential to guide the systematic development of more efficient and scalable future foundation models.
arXiv Detail & Related papers (2024-05-24T17:19:57Z) - Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling [4.190836962132713]
This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms.
At the core of this architecture lies a new data-dependent global convolution layer, whose kernel is contextually conditioned on the input sequence.
We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality.
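The general mechanism of a data-dependent global convolution can be sketched as follows; this is an illustration (FFT-based circular convolution with an input-conditioned kernel), not Orchid's actual layer, and the kernel-generating map here is a deliberately crude placeholder.

```python
import numpy as np

def data_dependent_global_conv(x, kernel_proj):
    """Toy data-dependent global (circular) convolution: the length-L kernel is
    generated from the input sequence itself, then applied via FFT in
    O(L log L) rather than the O(L^2) cost of attention.
    x: (L, d) sequence; kernel_proj: (d,) mixing vector (illustrative)."""
    L = x.shape[0]
    kernel = np.tanh(x @ kernel_proj)           # crude stand-in for a learned
                                                # kernel-generating network
    Xf = np.fft.rfft(x, axis=0)                 # FFT over the sequence dim
    Kf = np.fft.rfft(kernel)[:, None]           # broadcast over channels
    return np.fft.irfft(Xf * Kf, n=L, axis=0)   # circular convolution

rng = np.random.default_rng(3)
x = rng.normal(size=(64, 16))
y = data_dependent_global_conv(x, rng.normal(size=16))
print(y.shape)   # (64, 16)
```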
arXiv Detail & Related papers (2024-02-28T17:36:45Z) - Deep Generative Symbolic Regression [83.04219479605801]
Symbolic regression aims to discover concise closed-form mathematical equations from data.
Existing methods, ranging from search to reinforcement learning, fail to scale with the number of input variables.
We propose an instantiation of our framework, Deep Generative Symbolic Regression.
arXiv Detail & Related papers (2023-12-30T17:05:31Z) - Real-World Compositional Generalization with Disentangled Sequence-to-Sequence Learning [81.24269148865555]
A recently proposed Disentangled sequence-to-sequence model (Dangle) shows promising generalization capability.
We introduce two key modifications to this model which encourage more disentangled representations and improve its compute and memory efficiency.
Specifically, instead of adaptively re-encoding source keys and values at each time step, we disentangle their representations and only re-encode keys periodically.
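A schematic of the decoding loop this describes, with placeholder encoder and decoder callables; all names here are hypothetical, and only the pattern of encoding values once while re-encoding keys every few steps is taken from the summary.

```python
from typing import Callable, List, Sequence

def decode_with_periodic_key_reencoding(
    source: Sequence[str],
    encode_keys: Callable[[Sequence[str], List[str]], list],
    encode_values: Callable[[Sequence[str]], list],
    decoder_step: Callable[[list, list, List[str]], str],
    max_steps: int = 8,
    period: int = 4,
) -> List[str]:
    """Source values are encoded once; source keys are re-encoded only every
    `period` decoding steps instead of at every step."""
    values = encode_values(source)              # fixed for the whole decode
    prefix: List[str] = []
    keys = encode_keys(source, prefix)          # initial key encoding
    for t in range(max_steps):
        if t > 0 and t % period == 0:
            keys = encode_keys(source, prefix)  # periodic, not per-step
        prefix.append(decoder_step(keys, values, prefix))
    return prefix

# Dummy callables just to show the call pattern.
out = decode_with_periodic_key_reencoding(
    source=["a", "b", "c"],
    encode_keys=lambda src, prefix: [f"k:{s}|{len(prefix)}" for s in src],
    encode_values=lambda src: [f"v:{s}" for s in src],
    decoder_step=lambda ks, vs, prefix: f"tok{len(prefix)}",
)
print(out)
```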
arXiv Detail & Related papers (2022-12-12T15:40:30Z) - Robust Graph Representation Learning via Predictive Coding [46.22695915912123]
Predictive coding is a message-passing framework initially developed to model information processing in the brain.
In this work, we build models that rely on the message-passing rule of predictive coding.
We show that the proposed models are comparable to standard ones in terms of performance in both inductive and transductive tasks.
arXiv Detail & Related papers (2022-12-09T03:58:22Z) - Learning Sequence Representations by Non-local Recurrent Neural Memory [61.65105481899744]
We propose a Non-local Recurrent Neural Memory (NRNM) for supervised sequence representation learning.
Our model is able to capture long-range dependencies and to distill latent high-level features.
Our model compares favorably against other state-of-the-art methods specifically designed for each of these sequence applications.
arXiv Detail & Related papers (2022-07-20T07:26:15Z) - Towards a Predictive Processing Implementation of the Common Model of Cognition [79.63867412771461]
We describe an implementation of the common model of cognition grounded in neural generative coding and holographic associative memory.
The proposed system creates the groundwork for developing agents that learn continually from diverse tasks as well as model human performance at larger scales.
arXiv Detail & Related papers (2021-05-15T22:55:23Z) - Self-Reflective Variational Autoencoder [21.054722609128525]
Variational Autoencoder (VAE) is a powerful framework for learning latent variable generative models.
We introduce a solution, which we call self-reflective inference.
We empirically demonstrate the clear advantages of matching the variational posterior to the exact posterior.
arXiv Detail & Related papers (2020-07-10T05:05:26Z) - SEEK: Segmented Embedding of Knowledge Graphs [77.5307592941209]
We propose a lightweight modeling framework that can achieve highly competitive relational expressiveness without increasing the model complexity.
Our framework focuses on the design of scoring functions and highlights two critical characteristics: 1) facilitating sufficient feature interactions; 2) preserving both symmetry and antisymmetry properties of relations.
arXiv Detail & Related papers (2020-05-02T15:15:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.