Test-time regression: a unifying framework for designing sequence models with associative memory
- URL: http://arxiv.org/abs/2501.12352v2
- Date: Tue, 29 Apr 2025 17:47:20 GMT
- Title: Test-time regression: a unifying framework for designing sequence models with associative memory
- Authors: Ke Alexander Wang, Jiaxin Shi, Emily B. Fox
- Abstract summary: We introduce a unifying framework to understand and derive sequence models. We formalize associative recall as a two-step process, memorization and retrieval, casting memorization as a regression problem. Our work bridges sequence modeling with classic regression methods, paving the way for developing more powerful and theoretically principled architectures.
- Score: 24.915262407519876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sequence models lie at the heart of modern deep learning. However, rapid advancements have produced a diversity of seemingly unrelated architectures, such as Transformers and recurrent alternatives. In this paper, we introduce a unifying framework to understand and derive these sequence models, inspired by the empirical importance of associative recall, the capability to retrieve contextually relevant tokens. We formalize associative recall as a two-step process, memorization and retrieval, casting memorization as a regression problem. Layers that combine these two steps perform associative recall via "test-time regression" over their input tokens. Prominent layers, including linear attention, state-space models, fast-weight programmers, online learners, and softmax attention, arise as special cases defined by three design choices: the regression weights, the regressor function class, and the test-time optimization algorithm. Our approach clarifies how linear attention fails to capture inter-token correlations and offers a mathematical justification for the empirical effectiveness of query-key normalization in softmax attention. Further, it illuminates unexplored regions within the design space, which we use to derive novel higher-order generalizations of softmax attention. Beyond unification, our work bridges sequence modeling with classic regression methods, a field with extensive literature, paving the way for developing more powerful and theoretically principled architectures.
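To make the memorize-then-retrieve view concrete, here is a minimal sketch (illustrative only, not the paper's implementation) of linear attention read as test-time regression: memorization accumulates key-value outer products into a fast-weight matrix, and retrieval applies that matrix to a query. The `memorize`/`retrieve` names are hypothetical.

```python
import numpy as np

# Minimal sketch (not the paper's code): linear attention viewed as
# test-time regression with an associative key-value memory.

def memorize(keys: np.ndarray, values: np.ndarray) -> np.ndarray:
    """Memorization step: accumulate a fast-weight matrix M = sum_t v_t k_t^T.

    This is the correlation-blind regression solution that linear attention
    implicitly uses; an ordinary least-squares fit would instead use
    M = V^T K (K^T K)^{-1}, which accounts for inter-key correlations.
    """
    return values.T @ keys              # shape (d_v, d_k)

def retrieve(memory: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Retrieval step: read out the value predicted for a query, M @ q."""
    return memory @ query               # shape (d_v,)

# Toy usage: store three key-value pairs, then recall with the second key.
rng = np.random.default_rng(0)
K = rng.standard_normal((3, 8))         # keys, one row per token
V = rng.standard_normal((3, 2))         # values to be recalled
M = memorize(K, V)
print(retrieve(M, K[1]))                # approximates V[1] only when keys are near-orthogonal
```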
Related papers
- It's All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization [26.3595298111209]
We reconceptualize neural architectures as associative memory modules that learn a mapping between keys and values using an internal objective, referred to as attentional bias.
We present three novel sequence models (Moneta, Yaad, and Memora) that go beyond the power of existing linear RNNs while maintaining a fast, parallelizable training process.
For example, certain instances of Miras (the paper's overarching framework) achieve exceptional performance in tasks such as language modeling, commonsense reasoning, and recall-intensive tasks, even outperforming Transformers and other modern linear recurrent models.
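As a generic, hedged illustration of the internal-objective idea (not the actual Moneta/Yaad/Memora update rules), a memory matrix can be trained at test time with one online gradient step per token on a squared-error attentional-bias objective; the function name and learning rate below are illustrative.

```python
import numpy as np

# Generic sketch of an associative memory trained with an internal objective:
# one online gradient step on 0.5 * ||M k_t - v_t||^2 per incoming token.
# Illustrative only; not the Moneta/Yaad/Memora update rules.

def online_memory(keys: np.ndarray, values: np.ndarray, lr: float = 0.1) -> np.ndarray:
    d_v, d_k = values.shape[1], keys.shape[1]
    M = np.zeros((d_v, d_k))
    for k, v in zip(keys, values):
        err = M @ k - v                  # prediction error for this token
        M -= lr * np.outer(err, k)       # grad of 0.5*||M k - v||^2 w.r.t. M is err k^T
    return M

rng = np.random.default_rng(1)
K = rng.standard_normal((32, 8))
V = rng.standard_normal((32, 2))
M = online_memory(K, V)
print(np.linalg.norm(M @ K[-1] - V[-1]))  # residual for the most recently memorized pair
```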
arXiv Detail & Related papers (2025-04-17T17:59:33Z) - A Computational Cognitive Model for Processing Repetitions of Hierarchical Relations [1.6385815610837167]
We focus on structural repeats, patterns that arise from the repetition of hierarchical relations within sequential data.
We develop a candidate computational model of how humans detect and understand such structural repeats.
arXiv Detail & Related papers (2025-04-14T10:08:28Z) - In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention [52.159541540613915]
We study how multi-head softmax attention models are trained to perform in-context learning on linear data.
Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution.
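For context, a generic sketch of the standard in-context linear-regression setup studied in this line of work (dimensions, noise level, and the function name are illustrative): demonstration pairs (x_i, y_i = w^T x_i) and a query input are packed into one sequence, and the trained attention model must predict the query's label in context.

```python
import numpy as np

# Standard in-context linear-regression task (illustrative dimensions): the
# model sees (x_i, y_i) demonstration pairs plus a query x, and must infer
# the task vector w in context to predict y for the query.

def sample_icl_prompt(n_demos=8, d=4, noise=0.0, seed=2):
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)                       # task vector, fixed per prompt
    X = rng.standard_normal((n_demos + 1, d))        # demonstrations + final query input
    y = X @ w + noise * rng.standard_normal(n_demos + 1)
    target = y[-1]                                   # label the model must predict
    y_visible = y.copy()
    y_visible[-1] = 0.0                              # the query's label is masked in the prompt
    return X, y_visible, target

X, y_visible, target = sample_icl_prompt()
print(X.shape, y_visible.shape, round(float(target), 3))
```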
arXiv Detail & Related papers (2025-03-17T02:00:49Z) - Interactive Symbolic Regression through Offline Reinforcement Learning: A Co-Design Framework [11.804368618793273]
Symbolic Regression holds great potential for uncovering underlying mathematical and physical relationships from observed data.
Current state-of-the-art approaches typically do not consider the integration of domain experts' prior knowledge.
We propose the Symbolic Q-network (Sym-Q), an advanced interactive framework for large-scale symbolic regression.
arXiv Detail & Related papers (2025-02-05T06:26:49Z) - ViSymRe: Vision-guided Multimodal Symbolic Regression [12.486013697763228]
We propose a vision-guided multimodal symbolic regression model called ViSymRe. It integrates vision, symbolic, and numeric modalities to enhance symbolic regression. It emphasizes the simplicity and structural rationality of the equations rather than merely numerical fitting.
arXiv Detail & Related papers (2024-12-15T10:05:31Z) - Enhanced Transformer architecture for in-context learning of dynamical systems [0.3749861135832073]
In this paper, we enhance the original meta-modeling framework through three key innovations.
The efficacy of these modifications is demonstrated through a numerical example focusing on the Wiener-Hammerstein system class.
arXiv Detail & Related papers (2024-10-04T10:05:15Z) - State-Space Modeling in Long Sequence Processing: A Survey on Recurrence in the Transformer Era [59.279784235147254]
This survey provides an in-depth summary of the latest approaches that are based on recurrent models for sequential data processing.
The emerging picture suggests that there is room for thinking of novel routes, constituted by learning algorithms which depart from the standard Backpropagation Through Time.
arXiv Detail & Related papers (2024-06-13T12:51:22Z) - Understanding the differences in Foundation Models: Attention, State Space Models, and Recurrent Neural Networks [50.29356570858905]
We introduce the Dynamical Systems Framework (DSF), which allows a principled investigation of all these architectures in a common representation. We provide principled comparisons between softmax attention and other model classes, discussing the theoretical conditions under which softmax attention can be approximated. This shows the DSF's potential to guide the systematic development of more efficient and scalable foundation models.
arXiv Detail & Related papers (2024-05-24T17:19:57Z) - Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling [4.190836962132713]
This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms.
At the core of this architecture lies a new data-dependent global convolution layer, which contextually adapts its kernel conditioned on the input sequence.
We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality.
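As a rough, hedged sketch of what a data-dependent global convolution can look like (not Orchid's actual kernel-conditioning network; the pooling, decay, and `W_kernel` map are invented for illustration): a length-L kernel is derived from the input itself and applied as a circular convolution in O(L log L) via the FFT.

```python
import numpy as np

# Rough sketch of a data-dependent global convolution (illustrative; not
# Orchid's actual kernel-conditioning network): a length-L kernel is derived
# from the input itself and applied as a circular convolution via the FFT.

def data_dependent_global_conv(x: np.ndarray, W_kernel: np.ndarray) -> np.ndarray:
    """x: (L, d) input sequence; W_kernel: (d, d) toy conditioning map (hypothetical)."""
    L = x.shape[0]
    ctx = x.mean(axis=0) @ W_kernel                   # condition on a pooled summary of the input
    decay = np.exp(-np.arange(L) / L)[:, None]        # (L, 1) positional damping
    kernel = decay * ctx[None, :]                     # (L, d) input-dependent kernel, per channel
    # Global (circular) convolution per channel in O(L log L) via the FFT.
    return np.fft.ifft(np.fft.fft(x, axis=0) * np.fft.fft(kernel, axis=0), axis=0).real

rng = np.random.default_rng(3)
x = rng.standard_normal((64, 16))
W = rng.standard_normal((16, 16)) / 4.0
print(data_dependent_global_conv(x, W).shape)         # (64, 16)
```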
arXiv Detail & Related papers (2024-02-28T17:36:45Z) - Deep Generative Symbolic Regression [83.04219479605801]
Symbolic regression aims to discover concise closed-form mathematical equations from data.
Existing methods, ranging from search to reinforcement learning, fail to scale with the number of input variables.
We propose an instantiation of our framework, Deep Generative Symbolic Regression.
arXiv Detail & Related papers (2023-12-30T17:05:31Z) - Real-World Compositional Generalization with Disentangled Sequence-to-Sequence Learning [81.24269148865555]
A recently proposed Disentangled sequence-to-sequence model (Dangle) shows promising generalization capability.
We introduce two key modifications to this model which encourage more disentangled representations and improve its compute and memory efficiency.
Specifically, instead of adaptively re-encoding source keys and values at each time step, we disentangle their representations and only re-encode keys periodically.
arXiv Detail & Related papers (2022-12-12T15:40:30Z) - Robust Graph Representation Learning via Predictive Coding [46.22695915912123]
Predictive coding is a message-passing framework initially developed to model information processing in the brain.
In this work, we build models that rely on the message-passing rule of predictive coding.
We show that the proposed models are comparable to standard ones in terms of performance in both inductive and transductive tasks.
arXiv Detail & Related papers (2022-12-09T03:58:22Z) - Learning Sequence Representations by Non-local Recurrent Neural Memory [61.65105481899744]
We propose a Non-local Recurrent Neural Memory (NRNM) for supervised sequence representation learning.
Our model captures long-range dependencies and distills latent high-level features.
Our model compares favorably against other state-of-the-art methods specifically designed for each of these sequence applications.
arXiv Detail & Related papers (2022-07-20T07:26:15Z) - Towards a Predictive Processing Implementation of the Common Model of Cognition [79.63867412771461]
We describe an implementation of the common model of cognition grounded in neural generative coding and holographic associative memory.
The proposed system creates the groundwork for developing agents that learn continually from diverse tasks as well as model human performance at larger scales.
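For readers unfamiliar with the term, a textbook holographic associative memory (holographic reduced representations) binds key-value pairs by circular convolution, superposes the traces, and unbinds with circular correlation; the sketch below only illustrates the term and is not this paper's implementation.

```python
import numpy as np

# Textbook holographic associative memory (holographic reduced representations):
# bind key-value pairs by circular convolution, superpose the traces, and
# unbind with circular correlation. Illustrates the term only; not this
# paper's implementation.

def bind(k: np.ndarray, v: np.ndarray) -> np.ndarray:
    return np.fft.ifft(np.fft.fft(k) * np.fft.fft(v)).real

def unbind(trace: np.ndarray, k: np.ndarray) -> np.ndarray:
    return np.fft.ifft(np.fft.fft(trace) * np.conj(np.fft.fft(k))).real

rng = np.random.default_rng(4)
d = 1024
keys = rng.standard_normal((3, d)) / np.sqrt(d)      # near-unit-norm random keys
vals = rng.standard_normal((3, d))
trace = sum(bind(k, v) for k, v in zip(keys, vals))  # superposed memory trace
recalled = unbind(trace, keys[0])
print(np.corrcoef(recalled, vals[0])[0, 1])          # clearly positive: noisy but recognizable recall
```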
arXiv Detail & Related papers (2021-05-15T22:55:23Z) - Self-Reflective Variational Autoencoder [21.054722609128525]
Variational Autoencoder (VAE) is a powerful framework for learning latent variable generative models.
We introduce a solution, which we call self-reflective inference.
We empirically demonstrate the clear advantages of matching the variational posterior to the exact posterior.
arXiv Detail & Related papers (2020-07-10T05:05:26Z) - SEEK: Segmented Embedding of Knowledge Graphs [77.5307592941209]
We propose a lightweight modeling framework that can achieve highly competitive relational expressiveness without increasing the model complexity.
Our framework focuses on the design of scoring functions and highlights two critical characteristics: 1) facilitating sufficient feature interactions; 2) preserving both symmetry and antisymmetry properties of relations.
arXiv Detail & Related papers (2020-05-02T15:15:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.