A Theory of Learning with Autoregressive Chain of Thought
- URL: http://arxiv.org/abs/2503.07932v1
- Date: Tue, 11 Mar 2025 00:21:32 GMT
- Title: A Theory of Learning with Autoregressive Chain of Thought
- Authors: Nirmit Joshi, Gal Vardi, Adam Block, Surbhi Goel, Zhiyuan Li, Theodor Misiakiewicz, Nathan Srebro
- Abstract summary: We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning.
- Score: 46.4004951842894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For a given base class of sequence-to-next-token generators, we consider learning prompt-to-answer mappings obtained by iterating a fixed, time-invariant generator for multiple steps, thus generating a chain-of-thought, and then taking the final token as the answer. We formalize the learning problems both when the chain-of-thought is observed and when training only on prompt-answer pairs, with the chain-of-thought latent. We analyze the sample and computational complexity both in terms of general properties of the base class (e.g. its VC dimension) and for specific base classes such as linear thresholds. We present a simple base class that allows for universal representability and computationally tractable chain-of-thought learning. Central to our development is that time invariance allows for sample complexity that is independent of the length of the chain-of-thought. Attention arises naturally in our construction.
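For illustration, a minimal Python sketch of the setup the abstract describes (function and variable names are ours, not the paper's): a fixed, time-invariant base generator is applied repeatedly to extend the prompt, and the answer is read off the final token.

```python
def chain_of_thought_answer(generator, prompt, num_steps):
    """generator: a fixed sequence-to-next-token map; prompt: list of tokens.
    Iterates the same generator num_steps times (time invariance) and returns
    the final token as the answer."""
    seq = list(prompt)
    for _ in range(num_steps):
        seq.append(generator(seq))   # each intermediate token is a chain-of-thought step
    return seq[-1]

# Toy example: the generator emits the parity of the sequence so far.
answer = chain_of_thought_answer(lambda seq: sum(seq) % 2, prompt=[1, 0, 1], num_steps=5)
```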
Related papers
- Rademacher learning rates for iterated random functions [0.0]
We consider the case where the training dataset is generated by an iterated random function that is not necessarily irreducible or aperiodic. Under the assumption that the governing function is contractive with respect to its first argument, we first establish a uniform convergence result for the corresponding sample error. We then demonstrate the learnability of the approximate empirical risk minimization algorithm and derive its learning rate bound.
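For intuition, a minimal sketch of one iterated random function satisfying the contraction assumption (an AR(1)-style recursion; this simple instance is only an illustration of the setting, not the paper's general construction):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x, theta, a=0.7):
    # Contractive in its first argument: |f(x, t) - f(x', t)| = a * |x - x'| with a < 1.
    return a * x + theta

# Training data generated by iterating the random function: x_{t+1} = f(x_t, theta_t).
# The resulting chain is dependent, so i.i.d. uniform-convergence bounds do not apply directly.
T = 1000
x = np.empty(T)
x[0] = rng.normal()
for t in range(T - 1):
    x[t + 1] = f(x[t], rng.normal())
```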
arXiv Detail & Related papers (2025-06-16T19:36:13Z) - A system identification approach to clustering vector autoregressive time series [50.66782357329375]
Clustering time series based on their underlying dynamics continues to attract researchers because of its role in modelling complex systems. Most current time series clustering methods handle only scalar time series, treat them as white noise, or rely on domain knowledge for high-quality feature construction. Instead of relying on feature or metric construction, the system identification approach treats vector time series clustering by explicitly considering their underlying autoregressive dynamics.
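A hedged sketch of the general idea (the VAR(1) fit, the k-means step, and all names below are illustrative assumptions, not the paper's algorithm): identify each series' autoregressive dynamics and cluster the estimated coefficients, rather than hand-crafted features.

```python
import numpy as np
from sklearn.cluster import KMeans

def var1_coefficients(Y):
    """Least-squares fit of Y_t ~ A Y_{t-1} for a (T, d) vector time series;
    returns the estimated coefficient matrix flattened into a feature vector."""
    past, future = Y[:-1], Y[1:]
    A_T, *_ = np.linalg.lstsq(past, future, rcond=None)   # solves past @ A_T ~ future
    return A_T.T.ravel()

def cluster_by_dynamics(series_list, n_clusters=3):
    """Cluster vector time series by their fitted autoregressive dynamics."""
    features = np.stack([var1_coefficients(Y) for Y in series_list])
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
```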
arXiv Detail & Related papers (2025-05-20T14:31:44Z) - Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention.
We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS.
We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
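To make the linear-predictor view concrete, a hedged numerical sketch (the merged matrices A = W_Q^T W_K and B = W_V and the particular feature map are illustrative choices consistent with the claim, not necessarily the paper's exact construction): a single layer of linear, softmax-free attention is a linear function of fixed features built from x_i and sum_j x_j x_j^T.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4                          # sequence length, embedding dimension
X = rng.normal(size=(n, d))          # token embeddings x_1, ..., x_n

A = rng.normal(size=(d, d))          # plays the role of W_Q^T W_K (merged)
B = rng.normal(size=(d, d))          # plays the role of W_V

# Linear (softmax-free) attention: out_i = sum_j (x_i^T A x_j) * B x_j
scores = X @ A @ X.T                 # scores[i, j] = x_i^T A x_j
out = scores @ X @ B.T               # shape (n, d)

# The same map written as a linear predictor over fixed features:
# phi[i] = x_i (outer) sum_j x_j x_j^T, parameter tensor P[a,b,c,e] = A[a,b] * B[c,e]
M = X.T @ X                          # sum_j x_j x_j^T
phi = np.einsum('ia,be->iabe', X, M)
P = np.einsum('ab,ce->abce', A, B)
out_linear = np.einsum('iabe,abce->ic', phi, P)

assert np.allclose(out, out_linear)  # both computations agree
```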
arXiv Detail & Related papers (2024-10-14T02:41:01Z) - Auto-Regressive Next-Token Predictors are Universal Learners [17.416520406390415]
We show that even simple models such as linear next-token predictors can approximate any function efficiently computed by a Turing machine.
We also show experimentally that simple next-token predictors, such as linear networks and shallow Multi-Layer Perceptrons (MLPs), display non-trivial performance on text generation and arithmetic tasks.
arXiv Detail & Related papers (2023-09-13T14:15:03Z) - Momentum Contrastive Pre-training for Question Answering [54.57078061878619]
MCROSS introduces a momentum contrastive learning framework to align the answer probability between cloze-like and natural query-passage sample pairs.
Our method achieves noticeable improvement compared with all baselines in both supervised and zero-shot scenarios.
arXiv Detail & Related papers (2022-12-12T08:28:22Z) - Arbitrarily Large Labelled Random Satisfiability Formulas for Machine
Learning Training [5.414308305392762]
We show how to generate correctly labeled random formulas of any desired size without having to solve the underlying decision problem.
We train existing state-of-the-art models for the task of predicting satisfiability on formulas with 10,000 variables.
We find that they do no better than random guessing on these formulas.
arXiv Detail & Related papers (2022-11-21T17:52:13Z) - Multi-Label Quantification [78.83284164605473]
Quantification, variously called "labelled prevalence estimation" or "learning to quantify", is the supervised learning task of generating predictors of the relative frequencies of the classes of interest in unlabelled data samples.
We propose methods for inferring estimators of class prevalence values that strive to leverage the dependencies among the classes of interest in order to predict their relative frequencies more accurately.
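For orientation, a hedged sketch of the simplest quantifier for this task (classify-and-count, a standard naive baseline; not the dependency-aware estimators the paper proposes):

```python
import numpy as np

def classify_and_count(predict_label_matrix, unlabelled_items):
    """Naive multi-label quantifier: predict a binary item-by-class label matrix,
    then report, for each class, the fraction of items predicted to carry it."""
    Y_hat = predict_label_matrix(unlabelled_items)   # shape (n_items, n_classes), entries in {0, 1}
    return Y_hat.mean(axis=0)                        # per-class estimated prevalence
```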
arXiv Detail & Related papers (2022-11-15T11:29:59Z) - Unsupervised Learning of Equivariant Structure from Sequences [30.974508897223124]
We present an unsupervised framework to learn the symmetry structure from time sequences of length at least three.
We will demonstrate that, with our framework, the hidden disentangled structure of the dataset naturally emerges as a by-product.
arXiv Detail & Related papers (2022-10-12T07:29:18Z) - Language Models Are Greedy Reasoners: A Systematic Formal Analysis of Chain-of-Thought [10.524051272257614]
Large language models (LLMs) have shown remarkable reasoning capabilities given chain-of-thought prompts.
We present a new synthetic question-answering dataset called PrOntoQA, where each example is generated as a synthetic world model.
This allows us to parse the generated chain-of-thought into symbolic proofs for formal analysis.
arXiv Detail & Related papers (2022-10-03T21:34:32Z) - Complexity-Based Prompting for Multi-Step Reasoning [72.0057198610614]
We study the task of prompting large-scale language models to perform multi-step reasoning.
A central question is which reasoning examples make the most effective prompts.
We propose complexity-based prompting, a simple and effective example selection scheme for multi-step reasoning.
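A minimal sketch of the selection rule as described (treating "complexity" as the number of reasoning steps in a worked example's rationale; the helper name and step-counting heuristic are assumptions for illustration, not the paper's code):

```python
def select_complex_examples(candidates, k=8):
    """candidates: list of (question, rationale, answer) tuples.
    Returns the k examples whose rationales have the most reasoning steps,
    counting one step per line of the rationale."""
    def num_steps(example):
        _, rationale, _ = example
        return rationale.count("\n") + 1
    return sorted(candidates, key=num_steps, reverse=True)[:k]
```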
arXiv Detail & Related papers (2022-10-03T05:33:27Z) - Contrastive learning of strong-mixing continuous-time stochastic processes [53.82893653745542]
Contrastive learning is a family of self-supervised methods where a model is trained to solve a classification task constructed from unlabeled data.
We show that a properly constructed contrastive learning task can be used to estimate the transition kernel for small-to-mid-range intervals in the diffusion case.
arXiv Detail & Related papers (2021-03-03T23:06:47Z) - Seq2Tens: An Efficient Representation of Sequences by Low-Rank Tensor Projections [11.580603875423408]
Sequential data such as time series, video, or text can be challenging to analyse.
At the heart of this is non-commutativity, in the sense that reordering the elements of a sequence can completely change its meaning.
We use a classical mathematical object -- the tensor algebra -- to capture such dependencies.
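As a toy illustration of that point (a plain degree-2 tensor feature, not the paper's low-rank Seq2Tens projections): summing outer products over ordered pairs yields a feature that changes when the sequence is reordered, unlike an orderless sum.

```python
import numpy as np

def degree2_tensor_feature(seq):
    """Order-sensitive feature: sum of x_i (outer) x_j over ordered pairs i < j."""
    d = seq.shape[1]
    feat = np.zeros((d, d))
    for i in range(len(seq)):
        for j in range(i + 1, len(seq)):
            feat += np.outer(seq[i], seq[j])
    return feat

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))                          # a sequence of five 3-dimensional elements
print(np.allclose(degree2_tensor_feature(x),
                  degree2_tensor_feature(x[::-1])))  # False: reordering changes the feature
```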
arXiv Detail & Related papers (2020-06-12T09:24:35Z)