Learning Compositional Functions with Transformers from Easy-to-Hard Data
- URL: http://arxiv.org/abs/2505.23683v1
- Date: Thu, 29 May 2025 17:22:00 GMT
- Title: Learning Compositional Functions with Transformers from Easy-to-Hard Data
- Authors: Zixuan Wang, Eshaan Nichani, Alberto Bietti, Alex Damian, Daniel Hsu, Jason D. Lee, Denny Wu
- Abstract summary: We study the learnability of the $k$-fold composition task, which requires computing an interleaved composition of $k$ input permutations and $k$ hidden permutations. We show that this function class can be efficiently learned, with runtime and sample complexity polynomial in $k$, by gradient descent on an $O(\log k)$-depth transformer.
- Score: 63.96562216704653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based language models have demonstrated impressive capabilities across a range of complex reasoning tasks. Prior theoretical work exploring the expressive power of transformers has shown that they can efficiently perform multi-step reasoning tasks involving parallelizable computations. However, the learnability of such constructions, particularly the conditions on the data distribution that enable efficient learning via gradient-based optimization, remains an open question. Towards answering this question, in this work we study the learnability of the $k$-fold composition task, which requires computing an interleaved composition of $k$ input permutations and $k$ hidden permutations, and can be expressed by a transformer with $O(\log k)$ layers. On the negative front, we prove a Statistical Query (SQ) lower bound showing that any SQ learner that makes only polynomially-many queries to an SQ oracle for the $k$-fold composition task distribution must have sample size exponential in $k$, thus establishing a statistical-computational gap. On the other hand, we show that this function class can be efficiently learned, with runtime and sample complexity polynomial in $k$, by gradient descent on an $O(\log k)$-depth transformer via two different curriculum learning strategies: one in which data consists of $k'$-fold composition functions with $k' \le k$ presented in increasing difficulty, and another in which all such data is presented simultaneously. Our work sheds light on the necessity and sufficiency of having both easy and hard examples in the data distribution for transformers to learn complex compositional tasks.
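As an informal illustration of the task described in the abstract (not the paper's exact formulation), the sketch below alternately applies observed input permutations and fixed hidden permutations to a symbol. The symbol-set size, the interleaving order, and the input encoding are assumptions made only for this example.

```python
import numpy as np

def k_fold_composition(input_perms, hidden_perms, x):
    """Illustrative k-fold composition (assumed convention): alternately
    apply each observed input permutation pi_i and each hidden permutation
    sigma_i to a symbol x in {0, ..., n-1}."""
    assert len(input_perms) == len(hidden_perms)
    for pi, sigma in zip(input_perms, hidden_perms):
        x = pi[x]      # observed permutation, given as part of the input
        x = sigma[x]   # hidden permutation, fixed by the task
    return int(x)

# Toy usage: k = 3 interleaved permutation pairs over n = 5 symbols.
rng = np.random.default_rng(0)
n, k = 5, 3
input_perms = [rng.permutation(n) for _ in range(k)]
hidden_perms = [rng.permutation(n) for _ in range(k)]
print(k_fold_composition(input_perms, hidden_perms, x=0))
```

In the curriculum settings the abstract describes, training instances would use $k' \le k$ permutation pairs, presented either in increasing order of $k'$ or all mixed together.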
Related papers
- In-Context Occam's Razor: How Transformers Prefer Simpler Hypotheses on the Fly [25.47694115798524]
In-context learning (ICL) enables transformers to adapt to new tasks through contextual examples without parameter updates.
This paper investigates how transformers navigate hierarchical task structures where higher-complexity categories can perfectly represent any pattern generated by simpler ones.
arXiv Detail & Related papers (2025-06-24T06:33:00Z)
- Sample Complexity and Representation Ability of Test-time Scaling Paradigms [91.34339030453425]
Test-time scaling paradigms have advanced the capabilities of large language models (LLMs) on complex tasks.
We study the sample efficiency of various test-time strategies, such as self-consistency, best-of-$n$, and self-correction.
A single Transformer architecture can provably solve multiple tasks without prior knowledge of the specific task associated with a user query.
arXiv Detail & Related papers (2025-06-05T17:48:19Z)
- Learning the symmetric group: large from small [44.99833362998488]
We show that a transformer neural network trained on predicting permutations can generalize to the symmetric group $S_{25}$ with near 100% accuracy.
We employ identity augmentation as a key tool to manage variable word lengths, and partitioned windows for training on adjacent transpositions.
arXiv Detail & Related papers (2025-02-18T10:28:25Z)
- Inverse Entropic Optimal Transport Solves Semi-supervised Learning via Data Likelihood Maximization [65.8915778873691]
Learning conditional distributions is a central problem in machine learning.
We propose a new paradigm that integrates both paired and unpaired data.
We show that our approach can theoretically recover true conditional distributions with arbitrarily small error.
arXiv Detail & Related papers (2024-10-03T16:12:59Z)
- Sampling Foundational Transformer: A Theoretical Perspective [12.7600763629179]
We propose the Sampling Foundational Transformer (SFT), which can work on multiple data modalities.
SFT achieves competitive results on many benchmarks while being faster at inference than other, highly specialized models.
arXiv Detail & Related papers (2024-08-11T16:53:09Z) - The Balanced-Pairwise-Affinities Feature Transform [2.3020018305241337]
The BPA feature transform is designed to upgrade the features of a set of input items to facilitate downstream matching or grouping tasks.
A particular min-cost-max-flow fractional matching problem leads to a transform which is efficient, differentiable, equivariant, parameterless and probabilistically interpretable.
Empirically, the transform is highly effective and flexible in its use and consistently improves networks it is inserted into, in a variety of tasks and training schemes.
arXiv Detail & Related papers (2024-06-25T14:28:05Z) - Computational-Statistical Gaps in Gaussian Single-Index Models [77.1473134227844]
Single-Index Models are high-dimensional regression problems with planted structure.
We show that computationally efficient algorithms, both within the Statistical Query (SQ) and the Low-Degree Polynomial (LDP) frameworks, necessarily require $\Omega(d^{k^\star/2})$ samples.
arXiv Detail & Related papers (2024-03-08T18:50:19Z) - Theoretical Understanding of In-Context Learning in Shallow Transformers with Unstructured Data [21.242708937367865]
Large language models (LLMs) are powerful models that can learn concepts at the inference stage via in-context learning (ICL).
This paper studies the role of each component in the transformer architecture and provides a theoretical understanding to explain the success of the architecture.
arXiv Detail & Related papers (2024-02-01T16:39:45Z) - What Dense Graph Do You Need for Self-Attention? [73.82686008622596]
We present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and achieves comparable or even better results than the vanilla Transformer.
Experiments on tasks requiring various sequence lengths validate that our graph function works well.
arXiv Detail & Related papers (2022-05-27T14:36:55Z) - On the Theory of Transfer Learning: The Importance of Task Diversity [114.656572506859]
We consider $t+1$ tasks parameterized by functions of the form $f_j \circ h$ in a general function class $\mathcal{F} \circ \mathcal{H}$.
We show that for diverse training tasks the sample complexity needed to learn the shared representation across the first $t$ training tasks scales as $C(\mathcal{H}) + t C(\mathcal{F})$.
arXiv Detail & Related papers (2020-06-20T20:33:59Z)
- $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections.
We also present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
arXiv Detail & Related papers (2020-06-08T18:30:12Z)