Counting and Algorithmic Generalization with Transformers
- URL: http://arxiv.org/abs/2310.08661v2
- Date: Fri, 12 Jan 2024 20:26:31 GMT
- Title: Counting and Algorithmic Generalization with Transformers
- Authors: Simon Ouellette, Rolf Pfister, Hansueli Jud
- Abstract summary: We show that standard Transformers are based on architectural decisions that hinder out-of-distribution performance.
We demonstrate that a modified transformer can exhibit good algorithmic generalization performance on counting.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Algorithmic generalization in machine learning refers to the ability to learn
the underlying algorithm that generates data in a way that generalizes
out-of-distribution. This is generally considered a difficult task for most
machine learning algorithms. Here, we analyze algorithmic generalization when
counting is required, either implicitly or explicitly. We show that standard
Transformers are based on architectural decisions that hinder
out-of-distribution performance for such tasks. In particular, we discuss the
consequences of using layer normalization and of normalizing the attention
weights via softmax. By ablating the problematic operations, we
demonstrate that a modified transformer can exhibit good algorithmic
generalization performance on counting while using a very lightweight
architecture.
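The softmax issue described in the abstract can be illustrated with a small numerical sketch (a hypothetical toy example, not the authors' code): because softmax forces attention weights to sum to 1, the attended output is an average of the value vectors rather than a sum, so its magnitude cannot grow with the number of matching tokens — which is exactly what counting requires out-of-distribution.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: weights sum to 1 by construction
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(scores, values, normalize=True):
    """One attention head for a single query: weighted sum of values."""
    w = softmax(scores) if normalize else np.ones_like(scores)
    return float(w @ values)

for n in [5, 50]:
    scores = np.zeros(n)   # all n tokens match the query equally
    values = np.ones(n)    # each matching token contributes value 1.0
    # Softmax-normalized attention averages: output is 1.0 regardless of n
    print(n, attend(scores, values, normalize=True))
    # Unnormalized attention sums: output grows with n, so it can count
    print(n, attend(scores, values, normalize=False))
```

Under this toy setup, the normalized output stays constant as the sequence grows, while the unnormalized output scales with the token count — a minimal sketch of why removing the normalization can help counting tasks.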
Related papers
- Anomaly Detection with Machine Learning Algorithms in Large-Scale Power Grids [0.0]
We apply several machine learning algorithms to the problem of anomaly detection in operational data for large-scale, high-voltage electric power grids. We observe important differences in the performance of the algorithms. We show that unsupervised learning algorithms work remarkably well and that their predictions are robust against simultaneous, concurrent anomalies.
arXiv Detail & Related papers (2026-02-11T14:17:43Z) - Understanding Dynamic Compute Allocation in Recurrent Transformers [23.760167933957707]
Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. Prior work is primarily evaluated on natural-token benchmarks using task-level metrics. We introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty.
arXiv Detail & Related papers (2026-02-09T16:27:52Z) - Generalization Bounds for Transformer Channel Decoders [61.55280736553095]
This paper studies the generalization performance of ECCT from a learning-theoretic perspective. To the best of our knowledge, this work provides the first theoretical generalization guarantees for this class of decoders.
arXiv Detail & Related papers (2026-01-11T15:56:37Z) - The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner [71.41162392872393]
This paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of large language models. TAIL synthesizes chain-of-thought (CoT) data that imitates the execution process of a Turing Machine by computer programs. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks.
arXiv Detail & Related papers (2025-07-17T17:50:07Z) - Principled Understanding of Generalization for Generative Transformer Models in Arithmetic Reasoning Tasks [5.522116934552708]
Transformer-based models excel in various tasks but their generalization capabilities, especially in arithmetic reasoning, remain incompletely understood. This paper develops a unified theoretical framework for understanding the generalization behaviors of transformers in arithmetic tasks.
arXiv Detail & Related papers (2024-07-25T11:35:22Z) - A General Framework for Learning from Weak Supervision [93.89870459388185]
This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm.
Central to GLWS is an Expectation-Maximization (EM) formulation, adeptly accommodating various weak supervision sources.
We also present an advanced algorithm that significantly simplifies the EM computational demands.
arXiv Detail & Related papers (2024-02-02T21:48:50Z) - What Algorithms can Transformers Learn? A Study in Length Generalization [23.970598914609916]
We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks.
Specifically, we leverage RASP -- a programming language designed for the computational model of a Transformer.
Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
arXiv Detail & Related papers (2023-10-24T17:43:29Z) - Adaptivity and Modularity for Efficient Generalization Over Task Complexity [42.748898521364914]
We investigate how the use of a mechanism for adaptive and modular computation in transformers facilitates the learning of tasks that demand generalization over the number of sequential steps.
We propose a transformer-based architecture called Hyper-UT, which combines dynamic function generation from hyper networks with adaptive depth from Universal Transformers.
arXiv Detail & Related papers (2023-10-13T05:29:09Z) - Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z) - Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks [6.525090891505941]
We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions.
We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition.
These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
arXiv Detail & Related papers (2022-10-02T00:46:36Z) - Combining Varied Learners for Binary Classification using Stacked Generalization [3.1871776847712523]
This paper performs binary classification using Stacked Generalization on high dimensional Polycystic Ovary Syndrome dataset.
The paper reports various metrics and also points out a subtle issue with the Receiver Operating Characteristic Curve that was shown to be incorrect.
arXiv Detail & Related papers (2022-02-17T21:47:52Z) - Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded by the 'complexity' of the fractal structure that underlies its generalization measure. We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - Quantum-Inspired Algorithms from Randomized Numerical Linear Algebra [53.46106569419296]
We create classical (non-quantum) dynamic data structures supporting queries for recommender systems and least-squares regression.
We argue that the previous quantum-inspired algorithms for these problems are doing leverage or ridge-leverage score sampling in disguise.
arXiv Detail & Related papers (2020-11-09T01:13:07Z) - Total Deep Variation: A Stable Regularizer for Inverse Problems [71.90933869570914]
We introduce the data-driven general-purpose total deep variation regularizer.
In its core, a convolutional neural network extracts local features on multiple scales and in successive blocks.
We achieve state-of-the-art results for numerous imaging tasks.
arXiv Detail & Related papers (2020-06-15T21:54:15Z) - A Brief Look at Generalization in Visual Meta-Reinforcement Learning [56.50123642237106]
We evaluate the generalization performance of meta-reinforcement learning algorithms.
We find that these algorithms can display strong overfitting when they are evaluated on challenging tasks.
arXiv Detail & Related papers (2020-06-12T15:17:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.