Counting and Algorithmic Generalization with Transformers
- URL: http://arxiv.org/abs/2310.08661v2
- Date: Fri, 12 Jan 2024 20:26:31 GMT
- Title: Counting and Algorithmic Generalization with Transformers
- Authors: Simon Ouellette, Rolf Pfister, Hansueli Jud
- Abstract summary: We show that standard Transformers are based on architectural decisions that hinder out-of-distribution performance.
We demonstrate that a modified transformer can exhibit good algorithmic generalization performance on counting.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Algorithmic generalization in machine learning refers to the ability to learn
the underlying algorithm that generates data in a way that generalizes
out-of-distribution. This is generally considered a difficult task for most
machine learning algorithms. Here, we analyze algorithmic generalization when
counting is required, either implicitly or explicitly. We show that standard
Transformers are based on architectural decisions that hinder
out-of-distribution performance for such tasks. In particular, we discuss the
consequences of using layer normalization and of normalizing the attention
weights via softmax. By ablating the problematic operations, we demonstrate
that a modified transformer can exhibit good algorithmic generalization
performance on counting while using a very lightweight architecture.
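
To make the ablation concrete, below is a minimal PyTorch sketch of a single attention head with the softmax normalization removed and no layer normalization applied afterwards. Because the values are summed with unnormalized weights, the output can scale with the number of matching tokens, which is exactly what an explicit count requires. This is an illustration of the idea under stated assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class UnnormalizedAttention(nn.Module):
    """Single-head attention with softmax ablated: raw dot-product scores
    weight the values directly, so the summed output can grow with the
    number of matching tokens instead of being squashed to a weighted
    average. A sketch of the ablation discussed above, not the paper's
    exact architecture."""

    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (x.shape[-1] ** 0.5)
        return scores @ v  # no softmax, and no LayerNorm afterwards
```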
Related papers
- A General Framework for Learning from Weak Supervision [93.89870459388185]
This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm.
Central to GLWS is an Expectation-Maximization (EM) formulation, adeptly accommodating various weak supervision sources (the E-step/M-step alternation is sketched after this entry).
We also present an advanced algorithm that significantly simplifies the EM computational demands.
arXiv Detail & Related papers (2024-02-02T21:48:50Z)
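
As a reminder of the E-step/M-step alternation that GLWS builds on, here is a generic EM loop for a two-component, equal-variance Gaussian mixture with the unknown labels as latent variables. It is the standard pattern the framework accommodates, not the GLWS algorithm itself.

```python
import numpy as np

def em_two_means(x: np.ndarray, iters: int = 50) -> np.ndarray:
    """Generic EM for a two-component, equal-variance Gaussian mixture:
    the unknown labels are treated as latent variables."""
    mu = np.array([x.min(), x.max()])                # crude initialization
    for _ in range(iters):
        # E-step: soft responsibilities of each component for each point
        d = np.stack([(x - m) ** 2 for m in mu])
        r = np.exp(-d) / np.exp(-d).sum(axis=0)
        # M-step: re-estimate the means from the soft assignments
        mu = (r * x).sum(axis=1) / r.sum(axis=1)
    return mu
```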
- What Algorithms can Transformers Learn? A Study in Length Generalization [23.970598914609916]
We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks.
Specifically, we leverage RASP, a programming language designed for the computational model of a Transformer (a toy counting program in this style is sketched after this entry).
Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
arXiv Detail & Related papers (2023-10-24T17:43:29Z)
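
In the spirit of that perspective, a counting program can be expressed with two primitives: select (an attention pattern) and aggregate (a mean over the selected values). The toy Python below uses simplified semantics, not the actual RASP language: it counts occurrences of 'a' by averaging an indicator under full attention and rescaling by the sequence length.

```python
def select(keys, queries, predicate):
    # attention pattern: which key positions each query position attends to
    return [[predicate(k, q) for k in keys] for q in queries]

def aggregate(sel, values):
    # mean of the selected values at each query position (RASP-style)
    return [sum(v for v, s in zip(values, row) if s) / max(1, sum(row))
            for row in sel]

tokens = list("abcab")
attend_all = select(tokens, tokens, lambda k, q: True)   # uniform attention
is_a = [1.0 if t == "a" else 0.0 for t in tokens]
frac_a = aggregate(attend_all, is_a)     # fraction of 'a' at each position
count_a = frac_a[0] * len(tokens)        # 0.4 * 5 -> 2.0
```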
- Adaptivity and Modularity for Efficient Generalization Over Task Complexity [42.748898521364914]
We investigate how the use of a mechanism for adaptive and modular computation in transformers facilitates the learning of tasks that demand generalization over the number of sequential steps.
We propose a transformer-based architecture called Hyper-UT, which combines dynamic function generation from hypernetworks with adaptive depth from Universal Transformers (the adaptive-depth loop is sketched after this entry).
arXiv Detail & Related papers (2023-10-13T05:29:09Z)
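
Here is a schematic of the adaptive-depth half of that design, assuming an ACT-style halting score on top of a single shared (Universal-Transformer-style) layer. The names and the halting rule are simplifications, and the hypernetwork half is omitted; this is not the Hyper-UT implementation.

```python
import torch
import torch.nn as nn

class AdaptiveDepthBlock(nn.Module):
    """Universal-Transformer-style weight sharing plus a learned halting
    score, so easy inputs take fewer refinement steps than hard ones."""

    def __init__(self, d_model: int, max_steps: int = 8,
                 threshold: float = 0.99):
        super().__init__()
        self.step_fn = nn.TransformerEncoderLayer(
            d_model, nhead=4, batch_first=True)    # one shared layer
        self.halt = nn.Linear(d_model, 1)
        self.max_steps, self.threshold = max_steps, threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); accumulate halting score per token
        halted = torch.zeros(x.shape[:2], device=x.device)
        for _ in range(self.max_steps):            # same weights every step
            x = self.step_fn(x)
            halted = halted + torch.sigmoid(self.halt(x)).squeeze(-1)
            if bool((halted > self.threshold).all()):
                break                              # all tokens are confident
        return x
```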
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks [6.525090891505941]
We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions.
We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition.
These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
arXiv Detail & Related papers (2022-10-02T00:46:36Z)
- Combining Varied Learners for Binary Classification using Stacked Generalization [3.1871776847712523]
This paper performs binary classification using Stacked Generalization (sketched after this entry) on a high-dimensional Polycystic Ovary Syndrome dataset.
The paper reports various metrics and also points out a subtle transgression found with the Receiver Operating Characteristic curve that was proved to be incorrect.
arXiv Detail & Related papers (2022-02-17T21:47:52Z)
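
A minimal scikit-learn version of that setup follows, assuming synthetic data in place of the PCOS dataset; the base and meta learners are illustrative choices, not necessarily the paper's.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Stacked generalization: varied base learners feed a meta-learner
# that learns how to combine their predictions.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),  # meta-learner on base outputs
)
stack.fit(X_tr, y_tr)
print("held-out accuracy:", stack.score(X_te, y_te))
```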
- Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded by the complexity of the fractal structure that underlies its generalization measure (a schematic bound of this shape follows this entry).
We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z)
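
Schematically, bounds of this kind replace a classical complexity term with a fractal (e.g., Hausdorff-type) dimension of the set of parameters visited by the optimizer. The display below shows the general shape of such results, not the paper's exact statement:

```latex
\sup_{w \in \mathcal{W}}
  \bigl| R(w) - \hat{R}_n(w) \bigr|
  \;\lesssim\;
  \sqrt{\frac{\dim_H(\mathcal{W}) + \log(1/\delta)}{n}}
```

where $R$ and $\hat{R}_n$ are the population and empirical risks over $n$ samples, $\mathcal{W}$ is the optimizer's trajectory set, and the bound holds with probability at least $1-\delta$.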
- Quantum-Inspired Algorithms from Randomized Numerical Linear Algebra [53.46106569419296]
We create classical (non-quantum) dynamic data structures supporting queries for recommender systems and least-squares regression.
We argue that the previous quantum-inspired algorithms for these problems are doing leverage or ridge-leverage score sampling in disguise (leverage scores are sketched after this entry).
arXiv Detail & Related papers (2020-11-09T01:13:07Z)
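
For reference, the leverage score of row $i$ of a matrix $A$ is the $i$-th diagonal entry of the hat matrix $A(A^\top A)^{+}A^\top$, measuring how much that row determines the least-squares fit. The NumPy sketch below computes the scores via a thin QR factorization and samples rows proportionally, the classical technique the authors have in mind; names and sizes are illustrative.

```python
import numpy as np

def leverage_scores(A: np.ndarray) -> np.ndarray:
    """Leverage scores of the rows of a tall matrix A."""
    Q, _ = np.linalg.qr(A)           # thin QR: columns of Q span col(A)
    return (Q ** 2).sum(axis=1)      # diagonal of Q @ Q.T (the hat matrix)

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
p = leverage_scores(A)
rows = rng.choice(len(A), size=20, p=p / p.sum())  # leverage-score sampling
```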
- Total Deep Variation: A Stable Regularizer for Inverse Problems [71.90933869570914]
We introduce the data-driven general-purpose total deep variation regularizer.
At its core, a convolutional neural network extracts local features on multiple scales and in successive blocks.
We achieve state-of-the-art results for numerous imaging tasks (its use inside a reconstruction loop is sketched after this entry).
arXiv Detail & Related papers (2020-06-15T21:54:15Z)
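
Here is a schematic of how such a learned regularizer is used in an inverse problem: minimize data fidelity plus the regularizer by gradient descent. `forward_op` and `regularizer` are hypothetical stand-ins (the latter for the paper's multi-scale CNN); this sketches the variational pattern, not the paper's optimization scheme.

```python
import torch

def reconstruct(y, forward_op, regularizer,
                lam=0.1, steps=200, lr=1e-2):
    """Gradient-descent reconstruction: data fidelity + learned regularizer."""
    x = torch.zeros_like(y, requires_grad=True)   # start from a blank image
    opt = torch.optim.SGD([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        # 0.5 * ||forward_op(x) - y||^2  +  lam * R(x)
        loss = 0.5 * ((forward_op(x) - y) ** 2).sum() + lam * regularizer(x)
        loss.backward()
        opt.step()
    return x.detach()
```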
- A Brief Look at Generalization in Visual Meta-Reinforcement Learning [56.50123642237106]
We evaluate the generalization performance of meta-reinforcement learning algorithms.
We find that these algorithms can display strong overfitting when they are evaluated on challenging tasks.
arXiv Detail & Related papers (2020-06-12T15:17:17Z)