Counting and Algorithmic Generalization with Transformers
- URL: http://arxiv.org/abs/2310.08661v2
- Date: Fri, 12 Jan 2024 20:26:31 GMT
- Title: Counting and Algorithmic Generalization with Transformers
- Authors: Simon Ouellette, Rolf Pfister, Hansueli Jud
- Abstract summary: We show that standard Transformers are based on architectural decisions that hinder out-of-distribution performance.
We demonstrate that a modified transformer can exhibit good algorithmic generalization performance on counting.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Algorithmic generalization in machine learning refers to the ability to learn
the underlying algorithm that generates data in a way that generalizes
out-of-distribution. This is generally considered a difficult task for most
machine learning algorithms. Here, we analyze algorithmic generalization when
counting is required, either implicitly or explicitly. We show that standard
Transformers are based on architectural decisions that hinder
out-of-distribution performance for such tasks. In particular, we discuss the
consequences of using layer normalization and of normalizing the attention
weights via softmax. By ablating the problematic operations, we
demonstrate that a modified transformer can exhibit good algorithmic
generalization performance on counting while using a very lightweight
architecture.
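The softmax issue described in the abstract can be illustrated with a small numerical sketch (a hypothetical toy example, not the authors' code): because softmax forces attention weights to sum to 1, the attended output is an average of the value vectors rather than a sum, so its magnitude cannot grow with the number of matching tokens — which is exactly what counting requires out-of-distribution.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax: weights sum to 1 by construction
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(scores, values, normalize=True):
    """One attention head for a single query: weighted sum of values."""
    w = softmax(scores) if normalize else np.ones_like(scores)
    return float(w @ values)

for n in [5, 50]:
    scores = np.zeros(n)   # all n tokens match the query equally
    values = np.ones(n)    # each matching token contributes value 1.0
    # Softmax-normalized attention averages: output is 1.0 regardless of n
    print(n, attend(scores, values, normalize=True))
    # Unnormalized attention sums: output grows with n, so it can count
    print(n, attend(scores, values, normalize=False))
```

Under this toy setup, the normalized output stays constant as the sequence grows, while the unnormalized output scales with the token count — a minimal sketch of why removing the normalization can help counting tasks.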
Related papers
- Anomaly Detection with Machine Learning Algorithms in Large-Scale Power Grids [0.0]
We apply several machine learning algorithms to the problem of anomaly detection in operational data for large-scale, high-voltage electric power grids. We observe important differences in the performance of the algorithms. We show that unsupervised learning algorithms work remarkably well and that their predictions are robust against simultaneous, concurrent anomalies.
arXiv Detail & Related papers (2026-02-11T14:17:43Z) - Understanding Dynamic Compute Allocation in Recurrent Transformers [23.760167933957707]
Token-level adaptive computation seeks to reduce inference cost by allocating more computation to harder tokens and less to easier ones. Prior work is primarily evaluated on natural-token benchmarks using task-level metrics. We introduce a complexity-controlled evaluation paradigm using algorithmic and synthetic language tasks with parameterized difficulty.
arXiv Detail & Related papers (2026-02-09T16:27:52Z) - Generalization Bounds for Transformer Channel Decoders [61.55280736553095]
This paper studies the generalization performance of ECCT from a learning-theoretic perspective. To the best of our knowledge, this work provides the first theoretical generalization guarantees for this class of decoders.
arXiv Detail & Related papers (2026-01-11T15:56:37Z) - The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner [71.41162392872393]
This paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of large language models. TAIL synthesizes chain-of-thought (CoT) data that imitates the execution process of a Turing Machine by computer programs. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks.
arXiv Detail & Related papers (2025-07-17T17:50:07Z) - Principled Understanding of Generalization for Generative Transformer Models in Arithmetic Reasoning Tasks [5.522116934552708]
Transformer-based models excel in various tasks but their generalization capabilities, especially in arithmetic reasoning, remain incompletely understood. This paper develops a unified theoretical framework for understanding the generalization behaviors of transformers in arithmetic tasks.
arXiv Detail & Related papers (2024-07-25T11:35:22Z) - A General Framework for Learning from Weak Supervision [93.89870459388185]
This paper introduces a general framework for learning from weak supervision (GLWS) with a novel algorithm.
Central to GLWS is an Expectation-Maximization (EM) formulation, adeptly accommodating various weak supervision sources.
We also present an advanced algorithm that significantly simplifies the EM computational demands.
arXiv Detail & Related papers (2024-02-02T21:48:50Z) - What Algorithms can Transformers Learn? A Study in Length Generalization [23.970598914609916]
We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks.
Specifically, we leverage RASP -- a programming language designed for the computational model of a Transformer.
Our work provides a novel perspective on the mechanisms of compositional generalization and the algorithmic capabilities of Transformers.
arXiv Detail & Related papers (2023-10-24T17:43:29Z) - Adaptivity and Modularity for Efficient Generalization Over Task Complexity [42.748898521364914]
We investigate how the use of a mechanism for adaptive and modular computation in transformers facilitates the learning of tasks that demand generalization over the number of sequential steps.
We propose a transformer-based architecture called Hyper-UT, which combines dynamic function generation from hyper networks with adaptive depth from Universal Transformers.
arXiv Detail & Related papers (2023-10-13T05:29:09Z) - Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z) - Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks [6.525090891505941]
We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions.
We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition.
These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
arXiv Detail & Related papers (2022-10-02T00:46:36Z) - Combining Varied Learners for Binary Classification using Stacked Generalization [3.1871776847712523]
This paper performs binary classification using Stacked Generalization on high dimensional Polycystic Ovary Syndrome dataset.
The paper reports various metrics and also points out a subtle issue with the Receiver Operating Characteristic Curve that was shown to be incorrect.
arXiv Detail & Related papers (2022-02-17T21:47:52Z) - Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms [71.62575565990502]
We prove that the generalization error of an optimization algorithm can be bounded by the 'complexity' of the fractal structure that underlies its generalization measure. We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms.
arXiv Detail & Related papers (2021-06-09T08:05:36Z) - Quantum-Inspired Algorithms from Randomized Numerical Linear Algebra [53.46106569419296]
We create classical (non-quantum) dynamic data structures supporting queries for recommender systems and least-squares regression.
We argue that the previous quantum-inspired algorithms for these problems are doing leverage or ridge-leverage score sampling in disguise.
arXiv Detail & Related papers (2020-11-09T01:13:07Z) - Total Deep Variation: A Stable Regularizer for Inverse Problems [71.90933869570914]
We introduce the data-driven general-purpose total deep variation regularizer.
In its core, a convolutional neural network extracts local features on multiple scales and in successive blocks.
We achieve state-of-the-art results for numerous imaging tasks.
arXiv Detail & Related papers (2020-06-15T21:54:15Z) - A Brief Look at Generalization in Visual Meta-Reinforcement Learning [56.50123642237106]
We evaluate the generalization performance of meta-reinforcement learning algorithms.
We find that these algorithms can display strong overfitting when they are evaluated on challenging tasks.
arXiv Detail & Related papers (2020-06-12T15:17:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.