Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context
- URL: http://arxiv.org/abs/2312.06528v6
- Date: Tue, 4 Jun 2024 00:20:05 GMT
- Title: Transformers Implement Functional Gradient Descent to Learn Non-Linear Functions In Context
- Authors: Xiang Cheng, Yuxin Chen, Suvrit Sra
- Abstract summary: We show that (non-linear) Transformers naturally learn to implement gradient descent in function space.
We also show that the optimal choice of non-linear activation depends in a natural way on the class of functions that need to be learned.
- Score: 44.949726166566236
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many neural network architectures are known to be Turing Complete and can thus, in principle, implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms under simple parameter configurations. This paper provides theoretical and empirical evidence that (non-linear) Transformers naturally learn to implement gradient descent in function space, which in turn enables them to learn non-linear functions in context. Our results apply to a broad class of combinations of non-linear architectures and non-linear in-context learning tasks. Additionally, we show that the optimal choice of non-linear activation depends in a natural way on the class of functions that need to be learned.
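The correspondence at the heart of the paper can be made concrete with a squared loss: one step of functional gradient descent in an RKHS updates a query prediction by a kernel-weighted sum of residuals over the in-context examples, which has the same form as a (kernel) attention read-out. The snippet below is a minimal numerical sketch of that correspondence, assuming an RBF kernel, a squared loss, and a synthetic sine target; it illustrates the mechanism and is not the parameter construction analyzed in the paper.

```python
import numpy as np

# Minimal sketch (not the paper's construction): one functional gradient
# descent step on the squared loss in an RKHS reads exactly like a
# kernel-attention layer: weights = k(query, context keys),
# values = current residuals on the in-context examples.

rng = np.random.default_rng(0)

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-gamma * ||a - b||^2), computed pairwise
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# In-context examples (x_i, y_i) from a non-linear target, plus one query point.
n, d = 32, 2
X = rng.normal(size=(n, d))
y = np.sin(X @ np.array([1.0, -2.0]))          # assumed non-linear target
x_query = rng.normal(size=(1, d))

eta = 0.5 / n                                   # step size
f_ctx = np.zeros(n)                             # current f_t at the context points
f_q = np.zeros(1)                               # current f_t at the query

K_cc = rbf_kernel(X, X)                         # kernel among context points
K_qc = rbf_kernel(x_query, X)                   # kernel: query vs. context

for t in range(20):
    residual = y - f_ctx                        # "values" carried by attention
    # One functional-GD step == one kernel-attention update of the predictions:
    f_ctx = f_ctx + eta * K_cc @ residual
    f_q = f_q + eta * K_qc @ residual

print("query prediction after 20 functional-GD steps:", f_q)
```

Under this reading, stacking more such attention layers corresponds to taking more functional-gradient steps, and the choice of attention non-linearity plays the role of the kernel.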
Related papers
- Learning Linear Attention in Polynomial Time [115.68795790532289]
We provide the first results on learnability of single-layer Transformers with linear attention.
We show that linear attention may be viewed as a linear predictor in a suitably defined RKHS.
We show how to efficiently identify training datasets for which every empirical risk minimizer is equivalent to the linear Transformer.
arXiv Detail & Related papers (2024-10-14T02:41:01Z)
- Operator Learning Using Random Features: A Tool for Scientific Computing [3.745868534225104]
Supervised operator learning centers on the use of training data to estimate maps between infinite-dimensional spaces.
This paper introduces the function-valued random features method.
It leads to a supervised operator learning architecture that is practical for nonlinear problems.
arXiv Detail & Related papers (2024-08-12T23:10:39Z)
- How Well Can Transformers Emulate In-context Newton's Method? [46.08521978754298]
We study whether Transformers can perform higher-order optimization methods, beyond the case of linear regression.
We demonstrate that even linear attention-only Transformers can implement a single step of Newton's iteration for matrix inversion with merely two layers (see the sketch below).
arXiv Detail & Related papers (2024-03-05T18:20:10Z)
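For context on the entry above: "a single step of Newton's iteration for matrix inversion" refers to the Newton-Schulz update X_{k+1} = X_k(2I - A X_k), which roughly doubles the number of correct digits per step once it converges. The snippet below is a minimal sketch of that numerical iteration only, assuming a well-conditioned matrix and the standard scaled-transpose initialization; it is not the two-layer Transformer construction from the paper.

```python
import numpy as np

# Minimal sketch of the Newton (Newton-Schulz) iteration for matrix inversion,
# X_{k+1} = X_k (2I - A X_k); this is the numerical routine the entry refers
# to, not the Transformer construction itself.

rng = np.random.default_rng(1)
n = 4
A = rng.normal(size=(n, n)) + n * np.eye(n)     # assumed well-conditioned matrix

# Initialization X_0 = A^T / (||A||_1 * ||A||_inf) guarantees convergence.
X = A.T / (np.linalg.norm(A, 1) * np.linalg.norm(A, np.inf))

for k in range(20):
    X = X @ (2.0 * np.eye(n) - A @ X)           # one Newton step

print("||A X - I|| =", np.linalg.norm(A @ X - np.eye(n)))
```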
- Understanding In-Context Learning in Transformers and LLMs by Learning to Learn Discrete Functions [32.59746882017483]
We show that Transformers can learn to implement two distinct algorithms to solve a single task.
We also show that extant Large Language Models (LLMs) can compete with nearest-neighbor baselines on prediction tasks.
arXiv Detail & Related papers (2023-10-04T17:57:33Z)
- Pessimistic Nonlinear Least-Squares Value Iteration for Offline Reinforcement Learning [53.97335841137496]
We propose an oracle-efficient algorithm, dubbed Pessimistic Nonlinear Least-Squares Value Iteration (PNLSVI), for offline RL with non-linear function approximation.
Our algorithm enjoys a regret bound that has a tight dependency on the function class complexity and achieves minimax optimal instance-dependent regret when specialized to linear function approximation.
arXiv Detail & Related papers (2023-10-02T17:42:01Z)
- Transformers as Statisticians: Provable In-Context Learning with In-Context Algorithm Selection [88.23337313766353]
This work first provides a comprehensive statistical theory for transformers to perform ICL.
We show that transformers can implement a broad class of standard machine learning algorithms in context.
A single transformer can adaptively select different base ICL algorithms.
arXiv Detail & Related papers (2023-06-07T17:59:31Z)
- A Tutorial on Neural Networks and Gradient-free Training [0.0]
This paper presents a compact, matrix-based representation of neural networks in a self-contained tutorial fashion.
Neural networks are mathematical nonlinear functions constructed by composing several vector-valued functions (see the sketch below).
arXiv Detail & Related papers (2022-11-26T15:33:11Z)
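As a concrete instance of the "composition of vector-valued functions" view in the tutorial entry above, the snippet below writes a two-layer network directly in matrix form; the layer sizes and the tanh non-linearity are arbitrary illustrative choices, not taken from the tutorial.

```python
import numpy as np

# A neural network as a composition of vector-valued functions:
# f(x) = W2 @ tanh(W1 @ x + b1) + b2, written directly in matrix form.

rng = np.random.default_rng(2)
d_in, d_hidden, d_out = 3, 8, 2

W1, b1 = rng.normal(size=(d_hidden, d_in)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_out, d_hidden)), np.zeros(d_out)

def layer(W, b, activation):
    # Each layer is itself a vector-valued function R^m -> R^n.
    return lambda v: activation(W @ v + b)

f = lambda x: layer(W2, b2, lambda z: z)(layer(W1, b1, np.tanh)(x))

x = rng.normal(size=d_in)
print("f(x) =", f(x))
```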
- Stabilizing Q-learning with Linear Architectures for Provably Efficient Learning [53.17258888552998]
This work proposes an exploration variant of the basic $Q$-learning protocol with linear function approximation.
We show that the performance of the algorithm degrades very gracefully under a novel and more permissive notion of approximation error.
arXiv Detail & Related papers (2022-06-01T23:26:51Z)
- Exploring Linear Feature Disentanglement For Neural Networks [63.20827189693117]
Non-linear activation functions, e.g., Sigmoid, ReLU, and Tanh, have achieved great success in neural networks (NNs).
Due to the complex non-linear characteristics of samples, the objective of those activation functions is to project samples from their original feature space to a linearly separable feature space.
This phenomenon ignites our interest in exploring whether all features need to be transformed by all non-linear functions in current typical NNs.
arXiv Detail & Related papers (2022-03-22T13:09:17Z)
- Parametric Rectified Power Sigmoid Units: Learning Nonlinear Neural Transfer Analytical Forms [1.6975704972827304]
The paper proposes representation functionals in a dual paradigm where learning jointly concerns linear convolutional weights and parametric forms of nonlinear activation functions.
The nonlinear forms proposed for performing the functional representation are associated with a new class of parametric neural transfer functions called rectified power sigmoid units.
The performance achieved by jointly learning the convolutional and rectified power sigmoid parameters is shown to be outstanding in both shallow and deep learning frameworks.
arXiv Detail & Related papers (2021-01-25T08:25:22Z)