Transformer In-Context Learning for Categorical Data
- URL: http://arxiv.org/abs/2405.17248v1
- Date: Mon, 27 May 2024 15:03:21 GMT
- Title: Transformer In-Context Learning for Categorical Data
- Authors: Aaron T. Wang, Ricardo Henao, Lawrence Carin,
- Abstract summary: We extend research on understanding Transformers through the lens of in-context learning with functional data by considering categorical outcomes, nonlinear underlying models, and nonlinear attention.
We present what is believed to be the first real-world demonstration of this few-shot-learning methodology, using the ImageNet dataset.
- Score: 51.23121284812406
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent research has sought to understand Transformers through the lens of in-context learning with functional data. We extend that line of work with the goal of moving closer to language models, considering categorical outcomes, nonlinear underlying models, and nonlinear attention. The contextual data are of the form $\textsf{C}=(x_1,c_1,\dots,x_N,c_{N})$ where each $c_i\in\{0,\dots,C-1\}$ is drawn from a categorical distribution that depends on covariates $x_i\in\mathbb{R}^d$. Contextual outcomes in the $m$th set of contextual data, $\textsf{C}_m$, are modeled in terms of latent function $f_m(x)\in\textsf{F}$, where $\textsf{F}$ is a functional class with $(C-1)$-dimensional vector output. The probability of observing class $c\in\{0,\dots,C-1\}$ is modeled in terms of the output components of $f_m(x)$ via the softmax. The Transformer parameters may be trained with $M$ contextual examples, $\{\textsf{C}_m\}_{m=1,M}$, and the trained model is then applied to new contextual data $\textsf{C}_{M+1}$ for new $f_{M+1}(x)\in\textsf{F}$. The goal is for the Transformer to constitute the probability of each category $c\in\{0,\dots,C-1\}$ for a new query $x_{N_{M+1}+1}$. We assume each component of $f_m(x)$ resides in a reproducing kernel Hilbert space (RKHS), specifying $\textsf{F}$. Analysis and an extensive set of experiments suggest that on its forward pass the Transformer (with attention defined by the RKHS kernel) implements a form of gradient descent of the underlying function, connected to the latent vector function associated with the softmax. We present what is believed to be the first real-world demonstration of this few-shot-learning methodology, using the ImageNet dataset.
Related papers
- Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(boldsymbolx) = textstylesigma_*left(langleboldsymbolx,boldsymbolthetarangleright)$ under isotropic Gaussian data.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ of arbitrary link function with a sample and runtime complexity of $n asymp T asymp C(q) cdot d
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Learned Nonlinear Predictor for Critically Sampled 3D Point Cloud
Attribute Compression [24.001318485207207]
We study 3D point cloud compression via a decoder approach.
In this paper, we study predicting $f_l*$ at level $l+1$ given $f_l*$ $l$ and encoding of $G_l*$ for the $p=1$ case.
arXiv Detail & Related papers (2023-11-22T17:26:54Z) - Uncovering hidden geometry in Transformers via disentangling position
and context [0.6118897979046375]
We present a simple yet informative decomposition of hidden states (or embeddings) of trained transformers into interpretable components.
For popular transformer architectures and diverse text datasets, empirically we find pervasive mathematical structure.
arXiv Detail & Related papers (2023-10-07T15:50:26Z) - An Over-parameterized Exponential Regression [18.57735939471469]
Recent developments in the field of Large Language Models (LLMs) have sparked interest in the use of exponential activation functions.
We define the neural function $F: mathbbRd times m times mathbbRd times mathbbRd times mathbbRd times mathbbRd times mathbbRd times mathbbRd times mathbbRd
arXiv Detail & Related papers (2023-03-29T07:29:07Z) - Statistical Learning under Heterogeneous Distribution Shift [71.8393170225794]
Ground-truth predictor is additive $mathbbE[mathbfz mid mathbfx,mathbfy] = f_star(mathbfx) +g_star(mathbfy)$.
arXiv Detail & Related papers (2023-02-27T16:34:21Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step
Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $boldsymbolW$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - Random matrices in service of ML footprint: ternary random features with
no performance loss [55.30329197651178]
We show that the eigenspectrum of $bf K$ is independent of the distribution of the i.i.d. entries of $bf w$.
We propose a novel random technique, called Ternary Random Feature (TRF)
The computation of the proposed random features requires no multiplication and a factor of $b$ less bits for storage compared to classical random features.
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Mediated Uncoupled Learning: Learning Functions without Direct
Input-output Correspondences [80.95776331769899]
We consider the task of predicting $Y$ from $X$ when we have no paired data of them.
A naive approach is to predict $U$ from $X$ using $S_X$ and then $Y$ from $U$ using $S_Y$.
We propose a new method that avoids predicting $U$ but directly learns $Y = f(X)$ by training $f(X)$ with $S_X$ to predict $h(U)$.
arXiv Detail & Related papers (2021-07-16T22:13:29Z) - Faster Uncertainty Quantification for Inverse Problems with Conditional
Normalizing Flows [0.9176056742068814]
In inverse problems, we often have data consisting of paired samples $(x,y)sim p_X,Y(x,y)$ where $y$ are partial observations of a physical system.
We propose a two-step scheme, which makes use of normalizing flows and joint data to train a conditional generator $q_theta(x|y)$.
arXiv Detail & Related papers (2020-07-15T20:36:30Z) - How isotropic kernels perform on simple invariants [0.5729426778193397]
We investigate how the training curve of isotropic kernel methods depends on the symmetry of the task to be learned.
We show that for large bandwidth, $beta = fracd-1+xi3d-3+xi$, where $xiin (0,2)$ is the exponent characterizing the stripe of the kernel at the origin.
arXiv Detail & Related papers (2020-06-17T09:59:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.