On Understanding Attention-Based In-Context Learning for Categorical Data
- URL: http://arxiv.org/abs/2405.17248v2
- Date: Wed, 07 May 2025 00:13:15 GMT
- Title: On Understanding Attention-Based In-Context Learning for Categorical Data
- Authors: Aaron T. Wang, William Convertino, Xiang Cheng, Ricardo Henao, Lawrence Carin
- Abstract summary: We develop a network composed of attention blocks, with each block employing a self-attention layer followed by a cross-attention layer, with associated skip connections. This model can exactly perform multi-step functional GD for in-context inference with categorical observations.
- Score: 49.40350941996942
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In-context learning based on attention models is examined for data with categorical outcomes, with inference in such models viewed from the perspective of functional gradient descent (GD). We develop a network composed of attention blocks, with each block employing a self-attention layer followed by a cross-attention layer, with associated skip connections. This model can exactly perform multi-step functional GD inference for in-context inference with categorical observations. We perform a theoretical analysis of this setup, generalizing many prior assumptions in this line of work, including the class of attention mechanisms for which it is appropriate. We demonstrate the framework empirically on synthetic data, image classification and language generation.
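The block structure described in the abstract can be sketched as follows. This is a minimal illustrative PyTorch version under assumed dimensions and module names, not the authors' code: each block applies self-attention over the in-context examples and cross-attention from the query, each with a skip connection, and stacking blocks mimics multiple functional-GD steps.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """One block of the sketched architecture: self-attention over the in-context
    examples, then cross-attention from the query token to those examples,
    each wrapped in a skip connection."""
    def __init__(self, dim: int, num_heads: int = 1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, context: torch.Tensor, query: torch.Tensor):
        # context: (batch, n_examples, dim) -- embedded in-context (x, y) pairs
        # query:   (batch, 1, dim)          -- the point to be labeled
        ctx_update, _ = self.self_attn(context, context, context)
        context = context + ctx_update                        # skip connection
        qry_update, _ = self.cross_attn(query, context, context)
        query = query + qry_update                            # skip connection
        return context, query

# Stacking blocks mimics multiple functional-GD steps; a softmax head maps the
# final query representation to categorical class probabilities.
blocks = nn.ModuleList([AttentionBlock(dim=32) for _ in range(4)])
ctx, qry = torch.randn(8, 16, 32), torch.randn(8, 1, 32)      # placeholder embeddings
for blk in blocks:
    ctx, qry = blk(ctx, qry)
probs = nn.Linear(32, 10)(qry).softmax(dim=-1)                # 10 hypothetical classes
```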
Related papers
- Concept-Based Mechanistic Interpretability Using Structured Knowledge Graphs [3.429783703166407]
Our framework enables a global dissection of model behavior by analyzing how high-level semantic attributes emerge, interact, and propagate through internal model components.
A key innovation is our visualization platform that we named BAGEL, which presents these insights in a structured knowledge graph.
Our framework is model-agnostic, scalable, and contributes to a deeper understanding of how deep learning models generalize (or fail to) in the presence of dataset biases.
arXiv Detail & Related papers (2025-07-08T09:30:20Z) - Next-token pretraining implies in-context learning [0.0]
We show how models adapt to context when trained on token sequences, especially from non-ergodic sources.
Our information-theoretic framework precisely predicts these in-distribution ICL dynamics.
We also show that a model's in-context performance on any task is mathematically coupled to the ensemble of tasks seen in pretraining.
arXiv Detail & Related papers (2025-05-23T21:00:18Z) - Model-agnostic basis functions for the 2-point correlation function of dark matter in linear theory [0.0]
We find a basis $\mathcal{B}$ that describes $\xi_{\rm lin}(r)$ near the baryon acoustic oscillation feature in a wide class of cosmological models.
Using our basis functions in model-agnostic BAO analyses can potentially lead to significant statistical gains.
arXiv Detail & Related papers (2024-10-28T18:00:01Z) - Guarantees for Nonlinear Representation Learning: Non-identical Covariates, Dependent Data, Fewer Samples [24.45016514352055]
We study the sample-complexity of learning $T+1$ functions $f_\star^{(t)} \circ g_\star$ from a function class $\mathcal{F} \times \mathcal{G}$.
We show that as the number of tasks $T$ increases, both the sample requirement and risk bound converge to that of $r$-dimensional regression.
arXiv Detail & Related papers (2024-10-15T03:20:19Z) - LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions [54.54897832889028]
We show that for any $K$, there is a "universal set" $U \subset [n]$ of size independent of $n$, such that for any $Q$ and any row $i$, the large attention scores $A_{i,j}$ in row $i$ of $A$ all have $j \in U$.
We empirically demonstrate the benefits of our scheme for vision transformers, showing how to train new models that use our universal set during training as well.
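As a toy numerical illustration of the stated structure (not the paper's algorithm; the cutoff tau and the sizes below are arbitrary assumptions), one can form a softmax attention matrix and collect the columns that ever receive a large score:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, tau = 512, 16, 0.05        # sequence length, head dim, "large score" cutoff (all assumed)

Q = rng.normal(size=(n, d))
K = rng.normal(size=(n, d))

scores = Q @ K.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)                 # row-stochastic attention matrix

U = np.unique(np.where(A > tau)[1])               # columns carrying a large score in some row
print(f"|U| = {U.size} of n = {n} columns hold every score above tau = {tau}")
```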
arXiv Detail & Related papers (2024-10-07T19:47:13Z) - IT$^3$: Idempotent Test-Time Training [95.78053599609044]
This paper introduces Idempotent Test-Time Training (IT$^3$), a novel approach to addressing the challenge of distribution shift.
IT$^3$ is based on the universal property of idempotence.
We demonstrate the versatility of our approach across various tasks, including corrupted image classification.
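A minimal sketch of the idempotence idea, under an assumed model signature and loss (not the paper's exact recipe): the model takes the input together with a candidate prediction, and at test time it is adapted, without labels, so that feeding its own output back in changes nothing.

```python
import torch
import torch.nn as nn

class IdempotentPredictor(nn.Module):
    """Predicts y from (x, y_candidate); y_candidate = 0 means 'no guess yet'."""
    def __init__(self, x_dim: int, y_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim)
        )

    def forward(self, x, y_cand):
        return self.net(torch.cat([x, y_cand], dim=-1))

def test_time_adapt(model, x, steps=5, lr=1e-3):
    """Label-free adaptation on a test batch: shrink the idempotence gap
    ||f(x, f(x, 0)) - f(x, 0)|| so the model's output is a fixed point."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    zero = torch.zeros(x.shape[0], model.net[-1].out_features)
    for _ in range(steps):
        y0 = model(x, zero)                 # first pass: no candidate prediction
        y1 = model(x, y0.detach())          # second pass: feed the prediction back in
        loss = (y1 - y0.detach()).pow(2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return model(x, zero)

model = IdempotentPredictor(x_dim=8, y_dim=3)    # hypothetical dimensions
adapted_preds = test_time_adapt(model, torch.randn(32, 8))
```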
arXiv Detail & Related papers (2024-10-05T15:39:51Z) - Distributional Associations vs In-Context Reasoning: A Study of Feed-forward and Attention Layers [49.80959223722325]
We study the distinction between feed-forward and attention layers in large language models.
We find that feed-forward layers tend to learn simple distributional associations such as bigrams, while attention layers focus on in-context reasoning.
arXiv Detail & Related papers (2024-06-05T08:51:08Z) - Neural network learns low-dimensional polynomials with SGD near the information-theoretic limit [75.4661041626338]
We study the problem of gradient descent learning of a single-index target function $f_*(\boldsymbol{x}) = \sigma_*\left(\langle \boldsymbol{x}, \boldsymbol{\theta} \rangle\right)$.
We prove that a two-layer neural network optimized by an SGD-based algorithm learns $f_*$ with a complexity that is not governed by information exponents.
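A minimal sketch of this setup, with the link function, width, and step size chosen arbitrarily for illustration: online SGD on a two-layer network fit to a single-index target.

```python
import numpy as np

rng = np.random.default_rng(0)
d, width, lr, steps = 32, 128, 0.01, 30000       # arbitrary illustrative choices

theta = rng.normal(size=d); theta /= np.linalg.norm(theta)    # hidden index direction
sigma_star = lambda z: z**3 - 3 * z                            # example link function (He_3)

W = rng.normal(size=(width, d)) / np.sqrt(d)      # first-layer weights
a = rng.normal(size=width) / np.sqrt(width)       # second-layer weights

for _ in range(steps):                            # online SGD: one fresh sample per step
    x = rng.normal(size=d)
    y = sigma_star(x @ theta)
    h = np.tanh(W @ x)
    err = a @ h - y
    grad_a = err * h                              # d(0.5 * err^2) / da
    grad_W = np.outer(err * a * (1 - h**2), x)    # d(0.5 * err^2) / dW
    a -= lr * grad_a
    W -= lr * grad_W

X_test = rng.normal(size=(2000, d))
mse = np.mean((np.tanh(X_test @ W.T) @ a - sigma_star(X_test @ theta)) ** 2)
print("test MSE:", mse)
```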
arXiv Detail & Related papers (2024-06-03T17:56:58Z) - Uncovering hidden geometry in Transformers via disentangling position
and context [0.6118897979046375]
We present a simple yet informative decomposition of hidden states (or embeddings) of trained transformers into interpretable components.
For popular transformer architectures and diverse text datasets, empirically we find pervasive mathematical structure.
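The decomposition can be illustrated with a simple ANOVA-style split of a hidden-state tensor; the tensor layout and component names below are assumptions rather than the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
# h[c, t, :] = hidden state of the token at position t in context (sequence) c
h = rng.normal(size=(100, 64, 256))        # (contexts, positions, embedding dim), placeholder data

mu = h.mean(axis=(0, 1))                   # global mean vector
pos = h.mean(axis=0) - mu                  # positional component: one vector per position t
ctx = h.mean(axis=1) - mu                  # context component: one vector per sequence c
resid = h - mu - pos[None, :, :] - ctx[:, None, :]

# The components sum back to the original hidden states.
assert np.allclose(mu + pos[None, :, :] + ctx[:, None, :] + resid, h)
```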
arXiv Detail & Related papers (2023-10-07T15:50:26Z) - Food Image Classification and Segmentation with Attention-based Multiple
Instance Learning [51.279800092581844]
The paper presents a weakly supervised methodology for training food image classification and semantic segmentation models.
The proposed methodology is based on a multiple instance learning approach in combination with an attention-based mechanism.
We conduct experiments on two meta-classes within the FoodSeg103 data set to verify the feasibility of the proposed approach.
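A minimal sketch of attention-based multiple instance learning pooling in the spirit of this summary (layer sizes and the scoring network are assumptions): patch-level features are weighted by learned attention scores and aggregated into a bag-level representation for weakly supervised classification.

```python
import torch
import torch.nn as nn

class AttentionMILPooling(nn.Module):
    """Scores each instance (e.g. image patch) with a small attention network
    and returns the attention-weighted average as the bag representation."""
    def __init__(self, feat_dim: int, attn_dim: int = 128):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )

    def forward(self, instances: torch.Tensor):
        # instances: (num_instances, feat_dim) for a single bag / image
        w = torch.softmax(self.attn(instances), dim=0)        # (num_instances, 1)
        bag = (w * instances).sum(dim=0)                       # (feat_dim,)
        return bag, w.squeeze(-1)

pool = AttentionMILPooling(feat_dim=512)
patches = torch.randn(36, 512)                   # hypothetical patch features from a backbone
bag_feat, weights = pool(patches)
logits = nn.Linear(512, 2)(bag_feat)             # weakly supervised bag-level classifier
```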
arXiv Detail & Related papers (2023-08-22T13:59:47Z) - On the Joint Interaction of Models, Data, and Features [82.60073661644435]
We introduce a new tool, the interaction tensor, for empirically analyzing the interaction between data and model through features.
Based on these observations, we propose a conceptual framework for feature learning.
Under this framework, the expected accuracy for a single hypothesis and agreement for a pair of hypotheses can both be derived in closed-form.
arXiv Detail & Related papers (2023-06-07T21:35:26Z) - High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation [89.21686761957383]
We study the first gradient descent step on the first-layer parameters $\boldsymbol{W}$ in a two-layer network.
Our results demonstrate that even one step can lead to a considerable advantage over random features.
arXiv Detail & Related papers (2022-05-03T12:09:59Z) - ACTIVE: Augmentation-Free Graph Contrastive Learning for Partial Multi-View Clustering [52.491074276133325]
We propose an augmentation-free graph contrastive learning framework to solve the problem of partial multi-view clustering.
The proposed approach elevates instance-level contrastive learning and missing data inference to the cluster-level, effectively mitigating the impact of individual missing data on clustering.
arXiv Detail & Related papers (2022-03-01T02:32:25Z) - Random matrices in service of ML footprint: ternary random features with no performance loss [55.30329197651178]
We show that the eigenspectrum of $\mathbf{K}$ is independent of the distribution of the i.i.d. entries of $\mathbf{w}$.
We propose a novel random features technique, called Ternary Random Feature (TRF).
The computation of the proposed random features requires no multiplication and a factor of $b$ less bits for storage compared to classical random features.
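A minimal sketch of a ternary random feature map in this spirit; the sparsity level, sign nonlinearity, and scaling below are assumptions. Because the weights lie in {-1, 0, +1}, the projection needs only signed additions and can be stored with very few bits per weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_features, zero_frac = 64, 1024, 1 / 3     # fraction of zero ternary weights (assumed)

# Ternary random weights in {-1, 0, +1}: multiplication-free projection, 2 bits per weight.
W = rng.choice([-1, 0, 1], size=(num_features, d),
               p=[(1 - zero_frac) / 2, zero_frac, (1 - zero_frac) / 2]).astype(np.int8)

def ternary_features(X: np.ndarray) -> np.ndarray:
    """phi(x) = f(Wx); here f = sign as an example nonlinearity (an assumption)."""
    return np.sign(X @ W.T)

X = rng.normal(size=(10, d))
Phi = ternary_features(X)
K_approx = Phi @ Phi.T / num_features             # kernel matrix estimated from random features
```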
arXiv Detail & Related papers (2021-10-05T09:33:49Z) - Deep Relational Metric Learning [84.95793654872399]
This paper presents a deep relational metric learning framework for image clustering and retrieval.
We learn an ensemble of features that characterizes an image from different aspects to model both interclass and intraclass distributions.
Experiments on the widely-used CUB-200-2011, Cars196, and Stanford Online Products datasets demonstrate that our framework improves existing deep metric learning methods and achieves very competitive results.
arXiv Detail & Related papers (2021-08-23T09:31:18Z) - Mediated Uncoupled Learning: Learning Functions without Direct Input-output Correspondences [80.95776331769899]
We consider the task of predicting $Y$ from $X$ when we have no paired data of them.
A naive approach is to predict $U$ from $X$ using $S_X$ and then $Y$ from $U$ using $S_Y$.
We propose a new method that avoids predicting $U$ but directly learns $Y = f(X)$ by training $f(X)$ with $S_X$ to predict $h(U)$.
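A minimal sketch of the direct approach described above, with network shapes, the squared loss, and placeholder data as assumptions: $h$ is fit on $S_Y$ to predict $Y$ from $U$, while $f$ is fit on $S_X$ to match $h(U)$, so $f(X)$ ends up predicting $Y$ without paired $(X, Y)$ data.

```python
import torch
import torch.nn as nn

x_dim, u_dim, y_dim = 10, 5, 1
f = nn.Sequential(nn.Linear(x_dim, 64), nn.ReLU(), nn.Linear(64, y_dim))   # target predictor f(X)
h = nn.Sequential(nn.Linear(u_dim, 64), nn.ReLU(), nn.Linear(64, y_dim))   # mediator head h(U)
opt = torch.optim.Adam(list(f.parameters()) + list(h.parameters()), lr=1e-3)

# Two unpaired datasets (placeholder random data): S_X links X to U, S_Y links U to Y.
x_sx, u_sx = torch.randn(256, x_dim), torch.randn(256, u_dim)
u_sy, y_sy = torch.randn(256, u_dim), torch.randn(256, y_dim)

for _ in range(200):
    loss_h = (h(u_sy) - y_sy).pow(2).mean()                 # fit h on S_Y: h(U) ~ Y
    loss_f = (f(x_sx) - h(u_sx).detach()).pow(2).mean()     # fit f on S_X: f(X) ~ h(U)
    loss = loss_h + loss_f
    opt.zero_grad(); loss.backward(); opt.step()

# f now maps X to an estimate of Y without ever seeing paired (X, Y) examples.
```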
arXiv Detail & Related papers (2021-07-16T22:13:29Z) - Generative Interventions for Causal Learning [27.371436971655303]
We introduce a framework for learning robust visual representations that generalize to new viewpoints, backgrounds, and scene contexts.
We show that we can steer generative models to manufacture interventions on features caused by confounding factors.
arXiv Detail & Related papers (2020-12-22T16:01:55Z) - On the Dynamics of Training Attention Models [30.85940880569692]
We study the dynamics of training a simple attention-based classification model using gradient descent.
We prove that training must converge to attending to the discriminative words when the attention output is classified by a linear classifier.
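A minimal sketch of the kind of model analyzed here (embedding size and scoring network are assumptions): softmax attention over word embeddings feeds a weighted average into a linear classifier, and one can inspect where the attention mass concentrates as training proceeds.

```python
import torch
import torch.nn as nn

class SimpleAttentionClassifier(nn.Module):
    """Attention over word embeddings followed by a linear classifier."""
    def __init__(self, vocab_size: int, dim: int = 32, num_classes: int = 2):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.score = nn.Linear(dim, 1, bias=False)    # per-word attention score
        self.clf = nn.Linear(dim, num_classes)        # linear classifier on the pooled vector

    def forward(self, tokens: torch.Tensor):
        e = self.emb(tokens)                          # (batch, seq_len, dim)
        w = torch.softmax(self.score(e), dim=1)       # attention over positions
        pooled = (w * e).sum(dim=1)                   # attention-weighted word average
        return self.clf(pooled), w.squeeze(-1)

model = SimpleAttentionClassifier(vocab_size=1000)
tokens = torch.randint(0, 1000, (4, 20))              # placeholder token ids
logits, attn = model(tokens)                           # attn: where the model looks per example
```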
arXiv Detail & Related papers (2020-11-19T18:55:30Z) - Faster Uncertainty Quantification for Inverse Problems with Conditional
Normalizing Flows [0.9176056742068814]
In inverse problems, we often have data consisting of paired samples $(x,y) \sim p_{X,Y}(x,y)$ where $y$ are partial observations of a physical system.
We propose a two-step scheme, which makes use of normalizing flows and joint data to train a conditional generator $q_\theta(x|y)$.
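A minimal sketch of the conditional-generator idea, using a single affine flow as a stand-in for a full normalizing flow; the architecture, placeholder data, and training details below are assumptions.

```python
import math
import torch
import torch.nn as nn

class ConditionalAffineFlow(nn.Module):
    """One-layer affine stand-in for a conditional normalizing flow q_theta(x|y):
    x = mu(y) + exp(log_sigma(y)) * z with z ~ N(0, I)."""
    def __init__(self, x_dim: int, y_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(y_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * x_dim))

    def params(self, y):
        return self.net(y).chunk(2, dim=-1)            # (mu, log_sigma)

    def log_prob(self, x, y):
        mu, log_sigma = self.params(y)
        z = (x - mu) * torch.exp(-log_sigma)
        # change of variables: log N(z; 0, I) - sum log_sigma
        return (-0.5 * z.pow(2) - 0.5 * math.log(2 * math.pi) - log_sigma).sum(dim=-1)

    def sample(self, y):
        mu, log_sigma = self.params(y)
        return mu + torch.exp(log_sigma) * torch.randn_like(mu)

# Train on joint samples (x, y) by maximum likelihood, then sample x given a fixed y.
flow = ConditionalAffineFlow(x_dim=4, y_dim=2)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
x, y = torch.randn(256, 4), torch.randn(256, 2)        # placeholder joint data
for _ in range(100):
    loss = -flow.log_prob(x, y).mean()
    opt.zero_grad(); loss.backward(); opt.step()
posterior_samples = flow.sample(y[:1].expand(1000, -1))  # draws from q_theta(x|y) for one y
```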
arXiv Detail & Related papers (2020-07-15T20:36:30Z) - How isotropic kernels perform on simple invariants [0.5729426778193397]
We investigate how the training curve of isotropic kernel methods depends on the symmetry of the task to be learned.
We show that for large bandwidth, $\beta = \frac{d-1+\xi}{3d-3+\xi}$, where $\xi \in (0,2)$ is the exponent characterizing the singularity of the kernel at the origin.
arXiv Detail & Related papers (2020-06-17T09:59:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.