In-Context Learning as Nonparametric Conditional Probability Estimation: Risk Bounds and Optimality
- URL: http://arxiv.org/abs/2508.08673v2
- Date: Sun, 31 Aug 2025 10:02:41 GMT
- Title: In-Context Learning as Nonparametric Conditional Probability Estimation: Risk Bounds and Optimality
- Authors: Chenrui Liu, Falong Tan, Chuanlong Xie, Yicheng Zeng, Lixing Zhu
- Abstract summary: We formalize each task as a sequence of labeled examples followed by a query input; a pretrained model then estimates the query's conditional class probabilities. The expected excess risk is defined as the average truncated Kullback-Leibler (KL) divergence between the predicted and true conditional class distributions. We establish a new oracle inequality for this risk, based on KL divergence, in multiclass classification.
- Score: 9.893068784551879
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates the expected excess risk of in-context learning (ICL) for multiclass classification. We formalize each task as a sequence of labeled examples followed by a query input; a pretrained model then estimates the query's conditional class probabilities. The expected excess risk is defined as the average truncated Kullback-Leibler (KL) divergence between the predicted and true conditional class distributions over a specified family of tasks. We establish a new oracle inequality for this risk, based on KL divergence, in multiclass classification. This yields tight upper and lower bounds for transformer-based models, showing that the ICL estimator achieves the minimax optimal rate (up to logarithmic factors) for conditional probability estimation. From a technical standpoint, our results introduce a novel method for controlling generalization error via uniform empirical entropy. We further demonstrate that multilayer perceptrons (MLPs) can also perform ICL and attain the same optimal rate (up to logarithmic factors) under suitable assumptions, suggesting that effective ICL need not be exclusive to transformer architectures.
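The truncated-KL risk is only named in the abstract; below is a minimal numerical sketch, assuming truncation means capping each log-ratio term at a level $T$ (the paper's exact truncation and task family may differ):

```python
import numpy as np

def truncated_kl(p_true, p_hat, T=10.0):
    """KL(p_true || p_hat) with each log-ratio term capped at T.

    The cap guards against the blow-up that occurs when the model
    assigns near-zero mass to a class the true conditional supports.
    """
    log_ratio = np.log(p_true) - np.log(p_hat)
    return float(np.sum(p_true * np.minimum(log_ratio, T)))

# Expected excess risk: average the truncated KL over a family of tasks.
rng = np.random.default_rng(0)
tasks = [(rng.dirichlet(np.ones(5)), rng.dirichlet(np.ones(5)))
         for _ in range(100)]
print(np.mean([truncated_kl(p, q) for p, q in tasks]))
```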
Related papers
- Uncertainty Estimation by Flexible Evidential Deep Learning [11.945854832533234]
Uncertainty quantification (UQ) is crucial for deploying machine learning models in high-stakes applications. Evidential deep learning (EDL) achieves efficiency by modeling uncertainty through the prediction of a Dirichlet distribution over class probabilities. We propose $\mathcal{F}$-EDL, which extends EDL by predicting a flexible Dirichlet distribution (a generalization of the Dirichlet distribution) over class probabilities.
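For orientation, standard EDL maps per-class evidence to a Dirichlet; a minimal sketch of how class probabilities and a vacuity-style uncertainty score are read off such a prediction (the flexible Dirichlet of $\mathcal{F}$-EDL generalizes this and is not reproduced here):

```python
import numpy as np

def edl_readout(evidence):
    """Standard EDL readout: alpha_k = evidence_k + 1.

    Expected class probabilities are alpha / sum(alpha); total
    uncertainty (vacuity) is K / sum(alpha), high when evidence is scarce.
    """
    alpha = np.asarray(evidence, dtype=float) + 1.0
    probs = alpha / alpha.sum()
    vacuity = len(alpha) / alpha.sum()
    return probs, vacuity

probs, u = edl_readout([4.0, 1.0, 0.0])  # strong evidence for class 0
print(probs, u)
```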
arXiv Detail & Related papers (2025-10-21T06:12:33Z) - In-Context Learning Is Provably Bayesian Inference: A Generalization Theory for Meta-Learning [51.56484100374058]
We introduce a principled risk decomposition that separates the total ICL risk into two components: the Bayes Gap and the Posterior Variance. For a uniform-attention Transformer, we derive a non-asymptotic upper bound on the Bayes Gap, which explicitly clarifies the dependence on the number of pretraining prompts. The Posterior Variance is a model-independent risk representing the intrinsic task uncertainty.
arXiv Detail & Related papers (2025-10-13T03:42:31Z) - Probabilistic Variational Contrastive Learning [8.23660331371415]
We propose a decoder-free framework that maximizes the evidence lower bound (ELBO). We model the approximate posterior $q_\theta(z|x)$ as a projected normal distribution, enabling the sampling of probabilistic embeddings.
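A projected normal is a Gaussian pushed onto the unit sphere, so sampling an embedding reduces to normalizing a Gaussian draw; a minimal sketch under that reading (parameter names are illustrative):

```python
import numpy as np

def sample_projected_normal(mu, sigma, n_samples=8, seed=0):
    """Draw z = g / ||g|| with g ~ N(mu, sigma^2 I).

    Each row lies on the unit sphere, giving probabilistic embeddings
    compatible with cosine-similarity contrastive objectives.
    """
    rng = np.random.default_rng(seed)
    g = rng.normal(loc=mu, scale=sigma, size=(n_samples, len(mu)))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

z = sample_projected_normal(np.array([1.0, 0.0, 0.0]), sigma=0.3)
print(np.linalg.norm(z, axis=1))  # all ones: samples live on the sphere
```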
arXiv Detail & Related papers (2025-06-11T20:26:07Z) - Nonparametric logistic regression with deep learning [1.0589208420411012]
In nonparametric logistic regression, the Kullback-Leibler divergence can diverge easily. Instead of analyzing the excess risk itself, it suffices to show the consistency of the maximum likelihood estimator. As an important application, we derive convergence rates of the nonparametric maximum likelihood estimator (NPMLE) with fully connected deep neural networks.
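The divergence arises because log-likelihood terms blow up as predicted probabilities approach 0 or 1; a common numerical guard, sketched below, is to clip probabilities before taking logs (the paper's remedy is analytical, via consistency of the MLE, not this clipping):

```python
import numpy as np

def safe_log_likelihood(y, p, eps=1e-6):
    """Bernoulli log-likelihood with probabilities clipped away from {0, 1}.

    Without clipping, a single fully confident mistake makes the
    empirical KL / negative log-likelihood infinite.
    """
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

y = np.array([1, 0, 1, 1])
p = np.array([0.99, 0.01, 1.0, 0.0])  # the last two entries would give -inf
print(safe_log_likelihood(y, p))
```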
arXiv Detail & Related papers (2024-01-23T04:31:49Z) - Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied to either risk-seeking or risk-averse policy optimization.
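Schematically, an uncertainty Bellman equation propagates local uncertainty through the transition dynamics like a reward under a squared discount; the tabular fixed-point sketch below is a hedged illustration of that idea, not the paper's exact UBE:

```python
import numpy as np

def solve_ube(P, local_u, gamma=0.99, iters=1000):
    """Fixed-point iteration for U = local_u + gamma^2 * P @ U.

    P       : (S, S) state-transition matrix under the current policy
    local_u : (S,) per-state local (epistemic) uncertainty
    Since gamma^2 < 1, the map is a contraction and iteration converges.
    """
    U = np.zeros_like(local_u)
    for _ in range(iters):
        U = local_u + gamma**2 * P @ U
    return U

P = np.array([[0.9, 0.1], [0.2, 0.8]])
print(solve_ube(P, local_u=np.array([1.0, 0.1])))
```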
arXiv Detail & Related papers (2023-12-07T15:55:58Z) - Likelihood Ratio Confidence Sets for Sequential Decision Making [51.66638486226482]
We revisit the likelihood-based inference principle and propose to use likelihood ratios to construct valid confidence sequences.
Our method is especially suitable for problems with well-specified likelihoods.
We show how to provably choose the best sequence of estimators and shed light on connections to online convex optimization.
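A static (non-sequential) version of the construction keeps every parameter whose likelihood ratio against the MLE stays above a threshold; a minimal Gaussian-mean sketch, only to illustrate the shape of the set (the paper's sequences are anytime-valid, which this is not):

```python
import numpy as np

def lr_confidence_set(x, grid, alpha=0.05, sigma=1.0):
    """Keep theta with L(theta) / L(theta_hat) >= alpha.

    For N(theta, sigma^2) data the log likelihood ratio reduces to
    -n * (theta - xbar)^2 / (2 * sigma^2).
    """
    n, xbar = len(x), float(np.mean(x))
    log_lr = -n * (grid - xbar) ** 2 / (2.0 * sigma**2)
    return grid[log_lr >= np.log(alpha)]

x = np.random.default_rng(1).normal(0.3, 1.0, size=50)
cs = lr_confidence_set(x, grid=np.linspace(-1.0, 1.0, 2001))
print(cs.min(), cs.max())  # an interval around the sample mean
```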
arXiv Detail & Related papers (2023-11-08T00:10:21Z) - What and How does In-Context Learning Learn? Bayesian Model Averaging,
Parameterization, and Generalization [111.55277952086155]
We study In-Context Learning (ICL) by addressing several open questions.
We show that, without updating the neural network parameters, ICL implicitly implements the Bayesian model averaging algorithm.
We prove that the error of the pretrained model is bounded by a sum of an approximation error and a generalization error.
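In schematic form, the bound described above reads as follows (a hedged paraphrase of the usual approximation/generalization split, not the paper's exact statement):

```latex
\mathrm{err}(\hat{f}) \;\le\;
\underbrace{\inf_{f \in \mathcal{F}} \mathrm{err}(f)}_{\text{approximation error}}
\;+\;
\underbrace{O\!\Big(\sqrt{\mathrm{comp}(\mathcal{F})/n}\Big)}_{\text{generalization error}},
```

where $\mathrm{comp}(\mathcal{F})$ stands for a complexity measure of the model class and $n$ for the pretraining sample size.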
arXiv Detail & Related papers (2023-05-30T21:23:47Z) - Learning from a Biased Sample [3.546358664345473]
We propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions.
We empirically validate our proposed method in a case study on prediction of mental health scores from health survey data.
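One standard instance of worst-case risk over a family of test distributions is the CVaR objective, which averages the hardest fraction of losses; a minimal sketch (the paper's distribution family and decision rule are its own):

```python
import numpy as np

def cvar_risk(losses, alpha=0.2):
    """Mean of the worst alpha-fraction of losses.

    Equals the supremum of the reweighted risk over all test
    distributions whose density w.r.t. the sample is at most 1/alpha.
    """
    k = max(1, int(np.ceil(alpha * len(losses))))
    return float(np.mean(np.sort(losses)[-k:]))

losses = np.random.default_rng(2).exponential(size=1000)
print(cvar_risk(losses), losses.mean())  # robust risk exceeds average risk
```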
arXiv Detail & Related papers (2022-09-05T04:19:16Z) - Self-Certifying Classification by Linearized Deep Assignment [65.0100925582087]
We propose a novel class of deep predictors for classifying metric data on graphs within the PAC-Bayes risk certification paradigm.
Building on the recent PAC-Bayes literature and data-dependent priors, this approach enables learning posterior distributions on the hypothesis space.
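For reference, a standard PAC-Bayes-kl certificate of the kind such approaches instantiate (the classical Maurer/Seeger form, not this paper's specific bound): with probability at least $1-\delta$ over an $n$-sample, for all posteriors $\rho$,

```latex
\mathrm{kl}\big(\hat{L}(\rho)\,\|\,L(\rho)\big) \;\le\;
\frac{\mathrm{KL}(\rho \,\|\, \pi) + \ln(2\sqrt{n}/\delta)}{n},
```

where $\pi$ is the prior, $\hat{L}$ the empirical risk, and $L$ the true risk; data-dependent priors, as used here, require additional care.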
arXiv Detail & Related papers (2022-01-26T19:59:14Z) - Scaling Ensemble Distribution Distillation to Many Classes with Proxy
Targets [12.461503242570643]
Ensemble Distribution Distillation is an approach that allows a single model to efficiently capture both the predictive performance and uncertainty estimates of an ensemble.
For classification, this is achieved by training a Dirichlet distribution over the ensemble members' output distributions via the maximum likelihood criterion.
Although theoretically principled, this criterion exhibits poor convergence when applied to large-scale tasks where the number of classes is very high.
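The maximum-likelihood criterion above fits a Dirichlet to the ensemble members' predicted distributions; a minimal sketch of that negative log-likelihood (the proxy-target modification for many classes is not shown):

```python
import numpy as np
from scipy.special import gammaln

def dirichlet_nll(alpha, member_probs, eps=1e-8):
    """Negative log-likelihood of ensemble members' probability
    vectors under Dir(alpha): the distillation training criterion.
    """
    member_probs = np.clip(member_probs, eps, 1.0)
    log_pdf = (gammaln(alpha.sum()) - gammaln(alpha).sum()
               + ((alpha - 1.0) * np.log(member_probs)).sum(axis=1))
    return float(-log_pdf.mean())

alpha = np.array([5.0, 2.0, 1.0])               # distilled concentration
members = np.random.default_rng(3).dirichlet(alpha, size=10)
print(dirichlet_nll(alpha, members))
```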
arXiv Detail & Related papers (2021-05-14T17:50:14Z) - Estimation and Applications of Quantiles in Deep Binary Classification [0.0]
Quantile regression, based on the check loss, is a widely used inferential paradigm in statistics.
We consider the analogue of check loss in the binary classification setting.
We develop individualized confidence scores that can be used to decide whether a prediction is reliable.
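The check (pinball) loss at the heart of quantile regression has a one-line form; a minimal sketch (the paper's binary-classification analogue adapts this loss, which is not reproduced here):

```python
import numpy as np

def check_loss(residual, tau):
    """Pinball loss rho_tau(u) = u * (tau - 1{u < 0}).

    Minimizing its expectation over constant predictions recovers
    the tau-quantile of the target distribution.
    """
    u = np.asarray(residual, dtype=float)
    return float(np.mean(u * (tau - (u < 0))))

y = np.random.default_rng(4).normal(size=1000)
# predicting the empirical 0.9-quantile roughly minimizes the tau=0.9 loss
print(check_loss(y - np.quantile(y, 0.9), tau=0.9))
```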
arXiv Detail & Related papers (2021-02-09T07:07:42Z) - Amortized Conditional Normalized Maximum Likelihood: Reliable Out of
Distribution Uncertainty Estimation [99.92568326314667]
We propose the amortized conditional normalized maximum likelihood (ACNML) method as a scalable general-purpose approach for uncertainty estimation.
Our algorithm builds on the conditional normalized maximum likelihood (CNML) coding scheme, which has minimax optimal properties according to the minimum description length principle.
We demonstrate that ACNML compares favorably to a number of prior techniques for uncertainty estimation in terms of calibration on out-of-distribution inputs.
arXiv Detail & Related papers (2020-11-05T08:04:34Z) - Selective Classification via One-Sided Prediction [54.05407231648068]
A one-sided prediction (OSP) based relaxation yields a selective classification (SC) scheme that attains near-optimal coverage in the practically relevant high target accuracy regime.
We theoretically derive generalization bounds for SC and OSP, and empirically show that our scheme strongly outperforms state-of-the-art methods in coverage at small error levels.
arXiv Detail & Related papers (2020-10-15T16:14:27Z) - A One-step Approach to Covariate Shift Adaptation [82.01909503235385]
A default assumption in many machine learning scenarios is that the training and test samples are drawn from the same probability distribution.
We propose a novel one-step approach that jointly learns the predictive model and the associated weights in one optimization.
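For context, the classical two-step recipe first estimates density-ratio weights $w(x) \approx p_{\text{test}}(x)/p_{\text{train}}(x)$ and then minimizes the weighted risk sketched below; the paper's one-step approach folds both into a single optimization, which this sketch does not attempt:

```python
import numpy as np

def weighted_risk(losses, weights):
    """Importance-weighted empirical risk: mean of w(x_i) * loss_i."""
    return float(np.mean(weights * losses))

losses = np.array([0.2, 1.5, 0.3, 0.9])
weights = np.array([0.5, 2.0, 1.0, 0.5])  # illustrative density-ratio values
print(weighted_risk(losses, weights))
```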
arXiv Detail & Related papers (2020-07-08T11:35:47Z)