Efficient and Minimax-optimal In-context Nonparametric Regression with Transformers
- URL: http://arxiv.org/abs/2601.15014v1
- Date: Wed, 21 Jan 2026 14:13:38 GMT
- Title: Efficient and Minimax-optimal In-context Nonparametric Regression with Transformers
- Authors: Michelle Ching, Ioana Popescu, Nico Smith, Tianyi Ma, William G. Underwood, Richard J. Samworth
- Abstract summary: We prove that a pretrained transformer with $\Theta(\log n)$ parameters and $\Omega\bigl(n^{2\alpha/(2\alpha+d)}\log^3 n\bigr)$ pretraining sequences can achieve the minimax-optimal rate of convergence.
- Score: 5.687100661457289
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study in-context learning for nonparametric regression with $\alpha$-Hölder smooth regression functions, for some $\alpha>0$. We prove that, with $n$ in-context examples and $d$-dimensional regression covariates, a pretrained transformer with $\Theta(\log n)$ parameters and $\Omega\bigl(n^{2\alpha/(2\alpha+d)}\log^3 n\bigr)$ pretraining sequences can achieve the minimax-optimal rate of convergence $O\bigl(n^{-2\alpha/(2\alpha+d)}\bigr)$ in mean squared error. Our result requires substantially fewer transformer parameters and pretraining sequences than previous results in the literature. This is achieved by showing that transformers are able to approximate local polynomial estimators efficiently by implementing a kernel-weighted polynomial basis and then running gradient descent.
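To make the final sentence of the abstract concrete, below is a minimal NumPy sketch (an illustration of the classical target estimator, not the paper's transformer construction; the data model and all hyperparameters are assumptions) of a local polynomial estimator: build a polynomial basis centred at the query point, weight it with a kernel, and fit the coefficients by gradient descent, with the intercept serving as the estimate of the regression function at the query.

```python
# Illustrative local polynomial regression via a kernel-weighted basis
# plus gradient descent -- the classical estimator the abstract says a
# transformer can emulate. Setup and hyperparameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, degree, bandwidth = 200, 2, 0.2
X = rng.uniform(-1.0, 1.0, size=n)
y = np.sin(3.0 * X) + 0.1 * rng.standard_normal(n)  # smooth truth + noise

def local_poly_predict(x0, num_steps=500, lr=0.5):
    """Estimate f(x0) by gradient descent on the kernel-weighted LS loss."""
    basis = np.vander(X - x0, degree + 1, increasing=True)  # [1, dx, dx^2]
    w = np.exp(-0.5 * ((X - x0) / bandwidth) ** 2)          # Gaussian kernel
    theta = np.zeros(degree + 1)
    for _ in range(num_steps):
        resid = basis @ theta - y
        theta -= lr * basis.T @ (w * resid) / w.sum()
    return theta[0]  # intercept = estimate of f(x0)

print(local_poly_predict(0.3), np.sin(3 * 0.3))
```

With a bandwidth of order $n^{-1/(2\alpha+d)}$ and polynomial degree $\lceil\alpha\rceil-1$, this classical estimator attains the $O\bigl(n^{-2\alpha/(2\alpha+d)}\bigr)$ rate quoted above, which is why approximating it efficiently suffices for minimax optimality.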
Related papers
- INC: An Indirect Neural Corrector for Auto-Regressive Hybrid PDE Solvers [61.84396402100827]
We propose the Indirect Neural Corrector ($\mathrm{INC}$), which integrates learned corrections into the governing equations. $\mathrm{INC}$ reduces the error amplification on the order of $t^{-1} + L$, where $t$ is the timestep and $L$ the Lipschitz constant. We test $\mathrm{INC}$ in extensive benchmarks, covering numerous differentiable solvers, neural backbones, and test cases ranging from a 1D chaotic system to 3D turbulence.
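As a toy contrast (a hypothetical scalar ODE, not the paper's solver or network), the snippet below shows the difference between a corrector added directly to the solver output and an indirect corrector injected into the right-hand side of the governing equation, where it is naturally scaled by the timestep.

```python
# Hypothetical illustration of direct vs. indirect correction for an
# auto-regressive hybrid step on du/dt = f(u); f and correction are
# stand-ins, not learned components from the paper.
import numpy as np

def f(u):
    return -u                    # stand-in governing dynamics

def correction(u):
    return 0.05 * np.sin(u)     # stand-in learned correction

def step_direct(u, dt):
    return u + dt * f(u) + correction(u)    # correct the solver output

def step_indirect(u, dt):
    return u + dt * (f(u) + correction(u))  # correct the equation itself

u_d = u_i = 1.0
for _ in range(100):
    u_d, u_i = step_direct(u_d, 0.01), step_indirect(u_i, 0.01)
print(u_d, u_i)  # the indirect correction's influence scales with dt
```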
arXiv Detail & Related papers (2025-11-16T20:14:28Z) - Scale-Invariant Regret Matching and Online Learning with Optimal Convergence: Bridging Theory and Practice in Zero-Sum Games [60.871651115241406]
A considerable chasm has been looming for decades between theory and practice in zero-sum game solving through first-order methods. We propose a new scale-invariant and parameter-free variant of PRM$^+$, which we call IREG-PRM$^+$. We show that it achieves $T^{-1/2}$ best-iterate and $T^{-1}$ optimal convergence guarantees, while also being on par with PRM$^+$ on benchmark games.
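For background, the following sketch runs plain Regret Matching$^+$ in self-play on rock-paper-scissors; it illustrates only the baseline family of algorithms and does not reproduce the scale-invariant, parameter-free IREG-PRM$^+$ construction from the paper.

```python
# Plain Regret Matching+ self-play on a zero-sum matrix game.
import numpy as np

A = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])  # RPS payoffs

def strategy(Q):
    s = Q.sum()
    return Q / s if s > 0 else np.full(len(Q), 1.0 / len(Q))

Qx, Qy, avg_x, T = np.zeros(3), np.zeros(3), np.zeros(3), 5000
for _ in range(T):
    x, y = strategy(Qx), strategy(Qy)
    ux, uy = A @ y, -A.T @ x                  # per-action utilities
    Qx = np.maximum(Qx + ux - x @ ux, 0.0)    # RM+: clip regrets at zero
    Qy = np.maximum(Qy + uy - y @ uy, 0.0)
    avg_x += x
print(avg_x / T)  # average strategy approaches (1/3, 1/3, 1/3)
```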
arXiv Detail & Related papers (2025-10-06T00:33:20Z) - Pretrained transformer efficiently learns low-dimensional target functions in-context [40.77319247558742]
We show that a nonlinear transformer optimized by gradient descent learns $f_*$ in-context with a prompt length that depends only on the dimension $r$ of the distribution of target functions.
Our result highlights the adaptivity of the pretrained transformer to low-dimensional structures of the function class, which enables sample-efficient ICL.
arXiv Detail & Related papers (2024-11-04T19:24:39Z) - In-context Learning for Mixture of Linear Regressions: Existence, Generalization and Training Dynamics [34.458004744956334]
We prove that there exists a transformer capable of achieving a prediction error of order $\mathcal{O}(\sqrt{d/n})$ with high probability. We also analyze the training dynamics of transformers with single linear self-attention layers, demonstrating that, with appropriately chosen parameters, gradient flow optimization over the population mean squared loss converges to a global optimum.
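A standard building block behind such results, sketched below under illustrative assumptions (isotropic Gaussian covariates and a single regression component rather than the paper's mixture), is that one linear self-attention layer with suitable weights computes a gradient-descent step on the in-context least-squares loss.

```python
# One linear self-attention "layer" as a one-step-of-GD predictor for
# in-context linear regression; the data model here is an assumption.
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 200
w_star = rng.standard_normal(d)
X = rng.standard_normal((n, d))                 # in-context inputs
y = X @ w_star + 0.1 * rng.standard_normal(n)   # in-context labels
x_query = rng.standard_normal(d)

# With value = y_i, key = x_i, query = x_query, linear attention gives
# prediction = (1/n) * sum_i y_i <x_i, x_query> = <w_gd, x_query>,
# where w_gd = (1/n) X^T y is one gradient step from w = 0.
y_hat = (y @ (X @ x_query)) / n
print(y_hat, w_star @ x_query)  # close, since X^T X / n is near identity
```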
arXiv Detail & Related papers (2024-10-18T05:28:47Z) - Nonsmooth Nonparametric Regression via Fractional Laplacian Eigenmaps [14.003044924094597]
We develop nonparametric regression methods for the case when the true regression function is not necessarily smooth. More specifically, our approach uses the fractional Laplacian and is designed to handle the case when the true regression function lies in an $L_2$-fractional Sobolev space with order $s \in (0,1)$.
arXiv Detail & Related papers (2024-02-22T21:47:29Z) - Convergence analysis of online algorithms for vector-valued kernel regression [0.0]
We consider the problem of approximating the regression function $f_\mu\colon \Omega \to Y$ from noisy $\mu$-distributed vector-valued data. We provide estimates for the expected squared error in the RKHS norm of the approximations $f^{(m)} \in H$ obtained by a standard regularized online approximation algorithm.
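A minimal scalar-valued version of such a regularized online scheme (the paper's vector-valued setting and exact step-size analysis are not reproduced) performs one stochastic gradient step in the RKHS per incoming sample, which amounts to shrinking the stored coefficients and appending a new kernel centre.

```python
# Online regularized kernel regression, scalar-valued sketch:
# f <- f - step * ((f(x) - y) * K(x, .) + lam * f)
import numpy as np

rng = np.random.default_rng(2)
kernel = lambda a, b: np.exp(-0.5 * (a - b) ** 2)
lam, centers, coefs = 0.01, [], []

def f(x):
    return sum(c * kernel(xc, x) for c, xc in zip(coefs, centers))

for m in range(1, 501):
    x = rng.uniform(-2, 2)
    y = np.tanh(x) + 0.05 * rng.standard_normal()
    step = 1.0 / np.sqrt(m)
    resid = f(x) - y
    coefs = [(1.0 - step * lam) * c for c in coefs]  # lam*f shrinks old terms
    coefs.append(-step * resid)                      # new kernel centre at x
    centers.append(x)

print(f(0.5), np.tanh(0.5))
```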
arXiv Detail & Related papers (2023-09-14T15:10:47Z) - Transformers as Support Vector Machines [54.642793677472724]
We establish a formal equivalence between the optimization geometry of self-attention and a hard-margin SVM problem.
We characterize the implicit bias of 1-layer transformers optimized with gradient descent.
We believe these findings inspire the interpretation of transformers as a hierarchy of SVMs that separates and selects optimal tokens.
arXiv Detail & Related papers (2023-08-31T17:57:50Z) - Almost Linear Constant-Factor Sketching for $\ell_1$ and Logistic Regression [74.28017932704704]
We improve upon previous oblivious sketching and turnstile streaming results for $\ell_1$ and logistic regression.
We also give a tradeoff that yields a $1+\varepsilon$ approximation in input sparsity time.
Our sketch can be extended to approximate a regularized version of logistic regression where the data-dependent regularizer corresponds to the variance of the individual logistic losses.
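For orientation only, the toy below shows the generic oblivious sketch-and-solve pattern with a CountSketch-style matrix applied to ordinary least squares; the paper's $\ell_1$ and logistic-regression guarantees rely on different sketch distributions and analyses, so this conveys just the shape of the approach.

```python
# Oblivious sketch-and-solve: compress (A, b) with a CountSketch-style
# matrix S, then solve the small problem. Least-squares stand-in only.
import numpy as np

rng = np.random.default_rng(3)
n, d, m = 20000, 20, 2000                 # sketch size m << n
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

buckets = rng.integers(0, m, size=n)      # each row hashed to one bucket
signs = rng.choice([-1.0, 1.0], size=n)   # with a random sign
SA, Sb = np.zeros((m, d)), np.zeros(m)
np.add.at(SA, buckets, signs[:, None] * A)
np.add.at(Sb, buckets, signs * b)

x_sketch = np.linalg.lstsq(SA, Sb, rcond=None)[0]
x_full = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x_sketch - x_full))  # small: sketched solution is close
```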
arXiv Detail & Related papers (2023-03-31T18:12:33Z) - Generalization and Stability of Interpolating Neural Networks with Minimal Width [37.908159361149835]
We investigate the generalization and optimization of shallow neural networks trained by gradient descent in the interpolating regime.
We prove convergence of the training loss with $m=\Omega(\log^4(n))$ neurons and $T\approx n$ iterations.
With $m=\Omega(\log^4(n))$ neurons and $T\approx n$, we bound the test loss by $\tilde{O}(1/\cdot)$.
arXiv Detail & Related papers (2023-02-18T05:06:15Z) - Provably Efficient Convergence of Primal-Dual Actor-Critic with Nonlinear Function Approximation [15.319335698574932]
We show the first efficient convergence result for primal-dual actor-critic, with a convergence rate of $\mathcal{O}\bigl(\sqrt{\ln(N)/N}\bigr)$ under Markovian sampling.
We demonstrate results on OpenAI Gym continuous control tasks.
arXiv Detail & Related papers (2022-02-28T15:16:23Z) - Mixability made efficient: Fast online multiclass logistic regression [68.8204255655161]
We show that mixability can be a powerful tool to obtain algorithms with optimal regret.
However, the resulting methods often suffer from high computational complexity, which has reduced their practical applicability.
arXiv Detail & Related papers (2021-10-08T08:22:05Z) - Online nonparametric regression with Sobolev kernels [99.12817345416846]
We derive the regret upper bounds on the classes of Sobolev spaces $W_p^\beta(\mathcal{X})$, $p\geq 2$, $\beta>\frac{d}{p}$.
The upper bounds are supported by the minimax regret analysis, which reveals that in the cases $\beta>\frac{d}{2}$ or $p=\infty$ these rates are (essentially) optimal.
arXiv Detail & Related papers (2021-02-06T15:05:14Z) - Truncated Linear Regression in High Dimensions [26.41623833920794]
In truncated linear regression, we observe pairs $(A_i, y_i)_i$ whose dependent variable equals $y_i = A_i^{\mathrm{T}} x^* + \eta_i$, where $x^*$ is some fixed unknown vector of interest.
The goal is to recover $x^*$ under some favorable conditions on the $A_i$'s and the noise distribution.
We prove that there exists a computationally and statistically efficient method for recovering $k$-sparse $n$-dimensional vectors $x^*$ from $m$ truncated samples.
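The flavor of such recovery procedures can be seen in the toy below: a low-dimensional, non-sparse stand-in (the paper's setting is $k$-sparse and high-dimensional, with more care about the truncation set) that runs SGD on the truncated negative log-likelihood. The key step is that the gradient replaces the usual model prediction with a sample from the truncated Gaussian at the current iterate.

```python
# Toy truncated linear regression via SGD on the truncated negative
# log-likelihood; low-dimensional, non-sparse stand-in for illustration.
import numpy as np

rng = np.random.default_rng(4)
d, m = 3, 2000
in_set = lambda v: v > 0.0                  # truncation set: keep y > 0
x_star = np.array([1.0, -0.5, 0.3])

A_rows, ys = [], []                         # observe only truncated samples
while len(ys) < m:
    a = rng.standard_normal(d)
    y = a @ x_star + rng.standard_normal()
    if in_set(y):
        A_rows.append(a); ys.append(y)
A, y = np.array(A_rows), np.array(ys)

def sample_truncated(mean):
    while True:                             # rejection-sample N(mean, 1) | set
        z = mean + rng.standard_normal()
        if in_set(z):
            return z

x = np.zeros(d)
for t in range(1, 20001):
    i = rng.integers(m)
    z = sample_truncated(A[i] @ x)          # model's truncated prediction
    x -= (1.0 / np.sqrt(t)) * (z - y[i]) * A[i]
print(x, x_star)
```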
arXiv Detail & Related papers (2020-07-29T00:31:34Z) - Optimal Robust Linear Regression in Nearly Linear Time [97.11565882347772]
We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X, w^* \rangle + \epsilon$.
We propose estimators for this problem under two settings: (i) $X$ is $L_4$-$L_2$ hypercontractive, $\mathbb{E}[XX^\top]$ has bounded condition number, and $\epsilon$ has bounded variance; and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is
arXiv Detail & Related papers (2020-07-16T06:44:44Z)