Related papers: All ERMs Can Fail in Stochastic Convex Optimization Lower Bounds in Linear Dimension

URL: http://arxiv.org/abs/2602.08350v1
Date: Mon, 09 Feb 2026 07:33:01 GMT
Title: All ERMs Can Fail in Stochastic Convex Optimization Lower Bounds in Linear Dimension
Authors: Tal Burla, Roi Livni
Abstract summary: We show that there exists an instance in which the sample size is linear in the dimension, learning is possible, but the Empirical Risk Minimizer is likely to be unique and to overfit. We provide a novel generalization lower bound of $Ω\left(\sqrt{ηT/m^{1.5}}\right)$ for Gradient Descent, where $η$ is the learning rate, $T$ is the horizon and $m$ is the sample size.
Score: 14.982451024975733
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We study the sample complexity of the best-case Empirical Risk Minimizer in the setting of stochastic convex optimization. We show that there exists an instance in which the sample size is linear in the dimension, learning is possible, but the Empirical Risk Minimizer is likely to be unique and to overfit. This resolves an open question by Feldman. We also extend this to approximate ERMs. Building on our construction, we also show that (constrained) Gradient Descent potentially overfits when the horizon and learning rate grow w.r.t. the sample size. Specifically, we provide a novel generalization lower bound of $Ω\left(\sqrt{ηT/m^{1.5}}\right)$ for Gradient Descent, where $η$ is the learning rate, $T$ is the horizon and $m$ is the sample size. This narrows down, exponentially, the gap between the best known upper bound of $O(ηT/m)$ and existing lower bounds from previous constructions.
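To get a feel for how the two rates quoted in the abstract relate, the following minimal sketch evaluates both expressions with all hidden constants dropped. The function names and the sample values of $η$, $T$ and $m$ are hypothetical and chosen only for illustration; the abstract states the rates asymptotically and does not give constants or the exact regime in which they hold.

```python
import math


def gd_lower_bound(eta: float, T: int, m: int) -> float:
    """Lower-bound rate sqrt(eta * T / m^1.5), constants ignored (illustrative)."""
    return math.sqrt(eta * T / m ** 1.5)


def gd_upper_bound(eta: float, T: int, m: int) -> float:
    """Upper-bound rate eta * T / m, constants ignored (illustrative)."""
    return eta * T / m


if __name__ == "__main__":
    m = 10_000  # hypothetical sample size
    # Hypothetical (learning rate, horizon) pairs, not taken from the paper.
    for eta, T in [(0.01, 10_000), (0.1, 100_000), (1.0, 1_000_000)]:
        lo = gd_lower_bound(eta, T, m)
        hi = gd_upper_bound(eta, T, m)
        print(f"eta={eta:<5} T={T:<9} lower~{lo:.4f} upper~{hi:.4f}")
```

Since $ηT/m^{1.5} = (ηT/m)/\sqrt{m}$, the lower-bound expression is the square root of the upper-bound expression divided by $\sqrt{m}$, so the two are polynomially related.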

Related papers

Improved Rates of Differentially Private Nonconvex-Strongly-Concave Minimax Optimization [10.913566070767596]
We study the problem of (finite-sum) minimax optimization in the Differential Privacy (DP) model. We show that it is possible to get an estimator whose Descent $l$-norm of the empirical risk function is upper bounded by $tO(n)(n)$, whered is the sample size.
arXiv Detail & Related papers (2025-03-24T03:51:27Z)
The Sample Complexity of Gradient Descent in Stochastic Convex Optimization [14.268363583731848]
We show that the generalization error of full-batch Gradient Descent can be $\tilde{\Theta}(d/m + 1/\sqrt{m})$, where $d$ is the dimension and $m$ is the sample size. This matches the sample complexity of worst-case empirical risk minimizers.
arXiv Detail & Related papers (2024-04-07T12:07:33Z)
Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel. An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost. Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z)
The Dimension Strikes Back with Gradients: Generalization of Gradient Methods in Stochastic Convex Optimization [30.26365073195728]
We study the generalization performance of gradient methods in the fundamental convex optimization setting. We show that an application of the same construction technique provides a similar $\Omega(\sqrt{d})$ lower bound for the sample complexity of SGD to reach a non-trivial empirical error.
arXiv Detail & Related papers (2024-01-22T15:50:32Z)
Lower Generalization Bounds for GD and SGD in Smooth Stochastic Convex Optimization [9.019243171993553]
Training steps $T$ and step-size $\eta$ might affect generalization in smooth stochastic convex optimization (SCO) problems. We first provide tight excess risk lower bounds for Gradient Descent (GD) and Stochastic Gradient Descent (SGD). Recent works show better rates can be attained, but the improvement is reduced when training time is long.
arXiv Detail & Related papers (2023-03-19T20:24:33Z)
Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes (MDPs) in the fixed-confidence setting under a generative model. The lower bound as the solution of an intricate non- optimization program can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z)
Settling the Sample Complexity of Model-Based Offline Reinforcement Learning [50.5790774201146]
Offline reinforcement learning (RL) learns using pre-collected data without further exploration. Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality. We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z)
Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning [52.76230802067506]
A novel model-free algorithm is proposed to minimize regret in episodic reinforcement learning. The proposed algorithm employs an early-settled reference update rule, with the aid of two Q-learning sequences. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings.
arXiv Detail & Related papers (2021-10-09T21:13:48Z)
Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection [71.9765117768556]
We consider the contextual bandit problem under the high dimensional linear model. This setting finds essential applications such as personalized recommendation, online advertisement, and personalized medicine. We propose doubly growing epochs and estimating the parameter using the best subset selection method.
arXiv Detail & Related papers (2020-09-04T04:10:39Z)
On Suboptimality of Least Squares with Application to Estimation of Convex Bodies [74.39616164169131]
We settle an open problem regarding optimality of Least Squares in estimating a convex set from noisy support function measurements in dimension $d \geq 6$. We establish that Least Squares is sub-optimal, and achieves a rate of $\tilde{\Theta}_d(n^{-2/(d-1)})$ whereas the minimax rate is $\Theta_d(n^{-4/(d+3)})$.
arXiv Detail & Related papers (2020-06-07T05:19:00Z)
Breaking the Sample Size Barrier in Model-Based Reinforcement Learning with a Generative Model [50.38446482252857]
This paper is concerned with the sample efficiency of reinforcement learning, assuming access to a generative model (or simulator). We first consider $\gamma$-discounted infinite-horizon Markov decision processes (MDPs) with state space $\mathcal{S}$ and action space $\mathcal{A}$. We prove that a plain model-based planning algorithm suffices to achieve minimax-optimal sample complexity given any target accuracy level.
arXiv Detail & Related papers (2020-05-26T17:53:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.