Oracle-Efficient Hybrid Online Learning with Unknown Distribution
- URL: http://arxiv.org/abs/2401.15520v1
- Date: Sat, 27 Jan 2024 22:45:02 GMT
- Title: Oracle-Efficient Hybrid Online Learning with Unknown Distribution
- Authors: Changlong Wu, Jin Sima, Wojciech Szpankowski
- Abstract summary: We show that there exists a computationally efficient online predictor that achieves a regret upper bounded by $\tilde{O}(T^{\frac{3}{4}})$ for a finite-VC class, and upper bounded by $\tilde{O}(T^{\frac{p+1}{p+2}})$ for a class with $\alpha$ fat-shattering dimension $\alpha^{-p}$.
We then extend our result to the scenario of shifting distributions with $K$ changes, yielding a regret of order $\tilde{O}(T^{\frac{4}{5}}K^{\frac{1}{5}})$.
- Score: 16.73555330791045
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of oracle-efficient hybrid online learning when the
features are generated by an unknown i.i.d. process and the labels are
generated adversarially. Assuming access to an (offline) ERM oracle, we show
that there exists a computationally efficient online predictor that achieves a
regret upper bounded by $\tilde{O}(T^{\frac{3}{4}})$ for a finite-VC class, and
upper bounded by $\tilde{O}(T^{\frac{p+1}{p+2}})$ for a class with $\alpha$
fat-shattering dimension $\alpha^{-p}$. This provides the first known
oracle-efficient sublinear regret bounds for hybrid online learning with an
unknown feature generation process. In particular, it confirms a conjecture of
Lazaric and Munos (JCSS 2012). We then extend our result to the scenario of
shifting distributions with $K$ changes, yielding a regret of order
$\tilde{O}(T^{\frac{4}{5}}K^{\frac{1}{5}})$. Finally, we establish a regret of
$\tilde{O}((K^{\frac{2}{3}}(\log|\mathcal{H}|)^{\frac{1}{3}}+K)\cdot
T^{\frac{4}{5}})$ for the contextual $K$-armed bandits with a finite policy set
$\mathcal{H}$, i.i.d. generated contexts from an unknown distribution, and
adversarially generated costs.
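To make the protocol concrete, below is a minimal sketch of the interaction loop the abstract describes: features drawn i.i.d. from a law the learner never sees, labels chosen adversarially, and a learner that touches the hypothesis class only through an offline ERM oracle. Everything in the sketch (the threshold class, the hallucinated-sample perturbation, the parameter values, and all names) is a hypothetical simplification of ours, not the paper's actual construction.

```python
import random

# A hypothetical finite-VC hypothesis class: one-dimensional threshold
# classifiers on [0, 1] (VC dimension 1). Purely illustrative.
THRESHOLDS = [i / 20 for i in range(21)]


def erm_oracle(sample):
    """Offline ERM oracle: return the threshold with smallest empirical
    0-1 loss on `sample`, a list of (feature, label) pairs."""
    return min(THRESHOLDS,
               key=lambda th: sum(int((x >= th) != y) for x, y in sample))


def hybrid_online_learner(T, draw_feature, adversary, n_fake=5, seed=0):
    """Follow-the-perturbed-leader-style loop: each round, one ERM call
    on the observed history plus a few randomly labeled 'hallucinated'
    examples (a standard oracle-efficient perturbation; the number and
    construction of the fake examples here are arbitrary choices)."""
    rng = random.Random(seed)
    history, mistakes = [], 0
    for t in range(T):
        x = draw_feature()                        # i.i.d. feature arrives
        # Hallucinated perturbation examples with uniformly random labels.
        # For brevity we draw fresh features; the paper's learner cannot
        # (the law is unknown) and instead reuses past observed features.
        fake = [(draw_feature(), rng.randint(0, 1)) for _ in range(n_fake)]
        theta = erm_oracle(history + fake)        # single oracle call
        y_hat = int(x >= theta)                   # commit to a prediction
        y = adversary(t, x)                       # adversarial label revealed
        mistakes += int(y_hat != y)
        history.append((x, y))
    return mistakes


if __name__ == "__main__":
    feature_rng = random.Random(42)
    m = hybrid_online_learner(
        T=500,
        draw_feature=feature_rng.random,          # unknown i.i.d. feature law
        adversary=lambda t, x: int(x >= 0.3),     # here: a fixed labeler
    )
    print("mistakes over 500 rounds:", m)
```

The only point carried over from the paper is the access pattern: a constant number of offline ERM calls per round and no explicit enumeration of the class; the predictor that actually attains the $\tilde{O}(T^{\frac{3}{4}})$ bound is built differently.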
Related papers
- Enjoying Non-linearity in Multinomial Logistic Bandits [66.36004256710824]
We consider the multinomial logistic bandit problem, a variant of generalized linear bandits where a learner interacts with an environment. We extend the definition of $\kappa_*$ to the multinomial setting and propose an efficient algorithm that leverages the problem's non-linearity. Our method yields a problem-dependent regret bound of order $\widetilde{\mathcal{O}}(Kd\sqrt{T}/\kappa_*)$, where $K$ is the number of actions.
arXiv Detail & Related papers (2025-07-07T08:18:25Z) - Improved and Oracle-Efficient Online $\ell_1$-Multicalibration [14.147331133778916]
We study online multicalibration, a framework for ensuring calibrated predictions across multiple groups in adversarial settings. We propose a direct method that achieves improved and oracle-efficient rates of $\widetilde{\mathcal{O}}(T^{-1/4})$. Our framework also extends to certain infinite families of groups by exploiting a $1$-Lipschitz property of the $\ell_1$-multicalibration error.
arXiv Detail & Related papers (2025-05-23T00:37:49Z) - Learning and Computation of $Φ$-Equilibria at the Frontier of Tractability [85.07238533644636]
$\Phi$-equilibria form a powerful and flexible framework at the heart of online learning and game theory.
We show that an efficient online algorithm incurs average $\Phi$-regret at most $\epsilon$ using $\mathrm{poly}(d, k)/\epsilon^2$ rounds.
We also show nearly matching lower bounds in the online setting, thereby obtaining for the first time a family of deviations that captures the learnability of $\Phi$-regret.
arXiv Detail & Related papers (2025-02-25T19:08:26Z) - Sign Operator for Coping with Heavy-Tailed Noise in Non-Convex Optimization: High Probability Bounds Under $(L_0, L_1)$-Smoothness [74.18546828528298]
We show that SignSGD with Majority Voting can robustly work across the whole range of problem parameters, with high-probability complexity bounds under $(L_0, L_1)$-smoothness.
arXiv Detail & Related papers (2025-02-11T19:54:11Z) - Online Learning of Halfspaces with Massart Noise [47.71073318490341]
We study the task of online learning in the presence of Massart noise.
We present a computationally efficient algorithm that achieves mistake bound $\eta T + o(T)$.
We use our Massart online learner to design an efficient bandit algorithm that obtains expected reward at least $(1 - 1/k)\Delta T - o(T)$ bigger than choosing a random action at every round.
arXiv Detail & Related papers (2024-05-21T17:31:10Z) - Federated Combinatorial Multi-Agent Multi-Armed Bandits [79.1700188160944]
This paper introduces a federated learning framework tailored for online combinatorial optimization with bandit feedback.
In this setting, agents select subsets of arms, observe noisy rewards for these subsets without accessing individual arm information, and can cooperate and share information at specific intervals.
arXiv Detail & Related papers (2024-05-09T17:40:09Z) - An Improved Relaxation for Oracle-Efficient Adversarial Contextual
Bandits [14.058574632553547]
We present an oracle-efficient relaxation for the adversarial contextual bandits problem.
Our algorithm has a regret bound of $O(T^{\frac{2}{3}}(K\log(|\Pi|))^{\frac{1}{3}})$ and makes at most $O(K)$ calls per round to an offline optimization oracle.
arXiv Detail & Related papers (2023-10-29T14:31:34Z) - Adversarial Online Multi-Task Reinforcement Learning [12.421997449847153]
We consider the adversarial online multi-task reinforcement learning setting.
In each of $K$ episodes the learner is given an unknown task taken from a finite set of $M$ unknown finite-horizon MDP models.
The learner's objective is to minimize its regret with respect to the optimal policy for each task.
arXiv Detail & Related papers (2023-01-11T02:18:26Z) - Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning [54.806166861456035]
We study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches.
We design a computationally efficient algorithm that achieves near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$ in $K$ episodes, where $\tilde{O}(\cdot)$ hides logarithmic terms of $(S, A, H, K)$.
Our technical contributions are two-fold: 1) a near-optimal design scheme to explore
arXiv Detail & Related papers (2022-10-15T09:22:22Z) - Optimism in Face of a Context: Regret Guarantees for Stochastic
Contextual MDP [46.86114958340962]
We present regret minimization algorithms for contextual MDPs under a minimum reachability assumption.
Our approach is the first optimistic approach applied to contextual MDPs with general function approximation.
arXiv Detail & Related papers (2022-07-22T15:00:15Z) - Oracle-Efficient Online Learning for Beyond Worst-Case Adversaries [29.598518028635716]
We study oracle-efficient algorithms for beyond worst-case analysis of online learning.
For the smoothed analysis setting, our results give the first oracle-efficient algorithm for online learning with smoothed adversaries.
arXiv Detail & Related papers (2022-02-17T09:49:40Z) - Learning Stochastic Shortest Path with Linear Function Approximation [74.08819218747341]
We study the stochastic shortest path (SSP) problem in reinforcement learning with linear function approximation, where the transition kernel is represented as a linear mixture of unknown models.
We propose a novel algorithm for learning the linear mixture SSP, which can attain a $\tilde{O}(d B_\star^{1.5} \sqrt{K} / c_{\min})$ regret.
arXiv Detail & Related papers (2021-10-25T08:34:00Z) - Model Selection with Near Optimal Rates for Reinforcement Learning with
General Model Classes [27.361399036211694]
We address model selection for the finite-horizon episodic Reinforcement Learning (RL) problem.
In the model selection framework, instead of $\mathcal{P}^*$, we are given $M$ nested families of transition kernels.
We show that ARL-GEN obtains a regret of $\tilde{\mathcal{O}}(d_{\mathcal{E}}^* H^2 + \sqrt{d_{\mathcal{E}}^* \mathbb{M}^* H^2 T})$.
arXiv Detail & Related papers (2021-07-13T05:00:38Z) - Nearly Horizon-Free Offline Reinforcement Learning [97.36751930393245]
We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes with $S$ states, $A$ actions and planning horizon $H$.
We obtain the first set of nearly $H$-free sample complexity bounds for evaluation and planning using the empirical MDPs.
arXiv Detail & Related papers (2021-03-25T18:52:17Z) - Variance-Aware Confidence Set: Variance-Dependent Bound for Linear
Bandits and Horizon-Free Bound for Linear Mixture MDP [76.94328400919836]
We show how to construct variance-aware confidence sets for linear bandits and linear mixture Markov Decision Processes (MDPs).
For linear bandits, we obtain a $\widetilde{O}(\mathrm{poly}(d) \sqrt{1 + \sum_{i=1}^{K} \sigma_i^2})$ regret bound, where $d$ is the feature dimension.
For linear mixture MDPs, we obtain a $\widetilde{O}(\mathrm{poly}(d) \sqrt{K})$ regret bound.
arXiv Detail & Related papers (2021-01-29T18:57:52Z) - Taking a hint: How to leverage loss predictors in contextual bandits? [63.546913998407405]
We study learning in contextual bandits with the help of loss predictors.
We show that the optimal regret is $\mathcal{O}(\min\{\sqrt{T}, \sqrt{\mathcal{E}}\, T^{\frac{1}{3}}\})$ when $\mathcal{E}$ is known.
arXiv Detail & Related papers (2020-03-04T07:36:38Z)