Joint Online Learning and Decision-making via Dual Mirror Descent
- URL: http://arxiv.org/abs/2104.09750v1
- Date: Tue, 20 Apr 2021 04:02:07 GMT
- Title: Joint Online Learning and Decision-making via Dual Mirror Descent
- Authors: Alfonso Lobos, Paul Grigas, Zheng Wen
- Abstract summary: We consider an online revenue maximization problem over a finite time horizon subject to lower and upper bounds on cost.
We propose a novel offline benchmark and a new algorithm that mixes an online dual mirror descent scheme with a generic parameter learning process.
When the parameter is not known and must be learned, we demonstrate that the regret and constraint violations are the sums of the previous $O(\sqrt{T})$ terms plus terms that directly depend on the convergence of the learning process.
- Score: 20.560099149143245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider an online revenue maximization problem over a finite time horizon
subject to lower and upper bounds on cost. At each period, an agent receives a
context vector sampled i.i.d. from an unknown distribution and needs to make a
decision adaptively. The revenue and cost functions depend on the context
vector as well as some fixed but possibly unknown parameter vector to be
learned. We propose a novel offline benchmark and a new algorithm that mixes an
online dual mirror descent scheme with a generic parameter learning process.
When the parameter vector is known, we demonstrate an $O(\sqrt{T})$ regret
result as well as an $O(\sqrt{T})$ bound on the possible constraint violations.
When the parameter is not known and must be learned, we demonstrate that the
regret and constraint violations are the sums of the previous $O(\sqrt{T})$
terms plus terms that directly depend on the convergence of the learning
process.
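To make the algorithmic idea in the abstract concrete, here is a minimal sketch of a dual mirror descent loop under several simplifying assumptions: a single upper bound on cumulative cost with per-period budget rho, a Euclidean mirror map (so the dual step reduces to projected online gradient ascent), a finite action set, and a known parameter vector. The function names and the $1/\sqrt{T}$ step size are illustrative; this is not the authors' exact algorithm.

```python
import numpy as np

def dual_mirror_descent(T, rho, revenue_fn, cost_fn, contexts, actions, eta=None):
    """Hedged sketch: online revenue maximization with a single cost budget.

    T          -- horizon; rho is the per-period budget (total budget = rho * T)
    revenue_fn -- revenue_fn(context, action) -> float   (assumed known here)
    cost_fn    -- cost_fn(context, action) -> float      (assumed known here)
    contexts   -- sequence of context vectors, sampled i.i.d. from an unknown distribution
    actions    -- finite set of candidate decisions
    """
    eta = 1.0 / np.sqrt(T) if eta is None else eta  # illustrative step size
    lam = 0.0                                       # dual variable for the cost constraint
    total_revenue, total_cost = 0.0, 0.0
    for t in range(T):
        z = contexts[t]
        # Primal step: best response to the current dual price lam.
        x = max(actions, key=lambda a: revenue_fn(z, a) - lam * cost_fn(z, a))
        total_revenue += revenue_fn(z, x)
        total_cost += cost_fn(z, x)
        # Dual step: mirror descent with the Euclidean mirror map, i.e. projected
        # gradient ascent that raises lam whenever the period's cost exceeds rho.
        lam = max(0.0, lam + eta * (cost_fn(z, x) - rho))
    return total_revenue, total_cost
```

In the paper's setting, revenue_fn and cost_fn would additionally depend on the unknown parameter vector, so each round would also query the generic learning process for its current parameter estimate, and the lower bound on cost would introduce a second dual variable.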
Related papers
- Agnostic Smoothed Online Learning [5.167069404528051]
We propose an algorithm to guarantee sublinear regret for smoothed online learning without prior knowledge of $\mu$.
R-Cover has adaptive regret $\tilde{O}(\sqrt{dT/\sigma})$ for function classes with dimension $d$, which is optimal up to logarithmic factors.
arXiv Detail & Related papers (2024-10-07T15:25:21Z) - Variance-Aware Regret Bounds for Stochastic Contextual Dueling Bandits [53.281230333364505]
This paper studies the problem of contextual dueling bandits, where the binary comparison of dueling arms is generated from a generalized linear model (GLM).
We propose a new SupLinUCB-type algorithm that enjoys computational efficiency and a variance-aware regret bound $\tilde{O}\big(d\sqrt{\sum_{t=1}^{T}\sigma_t^2} + d\big)$.
Our regret bound naturally aligns with the intuitive expectation: in scenarios where the comparison is deterministic, the algorithm only suffers from an $\tilde{O}(d)$ regret.
arXiv Detail & Related papers (2023-10-02T08:15:52Z) - Stochastic Contextual Bandits with Long Horizon Rewards [31.981315673799806]
This work takes a step in this direction by investigating contextual linear bandits where the current reward depends on at most $s$ prior actions.
We propose new algorithms that leverage sparsity to discover the dependence pattern and arm parameters jointly.
Our results necessitate new analysis to address long-range temporal dependencies across data and avoid dependence on the reward horizon $h$.
arXiv Detail & Related papers (2023-02-02T01:25:30Z) - Stochastic Contextual Dueling Bandits under Linear Stochastic
Transitivity Models [25.336599480692122]
We consider the regret minimization task in a dueling bandits problem with context information.
We propose a computationally efficient algorithm, $\texttt{CoLSTIM}$, which makes its choice based on imitating the feedback process.
Our experiments demonstrate its superiority over state-of-the-art algorithms for special cases of CoLST models.
arXiv Detail & Related papers (2022-02-09T17:44:19Z) - PDE-Based Optimal Strategy for Unconstrained Online Learning [40.61498562988079]
We present a framework that generates time-varying potential functions by solving a Partial Differential Equation (PDE).
Our framework recovers some classical potentials, and more importantly provides a systematic approach to design new ones.
This is the first parameter-free algorithm with optimal leading constant.
arXiv Detail & Related papers (2022-01-19T22:21:21Z) - Online Apprenticeship Learning [58.45089581278177]
In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
arXiv Detail & Related papers (2021-02-13T12:57:51Z) - Online Markov Decision Processes with Aggregate Bandit Feedback [74.85532145498742]
We study a novel variant of online finite-horizon Markov Decision Processes with adversarially changing loss functions and initially unknown dynamics.
In each episode, the learner suffers the loss accumulated along the trajectory realized by the policy chosen for the episode, and observes aggregate bandit feedback.
Our main result is a computationally efficient algorithm with $O(\sqrt{K})$ regret for this setting, where $K$ is the number of episodes.
arXiv Detail & Related papers (2021-01-31T16:49:07Z) - Upper Confidence Primal-Dual Reinforcement Learning for CMDP with
Adversarial Loss [145.54544979467872]
We consider online learning for episodic constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z) - Learning in Markov Decision Processes under Constraints [34.03325546283489]
We consider reinforcement learning in Markov Decision Processes in which an agent repeatedly interacts with an environment that is modeled by a controlled Markov process.
We design model-based RL algorithms that maximize the cumulative reward earned over a time horizon of $T$ time-steps.
We show how to reduce the regret of a desired subset of the $M$ costs, at the expense of increasing the regrets of rewards and the remaining costs.
arXiv Detail & Related papers (2020-02-27T20:58:39Z) - Online Learning with Imperfect Hints [72.4277628722419]
We develop algorithms and nearly matching lower bounds for online learning with imperfect directional hints.
Our algorithms are oblivious to the quality of the hints, and the regret bounds interpolate between the always-correlated hints case and the no-hints case.
arXiv Detail & Related papers (2020-02-11T23:06:09Z) - Naive Exploration is Optimal for Online LQR [49.681825576239355]
We show that the optimal regret scales as $\widetilde{\Theta}(\sqrt{d_{\mathbf{u}}^2 d_{\mathbf{x}} T})$, where $T$ is the number of time steps, $d_{\mathbf{u}}$ is the dimension of the input space, and $d_{\mathbf{x}}$ is the dimension of the system state.
Our lower bounds rule out the possibility of a $\mathrm{poly}(\log T)$-regret algorithm, which had been
arXiv Detail & Related papers (2020-01-27T03:44:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.