An Efficient Algorithm for Deep Stochastic Contextual Bandits
- URL: http://arxiv.org/abs/2104.05613v1
- Date: Mon, 12 Apr 2021 16:34:43 GMT
- Title: An Efficient Algorithm for Deep Stochastic Contextual Bandits
- Authors: Tan Zhu, Guannan Liang, Chunjiang Zhu, Haining Li, Jinbo Bi
- Abstract summary: In contextual bandit problems, an agent selects an action based on certain observed context to maximize the reward over iterations.
Recently there have been a few studies using a deep neural network (DNN) to predict the expected reward for an action, and is trained by a gradient based method.
- Score: 10.298368632706817
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In stochastic contextual bandit (SCB) problems, an agent selects an action
based on certain observed context to maximize the cumulative reward over
iterations. Recently there have been a few studies using a deep neural network
(DNN) to predict the expected reward for an action, and the DNN is trained by a
stochastic gradient based method. However, convergence analysis has been
greatly ignored to examine whether and where these methods converge. In this
work, we formulate the SCB that uses a DNN reward function as a non-convex
stochastic optimization problem, and design a stage-wise stochastic gradient
descent algorithm to optimize the problem and determine the action policy. We
prove that with high probability, the action sequence chosen by this algorithm
converges to a greedy action policy respecting a local optimal reward function.
Extensive experiments have been performed to demonstrate the effectiveness and
efficiency of the proposed algorithm on multiple real-world datasets.
Related papers
- Stochastic Unrolled Federated Learning [85.6993263983062]
We introduce UnRolled Federated learning (SURF), a method that expands algorithm unrolling to federated learning.
Our proposed method tackles two challenges of this expansion, namely the need to feed whole datasets to the unrolleds and the decentralized nature of federated learning.
arXiv Detail & Related papers (2023-05-24T17:26:22Z) - Constrained Reinforcement Learning via Dissipative Saddle Flow Dynamics [5.270497591225775]
In constrained reinforcement learning (C-RL), an agent seeks to learn from the environment a policy that maximizes the expected cumulative reward.
Several algorithms rooted in sampled-based primal-dual methods have been recently proposed to solve this problem in policy space.
We propose a novel algorithm for constrained RL that does not suffer from these limitations.
arXiv Detail & Related papers (2022-12-03T01:54:55Z) - Maximum-Likelihood Inverse Reinforcement Learning with Finite-Time
Guarantees [56.848265937921354]
Inverse reinforcement learning (IRL) aims to recover the reward function and the associated optimal policy.
Many algorithms for IRL have an inherently nested structure.
We develop a novel single-loop algorithm for IRL that does not compromise reward estimation accuracy.
arXiv Detail & Related papers (2022-10-04T17:13:45Z) - Structural Estimation of Markov Decision Processes in High-Dimensional
State Space with Finite-Time Guarantees [39.287388288477096]
We consider the task of estimating a structural model of dynamic decisions by a human agent based upon the observable history of implemented actions and visited states.
This problem has an inherent nested structure: in the inner problem, an optimal policy for a given reward function is identified while in the outer problem, a measure of fit is maximized.
We propose a single-loop estimation algorithm with finite time guarantees that is equipped to deal with high-dimensional state spaces.
arXiv Detail & Related papers (2022-10-04T00:11:38Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration of $O (1/epsilon4)$ for finding an $epsilon$stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O (1/epsilon4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Local policy search with Bayesian optimization [73.0364959221845]
Reinforcement learning aims to find an optimal policy by interaction with an environment.
Policy gradients for local search are often obtained from random perturbations.
We develop an algorithm utilizing a probabilistic model of the objective function and its gradient.
arXiv Detail & Related papers (2021-06-22T16:07:02Z) - Zeroth-Order Hybrid Gradient Descent: Towards A Principled Black-Box
Optimization Framework [100.36569795440889]
This work is on the iteration of zero-th-order (ZO) optimization which does not require first-order information.
We show that with a graceful design in coordinate importance sampling, the proposed ZO optimization method is efficient both in terms of complexity as well as as function query cost.
arXiv Detail & Related papers (2020-12-21T17:29:58Z) - Stochastic Compositional Gradient Descent under Compositional
constraints [13.170519806372075]
We study constrained optimization problems where the objective and constraint functions are convex and expressed as compositions of functions.
The problem arises in fair classification/regression and in the design of queuing systems.
We prove that the proposed algorithm is guaranteed to find the optimal and feasible solution almost surely.
arXiv Detail & Related papers (2020-12-17T05:38:37Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.