Bandit Labor Training
- URL: http://arxiv.org/abs/2006.06853v5
- Date: Wed, 16 Mar 2022 19:23:59 GMT
- Title: Bandit Labor Training
- Authors: Eren Ozbay, Vijay Kamble
- Abstract summary: On-demand labor platforms aim to train a skilled workforce to serve their incoming demand for jobs.
Since limited jobs are available for training, and it is usually not necessary to train all workers, efficient matching of training jobs requires prioritizing fast learners over slow ones.
We show that any policy must incur an instance-dependent regret of $\Omega(\log T)$ and a worst-case regret of $\Omega(K^{1/3}T^{2/3})$.
- Score: 2.28438857884398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-demand labor platforms aim to train a skilled workforce to serve their
incoming demand for jobs. Since limited jobs are available for training, and it
is usually not necessary to train all workers, efficient matching of training
jobs requires prioritizing fast learners over slow ones. However, the learning
rates of novice workers are unknown, resulting in a tradeoff between
exploration (learning the learning rates) and exploitation (training the best
workers). Motivated to study this tradeoff, we analyze a novel objective within
the stochastic multi-armed bandit framework. Given $K$ arms, instead of
maximizing the expected total reward from $T$ pulls (the traditional "sum"
objective), we consider the vector of cumulative rewards earned from the $K$
arms at the end of $T$ pulls and aim to maximize the expected highest
cumulative reward (the "max" objective). When rewards represent skill
increments, this corresponds to the objective of training a single highly
skilled worker from a set of novice workers, using a limited supply of training
jobs. For this objective, we show that any policy must incur an
instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher
instance-dependent constant) and a worst-case regret of
$\Omega(K^{1/3}T^{2/3})$. We then design an explore-then-commit policy
featuring exploration based on appropriately tuned confidence bounds on the
mean reward and an adaptive stopping criterion, which adapts to the problem
difficulty and achieves these bounds (up to logarithmic factors). We generalize
our algorithmic insights to the problem of maximizing the expected value of the
average cumulative reward of the top $m$ arms with the highest cumulative
rewards, corresponding to the case where multiple workers must be trained. Our
numerical experiments demonstrate the efficacy of our policies compared to
several natural alternatives in practical parameter regimes.
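
To make the "max" objective and the explore-then-commit structure described above concrete, the following is a minimal Python sketch based only on this abstract. It is an illustrative reading, not the authors' exact policy: the function name etc_max_objective, the pull(k) interface, the Hoeffding-style confidence radius, and the (T/K)^{2/3} cap on exploration rounds are all assumptions made for the sketch.

```python
import numpy as np


def etc_max_objective(pull, K, T, delta=0.05, explore_cap=None):
    """Illustrative explore-then-commit sketch for the 'max' objective.

    pull(k) should return a stochastic skill increment in [0, 1] for arm k.
    The confidence radius and exploration cap are illustrative assumptions,
    not the paper's exact tuning.
    """
    if explore_cap is None:
        # Cap on round-robin exploration rounds, loosely inspired by the
        # Omega(K^{1/3} T^{2/3}) worst-case bound (an assumption).
        explore_cap = max(1, int((T / K) ** (2.0 / 3.0)))

    counts = np.zeros(K)   # pulls per arm
    sums = np.zeros(K)     # cumulative reward (skill) per arm
    t = 0

    # Exploration phase: round-robin pulls with an adaptive stopping rule.
    for _ in range(explore_cap):
        for k in range(K):
            if t >= T:
                break
            sums[k] += pull(k)
            counts[k] += 1
            t += 1
        if t >= T:
            break
        means = sums / np.maximum(counts, 1)
        radius = np.sqrt(np.log(max(T, 2) / delta) / (2 * np.maximum(counts, 1)))
        lcb, ucb = means - radius, means + radius
        leader = int(np.argmax(means))
        # Adaptive stopping: stop exploring once the leader's lower confidence
        # bound dominates every other arm's upper confidence bound.
        if all(lcb[leader] >= ucb[k] for k in range(K) if k != leader):
            break

    # Commit phase: spend all remaining pulls on the empirically best arm.
    leader = int(np.argmax(sums / np.maximum(counts, 1)))
    while t < T:
        sums[leader] += pull(leader)
        counts[leader] += 1
        t += 1

    # 'Max' objective: the highest cumulative reward among the K arms at time T.
    return float(np.max(sums)), leader


# Example: 5 novice workers with unknown Bernoulli learning rates and 2000 jobs.
rng = np.random.default_rng(0)
rates = rng.uniform(0.2, 0.8, size=5)
best_skill, trained_worker = etc_max_objective(
    lambda k: float(rng.binomial(1, rates[k])), K=5, T=2000
)
print(best_skill, trained_worker)
```

Note the difference from the usual "sum" objective: after committing, every remaining job is funneled to a single arm so that one cumulative reward grows as fast as possible, which is what the max objective rewards.
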
Related papers
- Survival Multiarmed Bandits with Bootstrapping Methods [0.0]
The Survival Multiarmed Bandits (S-MAB) problem is an extension that constrains an agent to a budget tied to observed rewards.
This paper presents a framework that addresses this dual goal of maximizing reward while avoiding ruin, using an objective function balanced by a ruin-aversion component.
arXiv Detail & Related papers (2024-10-21T20:21:10Z)
- Reinforcement Learning with Quasi-Hyperbolic Discounting [2.3999111269325266]
Quasi-Hyperbolic (QH) discounting is a simple alternative for modeling the present bias observed in how humans discount future rewards.
Our work significantly advances the practical application of reinforcement learning.
arXiv Detail & Related papers (2024-09-16T06:00:42Z)
- Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk - a new Markov chain Monte Carlo method based on this insight.
arXiv Detail & Related papers (2024-07-15T17:59:52Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Explore to Generalize in Zero-Shot RL [38.43215023828472]
We study zero-shot generalization in reinforcement learning: optimizing a policy on a set of training tasks to perform well on a similar but unseen test task.
We show that our approach is the state-of-the-art on tasks of the ProcGen challenge that have thus far eluded effective generalization, yielding a success rate of 83% on the Maze task and 74% on Heist with 200 training levels.
arXiv Detail & Related papers (2023-06-05T17:49:43Z)
- Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of the DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with the SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z)
- Skill-Based Reinforcement Learning with Intrinsic Reward Matching [77.34726150561087]
We present Intrinsic Reward Matching (IRM), which unifies task-agnostic skill pretraining and task-aware finetuning.
IRM enables us to utilize pretrained skills far more effectively than previous skill selection methods.
arXiv Detail & Related papers (2022-10-14T00:04:49Z)
- Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)
- Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2 A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2 A H^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
- Hierarchical Reinforcement Learning as a Model of Human Task Interleaving [60.95424607008241]
We develop a hierarchical model of supervisory control driven by reinforcement learning.
The model reproduces known empirical effects of task interleaving.
The results support hierarchical RL as a plausible model of task interleaving.
arXiv Detail & Related papers (2020-01-04T17:53:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.