Bandit Labor Training
- URL: http://arxiv.org/abs/2006.06853v5
- Date: Wed, 16 Mar 2022 19:23:59 GMT
- Title: Bandit Labor Training
- Authors: Eren Ozbay, Vijay Kamble
- Abstract summary: On-demand labor platforms aim to train a skilled workforce to serve their incoming demand for jobs.
Since limited jobs are available for training, and it is usually not necessary to train all workers, efficient matching of training jobs requires prioritizing fast learners over slow ones.
We show that any policy must incur an instance-dependent regret of $\Omega(\log T)$ and a worst-case regret of $\Omega(K^{1/3}T^{2/3})$.
- Score: 2.28438857884398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: On-demand labor platforms aim to train a skilled workforce to serve their
incoming demand for jobs. Since limited jobs are available for training, and it
is usually not necessary to train all workers, efficient matching of training
jobs requires prioritizing fast learners over slow ones. However, the learning
rates of novice workers are unknown, resulting in a tradeoff between
exploration (learning the learning rates) and exploitation (training the best
workers). Motivated to study this tradeoff, we analyze a novel objective within
the stochastic multi-armed bandit framework. Given $K$ arms, instead of
maximizing the expected total reward from $T$ pulls (the traditional "sum"
objective), we consider the vector of cumulative rewards earned from the $K$
arms at the end of $T$ pulls and aim to maximize the expected highest
cumulative reward (the "max" objective). When rewards represent skill
increments, this corresponds to the objective of training a single highly
skilled worker from a set of novice workers, using a limited supply of training
jobs. For this objective, we show that any policy must incur an
instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher
instance-dependent constant) and a worst-case regret of
$\Omega(K^{1/3}T^{2/3})$. We then design an explore-then-commit policy
featuring exploration based on appropriately tuned confidence bounds on the
mean reward and an adaptive stopping criterion, which adapts to the problem
difficulty and achieves these bounds (up to logarithmic factors). We generalize
our algorithmic insights to the problem of maximizing the expected value of the
average cumulative reward of the top $m$ arms with the highest cumulative
rewards, corresponding to the case where multiple workers must be trained. Our
numerical experiments demonstrate the efficacy of our policies compared to
several natural alternatives in practical parameter regimes.
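
To make the "max" objective and the explore-then-commit structure described above concrete, the following is a minimal Python sketch based only on this abstract. It is an illustrative reading, not the authors' exact policy: the function name etc_max_objective, the pull(k) interface, the Hoeffding-style confidence radius, and the (T/K)^{2/3} cap on exploration rounds are all assumptions made for the sketch.

```python
import numpy as np


def etc_max_objective(pull, K, T, delta=0.05, explore_cap=None):
    """Illustrative explore-then-commit sketch for the 'max' objective.

    pull(k) should return a stochastic skill increment in [0, 1] for arm k.
    The confidence radius and exploration cap are illustrative assumptions,
    not the paper's exact tuning.
    """
    if explore_cap is None:
        # Cap on round-robin exploration rounds, loosely inspired by the
        # Omega(K^{1/3} T^{2/3}) worst-case bound (an assumption).
        explore_cap = max(1, int((T / K) ** (2.0 / 3.0)))

    counts = np.zeros(K)   # pulls per arm
    sums = np.zeros(K)     # cumulative reward (skill) per arm
    t = 0

    # Exploration phase: round-robin pulls with an adaptive stopping rule.
    for _ in range(explore_cap):
        for k in range(K):
            if t >= T:
                break
            sums[k] += pull(k)
            counts[k] += 1
            t += 1
        if t >= T:
            break
        means = sums / np.maximum(counts, 1)
        radius = np.sqrt(np.log(max(T, 2) / delta) / (2 * np.maximum(counts, 1)))
        lcb, ucb = means - radius, means + radius
        leader = int(np.argmax(means))
        # Adaptive stopping: stop exploring once the leader's lower confidence
        # bound dominates every other arm's upper confidence bound.
        if all(lcb[leader] >= ucb[k] for k in range(K) if k != leader):
            break

    # Commit phase: spend all remaining pulls on the empirically best arm.
    leader = int(np.argmax(sums / np.maximum(counts, 1)))
    while t < T:
        sums[leader] += pull(leader)
        counts[leader] += 1
        t += 1

    # 'Max' objective: the highest cumulative reward among the K arms at time T.
    return float(np.max(sums)), leader


# Example: 5 novice workers with unknown Bernoulli learning rates and 2000 jobs.
rng = np.random.default_rng(0)
rates = rng.uniform(0.2, 0.8, size=5)
best_skill, trained_worker = etc_max_objective(
    lambda k: float(rng.binomial(1, rates[k])), K=5, T=2000
)
print(best_skill, trained_worker)
```

Note the difference from the usual "sum" objective: after committing, every remaining job is funneled to a single arm so that one cumulative reward grows as fast as possible, which is what the max objective rewards.
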
Related papers
- Survival Multiarmed Bandits with Bootstrapping Methods [0.0]
The Survival Multiarmed Bandits (S-MAB) problem is an extension that constrains an agent to a budget tied to observed rewards.
This paper presents a framework that addresses this dual goal of maximizing reward while avoiding ruin, using an objective function balanced by a ruin-aversion component.
arXiv Detail & Related papers (2024-10-21T20:21:10Z)
- Reinforcement Learning with Quasi-Hyperbolic Discounting [2.3999111269325266]
Quasi-Hyperbolic (QH) discounting is a simple alternative for modeling the present bias observed in how humans discount future rewards.
Our work significantly advances the practical application of reinforcement learning.
arXiv Detail & Related papers (2024-09-16T06:00:42Z)
- Walking the Values in Bayesian Inverse Reinforcement Learning [66.68997022043075]
A key challenge in Bayesian IRL is bridging the computational gap between the hypothesis space of possible rewards and the likelihood.
We propose ValueWalk - a new Markov chain Monte Carlo method based on this insight.
arXiv Detail & Related papers (2024-07-15T17:59:52Z)
- STARC: A General Framework For Quantifying Differences Between Reward Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z)
- Explore to Generalize in Zero-Shot RL [38.43215023828472]
We study zero-shot generalization in reinforcement learning: optimizing a policy on a set of training tasks to perform well on a similar but unseen test task.
We show that our approach is the state-of-the-art on tasks of the ProcGen challenge that have thus far eluded effective generalization, yielding a success rate of 83% on the Maze task and 74% on Heist with 200 training levels.
arXiv Detail & Related papers (2023-06-05T17:49:43Z)
- Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design the multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of the DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with the SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z)
- Skill-Based Reinforcement Learning with Intrinsic Reward Matching [77.34726150561087]
We present Intrinsic Reward Matching (IRM), which unifies task-agnostic skill pretraining and task-aware finetuning.
IRM enables us to utilize pretrained skills far more effectively than previous skill selection methods.
arXiv Detail & Related papers (2022-10-14T00:04:49Z)
- Semi-supervised reward learning for offline reinforcement learning [71.6909757718301]
Training agents usually requires reward functions, but rewards are seldom available in practice and their engineering is challenging and laborious.
We propose semi-supervised learning algorithms that learn from limited annotations and incorporate unlabelled data.
In our experiments with a simulated robotic arm, we greatly improve upon behavioural cloning and closely approach the performance achieved with ground truth rewards.
arXiv Detail & Related papers (2020-12-12T20:06:15Z)
- Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2 A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2 A H^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
- Hierarchical Reinforcement Learning as a Model of Human Task Interleaving [60.95424607008241]
We develop a hierarchical model of supervisory control driven by reinforcement learning.
The model reproduces known empirical effects of task interleaving.
The results support hierarchical RL as a plausible model of task interleaving.
arXiv Detail & Related papers (2020-01-04T17:53:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.