Addressing the Long-term Impact of ML Decisions via Policy Regret
- URL: http://arxiv.org/abs/2106.01325v1
- Date: Wed, 2 Jun 2021 17:38:10 GMT
- Title: Addressing the Long-term Impact of ML Decisions via Policy Regret
- Authors: David Lindner and Hoda Heidari and Andreas Krause
- Abstract summary: We study a setting in which the reward from each arm evolves every time the decision-maker pulls that arm.
We argue that an acceptable sequential allocation of opportunities must take an arm's potential for growth into account.
We present an algorithm with provably sub-linear policy regret for sufficiently long time horizons.
- Score: 49.92903850297013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Machine Learning (ML) increasingly informs the allocation of opportunities to
individuals and communities in areas such as lending, education, employment,
and beyond. Such decisions often impact their subjects' future characteristics
and capabilities in an a priori unknown fashion. The decision-maker, therefore,
faces exploration-exploitation dilemmas akin to those in multi-armed bandits.
Following prior work, we model communities as arms. To capture the long-term
effects of ML-based allocation decisions, we study a setting in which the
reward from each arm evolves every time the decision-maker pulls that arm. We
focus on reward functions that are initially increasing in the number of pulls
but may become (and remain) decreasing after a certain point. We argue that an
acceptable sequential allocation of opportunities must take an arm's potential
for growth into account. We capture these considerations through the notion of
policy regret, a much stronger notion than the often-studied external regret,
and present an algorithm with provably sub-linear policy regret for
sufficiently long time horizons. We empirically compare our algorithm with
several baselines and find that it consistently outperforms them, in particular
for long time horizons.
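To make the abstract's setting concrete, below is a small, self-contained sketch (our own toy construction, not code or notation from the paper): each arm's expected reward depends only on how many times that arm has been pulled, rising to a peak and then declining, and policy regret is measured, roughly speaking, against the best sequence of pulls under that evolution rather than against the best single arm evaluated on the learner's own realized rewards.

```python
from itertools import product

# Toy illustration (our own construction, not code from the paper) of the rested reward
# model described above: each arm's expected reward depends only on how many times that
# arm has been pulled, rising to a peak and then declining. The exact shapes are made up.

PEAKS = {0: 2, 1: 6}        # arm 1 needs more pulls before it reaches its peak ...
HEIGHTS = {0: 0.5, 1: 1.0}  # ... but eventually pays out more per pull than arm 0

def reward(arm, k):
    """Expected reward of `arm` on its k-th pull (k = 1, 2, ...)."""
    peak, height = PEAKS[arm], HEIGHTS[arm]
    if k <= peak:
        return height * k / peak                    # growth phase
    return max(0.0, height - 0.1 * (k - peak))      # decline after the peak

def total_reward(pulls):
    """Cumulative reward of a fixed pull sequence under the reward model above."""
    counts = {0: 0, 1: 0}
    total = 0.0
    for arm in pulls:
        counts[arm] += 1
        total += reward(arm, counts[arm])
    return total

T = 8
# Policy-regret benchmark: the best sequence of pulls, found by brute force for small T.
best = max(product([0, 1], repeat=T), key=total_reward)
# A myopic baseline that never gives the slow-growing arm a chance to mature.
myopic = [0] * T

print("best-sequence reward :", round(total_reward(best), 2))    # ~5.2 (pulls arm 1 throughout)
print("myopic reward        :", round(total_reward(myopic), 2))  # ~1.75
print("policy regret        :", round(total_reward(best) - total_reward(myopic), 2))
```

In this toy instance the myopic baseline keeps playing the arm that pays most early on, but incurs large policy regret because the benchmark accounts for how arm 1's reward would have grown had it been given the pulls, which is exactly the potential for growth the paper argues an acceptable allocation must take into account.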
Related papers
- Long-Term Fairness in Sequential Multi-Agent Selection with Positive Reinforcement [21.44063458579184]
In selection processes such as college admissions or hiring, biasing slightly towards applicants from under-represented groups is hypothesized to provide positive feedback.
We propose the Multi-agent Fair-Greedy policy, which balances greedy score and fairness.
Our results indicate that, while positive reinforcement is a promising mechanism for long-term fairness, policies must be designed carefully to be robust to variations in the evolution model.
arXiv Detail & Related papers (2024-07-10T04:03:23Z)
- Reduced-Rank Multi-objective Policy Learning and Optimization [57.978477569678844]
In practice, causal researchers do not have a single outcome in mind a priori.
In government-assisted social benefit programs, policymakers collect many outcomes to understand the multidimensional nature of poverty.
We present a data-driven dimensionality-reduction methodology for multiple outcomes in the context of optimal policy learning.
arXiv Detail & Related papers (2024-04-29T08:16:30Z)
- Multi-Armed Bandits with Generalized Temporally-Partitioned Rewards [0.4194295877935867]
In some real-world applications, feedback about a decision is delayed and may arrive via partial rewards that are observed with different delays.
We propose a novel problem formulation called multi-armed bandits with generalized temporally-partitioned rewards.
We derive a lower bound on the performance of any uniformly efficient algorithm for the considered problem.
arXiv Detail & Related papers (2023-03-01T16:22:22Z)
- Mitigating Disparity while Maximizing Reward: Tight Anytime Guarantee for Improving Bandits [6.537940428724029]
We study the Improving Multi-Armed Bandit (IMAB) problem, where the reward obtained from an arm increases with the number of pulls it receives.
Our main contribution is an anytime algorithm for the IMAB problem that achieves the best possible cumulative reward.
arXiv Detail & Related papers (2022-08-19T10:23:40Z)
- The Paradox of Choice: Using Attention in Hierarchical Reinforcement Learning [59.777127897688594]
We present an online, model-free algorithm to learn affordances that can be used to further learn subgoal options.
We investigate the role of hard versus soft attention in training data collection, abstract value learning in long-horizon tasks, and handling a growing number of choices.
arXiv Detail & Related papers (2022-01-24T13:18:02Z)
- Stateful Strategic Regression [20.7177095411398]
We describe the Stackelberg equilibrium of the resulting game and provide novel algorithms for computing it.
Our analysis reveals several intriguing insights about the role of multiple interactions in shaping the game's outcome.
Most importantly, we show that with multiple rounds of interaction at her disposal, the principal is more effective at incentivizing the agent to accumulate effort in her desired direction.
arXiv Detail & Related papers (2021-06-07T17:46:29Z)
- Risk Aware and Multi-Objective Decision Making with Distributional Monte Carlo Tree Search [3.487620847066216]
We propose an algorithm that learns a posterior distribution over the utility of the different possible returns attainable from individual policy executions.
Our algorithm outperforms the state-of-the-art in multi-objective reinforcement learning for the expected utility of the returns.
arXiv Detail & Related papers (2021-02-01T16:47:39Z)
- Off-Policy Evaluation of Slate Policies under Bayes Risk [70.10677881866047]
We study the problem of off-policy evaluation for slate bandits, for the typical case in which the logging policy factorizes over the slots of the slate.
We show that the risk improvement over the PI (pseudoinverse) estimator grows linearly with the number of slots, and linearly with the gap between the arithmetic and the harmonic mean of a set of slot-level divergences.
arXiv Detail & Related papers (2021-01-05T20:07:56Z)
- Online Model Selection: a Rested Bandit Formulation [49.69377391589057]
We introduce and analyze a best arm identification problem in the rested bandit setting.
We define a novel notion of regret for this problem, where we compare to the policy that always plays the arm having the smallest expected loss at the end of the game.
Unlike known model selection efforts in the recent bandit literature, our algorithm exploits the specific structure of the problem to learn the unknown parameters of the expected loss function.
arXiv Detail & Related papers (2020-12-07T08:23:08Z)
- Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
We consider online learning for episodically constrained Markov decision processes (CMDPs).
We propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model.
Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning.
arXiv Detail & Related papers (2020-03-02T05:02:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.