Policy Optimization with Advantage Regularization for Long-Term Fairness
in Decision Systems
- URL: http://arxiv.org/abs/2210.12546v1
- Date: Sat, 22 Oct 2022 20:41:36 GMT
- Title: Policy Optimization with Advantage Regularization for Long-Term Fairness
in Decision Systems
- Authors: Eric Yang Yu, Zhizhen Qin, Min Kyung Lee, Sicun Gao
- Abstract summary: Long-term fairness is an important consideration in designing and deploying learning-based decision systems.
Recent work has proposed the use of Markov Decision Processes (MDPs) to formulate decision-making with long-term fairness requirements.
We show that policy optimization methods from deep reinforcement learning can be used to find strictly better decision policies.
- Score: 14.095401339355677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-term fairness is an important consideration in designing and
deploying learning-based decision systems in high-stakes decision-making
contexts. Recent work has proposed the use of Markov Decision Processes (MDPs)
to formulate decision-making with long-term fairness requirements in
dynamically changing environments, and demonstrated major challenges in
directly deploying heuristic and rule-based policies that worked well in static
environments. We show that policy optimization methods from deep reinforcement
learning can be used to find strictly better decision policies that can often
achieve both higher overall utility and less violation of the fairness
requirements, compared to previously known strategies. In particular, we
propose new methods for imposing fairness requirements in policy optimization
by regularizing the advantage evaluation of different actions. Our proposed
methods make it easy to impose fairness constraints without reward engineering
or sacrificing training efficiency. We perform detailed analyses in three
established case studies, including attention allocation in incident
monitoring, bank loan approval, and vaccine distribution in population
networks.
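The central mechanism, regularizing the advantage estimates used during policy optimization, can be sketched as follows. This is a minimal illustrative sketch in the spirit of the abstract, not the authors' released code: it assumes a PPO-style clipped update in which a per-action fairness penalty is subtracted from the estimated advantages before the policy update, and the names `fairness_violation`, `lam_fair`, and `policy.log_prob` are hypothetical placeholders.

```python
# Hedged sketch: fairness-regularized advantages in a PPO-style update.
# Not the paper's exact regularizer; placeholder names are marked below.
import torch

def regularized_advantages(advantages, fairness_violation, lam_fair=1.0):
    # Actions estimated to increase the fairness violation get a lower
    # advantage, so the policy update is steered away from them.
    # `fairness_violation` and `lam_fair` are illustrative placeholders.
    return advantages - lam_fair * fairness_violation

def ppo_loss(policy, states, actions, old_log_probs, advantages,
             fairness_violation, clip_eps=0.2, lam_fair=1.0):
    adv = regularized_advantages(advantages, fairness_violation, lam_fair)
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)     # common PPO normalization
    new_log_probs = policy.log_prob(states, actions)  # hypothetical interface
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Standard clipped surrogate objective; negated because optimizers minimize.
    return -torch.min(ratio * adv, clipped * adv).mean()
```

Because the fairness term enters only through the advantages, the surrounding training loop and reward definition stay unchanged, which matches the abstract's claim of avoiding reward engineering.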
Related papers
- Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning [7.085987593010675]
This work investigates the offline formulation of the contextual bandit problem.
The goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies.
We introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators.
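A minimal sketch of the plain importance-weighted estimator that such bounds concern appears after this list.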
arXiv Detail & Related papers (2024-05-23T09:07:27Z)
- Conditions on Preference Relations that Guarantee the Existence of Optimal Policies [38.17324903156351]
We introduce a new framework for analyzing learning-from-preference-feedback (LfPF) problems in partially-observable, non-Markovian environments.
We show that a decision-making problem can have optimal policies even when no reward function can express the learning goal.
arXiv Detail & Related papers (2023-11-03T15:42:12Z)
- Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate [41.51680686036846]
We introduce a long-term fairness concept named Equal Long-term Benefit Rate (ELBERT) to address biases in sequential decision-making.
ELBERT effectively addresses the temporal discrimination issues found in previous long-term fairness notions.
We show that ELBERT-PO significantly diminishes bias while maintaining high utility.
arXiv Detail & Related papers (2023-09-07T01:10:01Z)
- Reinforcement Learning with Stepwise Fairness Constraints [50.538878453547966]
We introduce the study of reinforcement learning with stepwise fairness constraints.
We provide learning algorithms with strong theoretical guarantees in regard to policy optimality and fairness violation.
arXiv Detail & Related papers (2022-11-08T04:06:23Z)
- Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment [0.0]
We develop a robust optimization approach that partially identifies the expected utility of a policy, and then finds an optimal policy.
We extend this approach to common and important settings where humans make decisions with the aid of algorithmic recommendations.
We derive new classification and recommendation rules that retain the transparency and interpretability of the existing risk assessment instrument.
arXiv Detail & Related papers (2021-09-22T00:52:03Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with the state of the art on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- Universal Trading for Order Execution with Oracle Policy Distillation [99.57416828489568]
We propose a novel universal trading policy optimization framework to bridge the gap between the noisy and imperfect market states and the optimal action sequences for order execution.
We show that our framework can better guide the learning of the common policy towards practically optimal execution by an oracle teacher with perfect information.
arXiv Detail & Related papers (2021-01-28T05:52:18Z)
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
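A minimal sketch of a mutual-information penalty of this kind appears after this list.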
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- SOAC: The Soft Option Actor-Critic Architecture [25.198302636265286]
Methods have been proposed for concurrently learning low-level intra-option policies and a high-level option-selection policy.
Existing methods typically suffer from two major challenges: ineffective exploration and unstable updates.
We present a novel and stable off-policy approach that builds on the maximum entropy model to address these challenges.
arXiv Detail & Related papers (2020-06-25T13:06:59Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
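Two of the entries above mention concrete mechanisms that are easy to illustrate. First, for the off-policy evaluation entry (Logarithmic Smoothing), the following is a minimal sketch of the plain importance-weighted (IPS) value estimator that pessimistic off-policy selection starts from; the paper's logarithmic smoothing and its concentration bounds are not reproduced, and the optional weight clipping is shown only as a simple, commonly used stand-in.

```python
# Hedged sketch: plain importance-weighted (IPS) off-policy value estimate
# from logged contextual-bandit data. Not the paper's smoothed estimator.
import numpy as np

def ips_estimate(rewards, target_probs, behavior_probs, clip=None):
    # rewards[i]        : reward observed for the logged action
    # target_probs[i]   : probability the target policy assigns to that action
    # behavior_probs[i] : probability the behavior (logging) policy assigned to it
    # clip              : optional cap on the importance weights (a simple,
    #                     illustrative stand-in for the paper's smoothing)
    w = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    if clip is not None:
        w = np.minimum(w, clip)
    return float(np.mean(w * np.asarray(rewards, dtype=float)))
```

Second, for the privacy-constrained policies entry, the following is a hedged sketch of one way to penalize the mutual information between a discrete sensitive state and the actions; the paper's actual estimator and training procedure may differ, and `lam_mi` is an illustrative coefficient.

```python
# Hedged sketch: Monte Carlo mutual-information penalty I(S; A) added to a
# policy loss, assuming a discrete action space and a batch of sensitive states.
import torch

def mi_penalty(action_probs, eps=1e-8):
    # action_probs: [batch, n_actions]; row i is pi(. | s_i) for a sampled s_i.
    # I(S; A) ~= mean_i KL( pi(. | s_i) || mean_j pi(. | s_j) ).
    marginal = action_probs.mean(dim=0, keepdim=True)
    kl = (action_probs * (torch.log(action_probs + eps)
                          - torch.log(marginal + eps))).sum(dim=1)
    return kl.mean()

def private_policy_loss(policy_loss, action_probs, lam_mi=0.1):
    # Trade off task performance against information the actions leak about S.
    return policy_loss + lam_mi * mi_penalty(action_probs)
```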
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.