Policy Optimization with Advantage Regularization for Long-Term Fairness
in Decision Systems
- URL: http://arxiv.org/abs/2210.12546v1
- Date: Sat, 22 Oct 2022 20:41:36 GMT
- Title: Policy Optimization with Advantage Regularization for Long-Term Fairness
in Decision Systems
- Authors: Eric Yang Yu, Zhizhen Qin, Min Kyung Lee, Sicun Gao
- Abstract summary: Long-term fairness is an important consideration in designing and deploying learning-based decision systems.
Recent work has proposed the use of Markov Decision Processes (MDPs) to formulate decision-making with long-term fairness requirements.
We show that policy optimization methods from deep reinforcement learning can be used to find strictly better decision policies.
- Score: 14.095401339355677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-term fairness is an important consideration in designing and
deploying learning-based decision systems in high-stakes decision-making
contexts. Recent work has proposed the use of Markov Decision Processes (MDPs)
to formulate decision-making with long-term fairness requirements in
dynamically changing environments, and demonstrated major challenges in
directly deploying heuristic and rule-based policies that worked well in static
environments. We show that policy optimization methods from deep reinforcement
learning can be used to find strictly better decision policies that can often
achieve both higher overall utility and less violation of the fairness
requirements, compared to previously known strategies. In particular, we
propose new methods for imposing fairness requirements in policy optimization
by regularizing the advantage evaluation of different actions. Our proposed
methods make it easy to impose fairness constraints without reward engineering
or sacrificing training efficiency. We perform detailed analyses in three
established case studies, including attention allocation in incident
monitoring, bank loan approval, and vaccine distribution in population
networks.
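The central mechanism, regularizing the advantage estimates used during policy optimization, can be sketched as follows. This is a minimal illustrative sketch in the spirit of the abstract, not the authors' released code: it assumes a PPO-style clipped update in which a per-action fairness penalty is subtracted from the estimated advantages before the policy update, and the names `fairness_violation`, `lam_fair`, and `policy.log_prob` are hypothetical placeholders.

```python
# Hedged sketch: fairness-regularized advantages in a PPO-style update.
# Not the paper's exact regularizer; placeholder names are marked below.
import torch

def regularized_advantages(advantages, fairness_violation, lam_fair=1.0):
    # Actions estimated to increase the fairness violation get a lower
    # advantage, so the policy update is steered away from them.
    # `fairness_violation` and `lam_fair` are illustrative placeholders.
    return advantages - lam_fair * fairness_violation

def ppo_loss(policy, states, actions, old_log_probs, advantages,
             fairness_violation, clip_eps=0.2, lam_fair=1.0):
    adv = regularized_advantages(advantages, fairness_violation, lam_fair)
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)     # common PPO normalization
    new_log_probs = policy.log_prob(states, actions)  # hypothetical interface
    ratio = torch.exp(new_log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Standard clipped surrogate objective; negated because optimizers minimize.
    return -torch.min(ratio * adv, clipped * adv).mean()
```

Because the fairness term enters only through the advantages, the surrounding training loop and reward definition stay unchanged, which matches the abstract's claim of avoiding reward engineering.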
Related papers
- Logarithmic Smoothing for Pessimistic Off-Policy Evaluation, Selection and Learning [7.085987593010675]
This work investigates the offline formulation of the contextual bandit problem.
The goal is to leverage past interactions collected under a behavior policy to evaluate, select, and learn new, potentially better-performing, policies.
We introduce novel, fully empirical concentration bounds for a broad class of importance weighting risk estimators.
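A minimal sketch of the plain importance-weighted estimator that such bounds concern appears after this list.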
arXiv Detail & Related papers (2024-05-23T09:07:27Z)
- Conditions on Preference Relations that Guarantee the Existence of Optimal Policies [38.17324903156351]
We introduce a new framework for analyzing learning-from-preference-feedback (LfPF) problems in partially-observable, non-Markovian environments.
We show that a decision-making problem can have optimal policies even when no reward function can express the learning goal.
arXiv Detail & Related papers (2023-11-03T15:42:12Z)
- Adapting Static Fairness to Sequential Decision-Making: Bias Mitigation Strategies towards Equal Long-term Benefit Rate [41.51680686036846]
We introduce a long-term fairness concept named Equal Long-term Benefit Rate (ELBERT) to address biases in sequential decision-making.
ELBERT effectively addresses the temporal discrimination issues found in previous long-term fairness notions.
We show that ELBERT-PO significantly diminishes bias while maintaining high utility.
arXiv Detail & Related papers (2023-09-07T01:10:01Z)
- Reinforcement Learning with Stepwise Fairness Constraints [50.538878453547966]
We introduce the study of reinforcement learning with stepwise fairness constraints.
We provide learning algorithms with strong theoretical guarantees in regard to policy optimality and fairness violation.
arXiv Detail & Related papers (2022-11-08T04:06:23Z)
- Safe Policy Learning through Extrapolation: Application to Pre-trial Risk Assessment [0.0]
We develop a robust optimization approach that partially identifies the expected utility of a policy, and then finds an optimal policy.
We extend this approach to common and important settings where humans make decisions with the aid of algorithmic recommendations.
We derive new classification and recommendation rules that retain the transparency and interpretability of the existing risk assessment instrument.
arXiv Detail & Related papers (2021-09-22T00:52:03Z)
- Off-Policy Imitation Learning from Observations [78.30794935265425]
Learning from Observations (LfO) is a practical reinforcement learning scenario from which many applications can benefit.
We propose a sample-efficient LfO approach that enables off-policy optimization in a principled manner.
Our approach is comparable with the state of the art on locomotion tasks in terms of both sample efficiency and performance.
arXiv Detail & Related papers (2021-02-25T21:33:47Z)
- Universal Trading for Order Execution with Oracle Policy Distillation [99.57416828489568]
We propose a novel universal trading policy optimization framework to bridge the gap between the noisy and imperfect market states and the optimal action sequences for order execution.
We show that our framework can better guide the learning of the common policy towards practically optimal execution by an oracle teacher with perfect information.
arXiv Detail & Related papers (2021-01-28T05:52:18Z)
- Privacy-Constrained Policies via Mutual Information Regularized Policy Gradients [54.98496284653234]
We consider the task of training a policy that maximizes reward while minimizing disclosure of certain sensitive state variables through the actions.
We solve this problem by introducing a regularizer based on the mutual information between the sensitive state and the actions.
We develop a model-based estimator for optimization of privacy-constrained policies.
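A minimal sketch of a mutual-information penalty of this kind appears after this list.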
arXiv Detail & Related papers (2020-12-30T03:22:35Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- SOAC: The Soft Option Actor-Critic Architecture [25.198302636265286]
Methods have been proposed for concurrently learning low-level intra-option policies and a high-level option-selection policy.
Existing methods typically suffer from two major challenges: ineffective exploration and unstable updates.
We present a novel and stable off-policy approach that builds on the maximum entropy model to address these challenges.
arXiv Detail & Related papers (2020-06-25T13:06:59Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
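Two of the entries above mention concrete mechanisms that are easy to illustrate. First, for the off-policy evaluation entry (Logarithmic Smoothing), the following is a minimal sketch of the plain importance-weighted (IPS) value estimator that pessimistic off-policy selection starts from; the paper's logarithmic smoothing and its concentration bounds are not reproduced, and the optional weight clipping is shown only as a simple, commonly used stand-in.

```python
# Hedged sketch: plain importance-weighted (IPS) off-policy value estimate
# from logged contextual-bandit data. Not the paper's smoothed estimator.
import numpy as np

def ips_estimate(rewards, target_probs, behavior_probs, clip=None):
    # rewards[i]        : reward observed for the logged action
    # target_probs[i]   : probability the target policy assigns to that action
    # behavior_probs[i] : probability the behavior (logging) policy assigned to it
    # clip              : optional cap on the importance weights (a simple,
    #                     illustrative stand-in for the paper's smoothing)
    w = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    if clip is not None:
        w = np.minimum(w, clip)
    return float(np.mean(w * np.asarray(rewards, dtype=float)))
```

Second, for the privacy-constrained policies entry, the following is a hedged sketch of one way to penalize the mutual information between a discrete sensitive state and the actions; the paper's actual estimator and training procedure may differ, and `lam_mi` is an illustrative coefficient.

```python
# Hedged sketch: Monte Carlo mutual-information penalty I(S; A) added to a
# policy loss, assuming a discrete action space and a batch of sensitive states.
import torch

def mi_penalty(action_probs, eps=1e-8):
    # action_probs: [batch, n_actions]; row i is pi(. | s_i) for a sampled s_i.
    # I(S; A) ~= mean_i KL( pi(. | s_i) || mean_j pi(. | s_j) ).
    marginal = action_probs.mean(dim=0, keepdim=True)
    kl = (action_probs * (torch.log(action_probs + eps)
                          - torch.log(marginal + eps))).sum(dim=1)
    return kl.mean()

def private_policy_loss(policy_loss, action_probs, lam_mi=0.1):
    # Trade off task performance against information the actions leak about S.
    return policy_loss + lam_mi * mi_penalty(action_probs)
```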
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.