Achieving Counterfactual Fairness for Causal Bandit
- URL: http://arxiv.org/abs/2109.10458v1
- Date: Tue, 21 Sep 2021 23:44:48 GMT
- Title: Achieving Counterfactual Fairness for Causal Bandit
- Authors: Wen Huang, Lu Zhang, Xintao Wu
- Abstract summary: We study how to recommend an item at each step to maximize the expected reward.
We then propose the fair causal bandit algorithm (F-UCB) for achieving counterfactual individual fairness.
- Score: 18.077963117600785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In online recommendation, customers arrive in a sequential and stochastic
manner from an underlying distribution and the online decision model recommends
a chosen item for each arriving individual based on some strategy. We study how
to recommend an item at each step to maximize the expected reward while
achieving user-side fairness for customers, i.e., customers who share similar
profiles will receive a similar reward regardless of their sensitive attributes
and items being recommended. By incorporating causal inference into bandits and
adopting soft intervention to model the arm selection strategy, we first
propose the d-separation based UCB algorithm (D-UCB) to explore the utilization
of the d-separation set in reducing the amount of exploration needed to achieve
low cumulative regret. Based on that, we then propose the fair causal bandit
(F-UCB) for achieving counterfactual individual fairness. Both theoretical
analysis and empirical evaluation demonstrate the effectiveness of our algorithms.
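As a rough illustration of the bandit loop underlying these algorithms, here is a minimal Python sketch of UCB arm selection with a per-round fairness filter. The d-separation machinery of D-UCB and the paper's exact counterfactual-fairness criterion are not reproduced; the tolerance `tau` and the per-arm counterfactual gap estimates are hypothetical stand-ins.

```python
import numpy as np

# Illustrative sketch only: standard UCB with a per-round "fair set"
# filter, loosely in the spirit of F-UCB. The fairness check
# (counterfactual reward gap below a tolerance tau) is a hypothetical
# stand-in for the paper's counterfactual-fairness criterion.

def ucb_scores(counts, sums, t, c=2.0):
    """Upper confidence bounds for each arm at round t (1-indexed)."""
    means = sums / np.maximum(counts, 1)
    bonus = np.sqrt(c * np.log(t) / np.maximum(counts, 1))
    return means + bonus

def fair_ucb_step(counts, sums, t, cf_gap, tau=0.1):
    """Pick the highest-UCB arm among arms whose estimated
    counterfactual reward gap is within tolerance tau."""
    scores = ucb_scores(counts, sums, t)
    fair = cf_gap <= tau                # hypothetical fairness filter
    if not fair.any():                  # fall back if no arm passes
        fair = np.ones_like(fair, dtype=bool)
    return int(np.argmax(np.where(fair, scores, -np.inf)))

rng = np.random.default_rng(0)
true_means = np.array([0.3, 0.5, 0.7])
cf_gap = np.array([0.05, 0.2, 0.08])    # toy per-arm counterfactual gaps
counts, sums = np.zeros(3), np.zeros(3)
for t in range(1, 501):
    a = fair_ucb_step(counts, sums, t, cf_gap)
    counts[a] += 1
    sums[a] += rng.binomial(1, true_means[a])
print(counts)  # pulls concentrate on the best arm within the fair set
```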
Related papers
- Meta Clustering of Neural Bandits [45.77505279698894]
We study a new problem, Clustering of Neural Bandits, by extending previous work to arbitrary reward functions.
We propose a novel algorithm called M-CNB, which utilizes a meta-learner to represent and rapidly adapt to dynamic clusters.
In extensive experiments conducted in both recommendation and online classification scenarios, M-CNB outperforms SOTA baselines.
arXiv Detail & Related papers (2024-08-10T16:09:51Z)
- Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Language model (LM) post-training involves maximizing a reward function that is derived from preference annotations.
DPO is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning.
We analyze this phenomenon and propose distillation to get a better proxy for the true preference distribution over generation pairs.
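Since the summary above leans on DPO, here is a minimal numpy sketch of the DPO objective on a single preference pair; the log-probabilities would come from real policy and reference language models, and the placeholder numbers are illustrative only.

```python
import numpy as np

# Minimal sketch of the DPO objective on one preference pair.
# logp_* would come from summing token log-probs of real policy and
# reference models; here they are just placeholder numbers.

def dpo_loss(logp_w_pi, logp_l_pi, logp_w_ref, logp_l_ref, beta=0.1):
    """-log sigmoid(beta * (policy margin - reference margin))."""
    margin = (logp_w_pi - logp_l_pi) - (logp_w_ref - logp_l_ref)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# Toy numbers: the policy prefers the chosen response more strongly
# than the reference does, so the loss drops below log(2).
print(dpo_loss(logp_w_pi=-12.0, logp_l_pi=-15.0,
               logp_w_ref=-13.0, logp_l_ref=-14.0))
```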
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
- Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
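A hedged sketch of the combined objective described above, assuming the preference loss is DPO-style and the supervised term is the negative log-likelihood of the chosen response; the weighting coefficient `lam` and this exact composition are assumptions, not the paper's formulation.

```python
import numpy as np

# Hedged sketch of a combined objective: a preference loss plus an
# SFT (maximum-likelihood) term on the chosen response. The weight
# `lam` and the exact composition are illustrative assumptions.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def regularized_loss(logp_w_pi, logp_l_pi, logp_w_ref, logp_l_ref,
                     beta=0.1, lam=0.5):
    pref = -np.log(sigmoid(beta * ((logp_w_pi - logp_l_pi)
                                   - (logp_w_ref - logp_l_ref))))
    sft = -logp_w_pi          # negative log-likelihood of chosen response
    return pref + lam * sft   # SFT term acts as a regularizer

print(regularized_loss(-12.0, -15.0, -13.0, -14.0))
```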
arXiv Detail & Related papers (2024-05-26T05:38:50Z)
- Aligning Large Language Models by On-Policy Self-Judgment [49.31895979525054]
Existing approaches for aligning large language models with human preferences face a trade-off: on-policy learning requires a separate reward model (RM).
We present a novel alignment framework, SELF-JUDGE, that does on-policy learning and is parameter efficient.
We show that rejection sampling by itself can further improve performance without an additional evaluator.
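A toy sketch of best-of-n rejection sampling in which the same model that generates also judges, loosely in the spirit of on-policy self-judgment; `generate` and `self_judge` are hypothetical placeholders for real LM sampling and scoring calls.

```python
import numpy as np

# Toy sketch of best-of-n rejection sampling where the model judges
# its own candidates. `generate` and `self_judge` are hypothetical
# placeholders for real LM sampling and scoring calls.

rng = np.random.default_rng(0)

def generate(prompt, n=8):
    # Stand-in for sampling n candidate responses from the policy.
    return [f"{prompt} -> candidate {i}" for i in range(n)]

def self_judge(prompt, response):
    # Stand-in for the policy scoring its own response
    # (e.g., log-prob of a "this answer is good" judgment token).
    return rng.normal()

def best_of_n(prompt, n=8):
    candidates = generate(prompt, n)
    scores = [self_judge(prompt, c) for c in candidates]
    return candidates[int(np.argmax(scores))]

print(best_of_n("Explain UCB in one sentence."))
```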
arXiv Detail & Related papers (2024-02-17T11:25:26Z)
- Fairness via Adversarial Attribute Neighbourhood Robust Learning [49.93775302674591]
We propose a principled Robust Adversarial Attribute Neighbourhood (RAAN) loss to debias the classification head.
arXiv Detail & Related papers (2022-10-12T23:39:28Z)
- Recommendation Systems with Distribution-Free Reliability Guarantees [83.80644194980042]
We show how to return a set of items rigorously guaranteed to contain mostly good items.
Our procedure endows any ranking model with rigorous finite-sample control of the false discovery rate.
We evaluate our methods on the Yahoo! Learning to Rank and MSMarco datasets.
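As a simplified illustration of finite-sample FDR control, here is a sketch that picks the largest recommendation set whose empirical false discovery rate on held-out labels stays below a target alpha; this is a stand-in under strong assumptions, not the paper's exact distribution-free procedure.

```python
import numpy as np

# Simplified sketch: choose the largest recommendation set whose
# estimated false discovery rate (share of "bad" items) stays below
# alpha, using held-out labels. Illustrative stand-in only.

def fdr_threshold(scores, labels, alpha=0.2):
    """Lowest score cutoff whose selected set has empirical
    FDR <= alpha (labels: 1 = good item, 0 = bad)."""
    order = np.argsort(-scores)          # best-scored items first
    sorted_labels = labels[order]
    fdr = np.cumsum(1 - sorted_labels) / np.arange(1, len(labels) + 1)
    valid = np.where(fdr <= alpha)[0]
    if len(valid) == 0:
        return np.inf                    # recommend nothing
    k = valid.max() + 1                  # largest admissible set size
    return scores[order][k - 1]

rng = np.random.default_rng(1)
scores = rng.random(100)
labels = (rng.random(100) < scores).astype(int)  # better score, more good
cut = fdr_threshold(scores, labels, alpha=0.2)
print("threshold:", cut, "set size:", int((scores >= cut).sum()))
```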
arXiv Detail & Related papers (2022-07-04T17:49:25Z)
- The Unfairness of Active Users and Popularity Bias in Point-of-Interest Recommendation [4.578469978594752]
This paper studies the interplay between (i) the unfairness of active users, (ii) the unfairness of popular items, and (iii) the accuracy of recommendation as three angles of our study triangle.
For item fairness, we divide items into short-head, mid-tail, and long-tail groups and study the exposure of these item groups in the top-k recommendation lists of users.
Our study shows that most recommendation models cannot satisfy both consumer and producer fairness, indicating a trade-off between these objectives, possibly due to natural biases in the data.
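A small sketch of the item-side analysis described above: split items into short-head, mid-tail, and long-tail groups by popularity and measure each group's share of slots in users' top-k lists; the 20%/60%/20% group cut-offs are illustrative assumptions.

```python
import numpy as np

# Sketch of the item-side exposure analysis: popularity-based item
# groups and their share of top-k recommendation slots.
# The 20%/60%/20% cut-offs are illustrative assumptions.

def group_exposure(topk_lists, popularity, head=0.2, mid=0.6):
    n_items = len(popularity)
    rank = np.argsort(-popularity)           # most popular first
    group = np.empty(n_items, dtype=object)
    group[rank[: int(head * n_items)]] = "short-head"
    group[rank[int(head * n_items): int((head + mid) * n_items)]] = "mid-tail"
    group[rank[int((head + mid) * n_items):]] = "long-tail"
    flat = np.concatenate(topk_lists)        # all recommended slots
    return {g: float(np.mean(group[flat] == g))
            for g in ("short-head", "mid-tail", "long-tail")}

rng = np.random.default_rng(2)
popularity = rng.zipf(1.5, size=1000).astype(float)
topk_lists = [rng.choice(1000, size=10, replace=False) for _ in range(50)]
print(group_exposure(topk_lists, popularity))
```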
arXiv Detail & Related papers (2022-02-27T08:02:19Z)
- Bias-Robust Bayesian Optimization via Dueling Bandit [57.82422045437126]
We consider Bayesian optimization in settings where observations can be adversarially biased.
We propose a novel approach for dueling bandits based on information-directed sampling (IDS).
Thereby, we obtain the first efficient kernelized algorithm for dueling bandits that comes with cumulative regret guarantees.
arXiv Detail & Related papers (2021-05-25T10:08:41Z)
- Continuous Mean-Covariance Bandits [39.820490484375156]
We propose a novel Continuous Mean-Covariance Bandit (CMCB) model that takes option correlation into account.
In CMCB, a learner sequentially chooses weight vectors over the given options and observes random feedback according to these decisions.
We propose novel algorithms with optimal regret (within logarithmic factors) and provide matching lower bounds to validate their optimality.
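A toy sketch of the per-round decision in a mean-covariance setting: among candidate weight vectors on the simplex, pick the one with the best estimated mean-minus-risk trade-off; the random candidate search and the risk coefficient `rho` are assumptions, not the paper's optimal-regret algorithm.

```python
import numpy as np

# Toy mean-covariance decision: among candidate weight vectors on the
# simplex, maximize estimated mean minus a covariance risk penalty.
# Random candidate search and `rho` are illustrative assumptions.

def pick_weights(mu_hat, sigma_hat, rho=1.0, n_candidates=5000, seed=0):
    rng = np.random.default_rng(seed)
    w = rng.dirichlet(np.ones(len(mu_hat)), size=n_candidates)
    objective = w @ mu_hat - rho * np.einsum("ij,jk,ik->i", w, sigma_hat, w)
    return w[np.argmax(objective)]

mu_hat = np.array([0.5, 0.6, 0.55])
sigma_hat = np.array([[0.04, 0.01, 0.00],
                      [0.01, 0.09, 0.02],
                      [0.00, 0.02, 0.05]])
print(pick_weights(mu_hat, sigma_hat))  # tilts toward high mean, low risk
```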
arXiv Detail & Related papers (2021-02-24T06:37:05Z)
- Causality-Aware Neighborhood Methods for Recommender Systems [3.0919302844782717]
Business objectives of recommenders, such as increasing sales, are aligned with the causal effect of recommendations.
Previous recommenders employ inverse propensity scoring (IPS) from causal inference.
We develop robust ranking methods for the causal effect of recommendations.
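A minimal sketch of inverse propensity scoring for the average causal effect of recommending an item: reweight observed outcomes by the propensity of the treatment actually received; propensities are assumed known here, whereas real recommenders must estimate them.

```python
import numpy as np

# Minimal IPS sketch for the average causal effect of recommendation.
# Propensities are assumed known here; real systems must estimate them.

def ips_effect(outcome, treated, propensity):
    """E[Y(1)] - E[Y(0)] via IPS (treated: 1 if recommended)."""
    y1 = np.mean(treated * outcome / propensity)
    y0 = np.mean((1 - treated) * outcome / (1 - propensity))
    return y1 - y0

rng = np.random.default_rng(3)
n = 100_000
propensity = rng.uniform(0.1, 0.9, n)
treated = (rng.random(n) < propensity).astype(float)
outcome = rng.binomial(1, 0.2 + 0.1 * treated)  # true uplift = 0.1
print(ips_effect(outcome, treated, propensity))  # close to 0.1
```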
arXiv Detail & Related papers (2020-12-17T08:23:17Z)
- Achieving User-Side Fairness in Contextual Bandits [17.947543703195738]
We study how to achieve user-side fairness in personalized recommendation.
We formulate our fair personalized recommendation as a modified contextual bandit.
We develop a fair contextual bandit algorithm, Fair-LinUCB, that improves upon the traditional LinUCB algorithm.
arXiv Detail & Related papers (2020-10-22T22:58:25Z)
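For context on the Fair-LinUCB entry above, here is a compact sketch of the standard LinUCB update it builds on; the fairness-aware modification itself is not reproduced.

```python
import numpy as np

# Compact sketch of the standard LinUCB update that Fair-LinUCB
# builds on; the fairness-aware modification is not reproduced here.

class LinUCB:
    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha
        self.A = np.stack([np.eye(dim)] * n_arms)  # per-arm design matrix
        self.b = np.zeros((n_arms, dim))           # per-arm reward vector

    def select(self, x):
        """Pick the arm with the highest UCB for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

rng = np.random.default_rng(4)
bandit = LinUCB(n_arms=3, dim=5)
for _ in range(200):
    x = rng.normal(size=5)
    a = bandit.select(x)
    bandit.update(a, x, reward=float(x[a] > 0))  # toy context-linked reward
print(bandit.A.sum(axis=(1, 2)))
```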
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.