FairDICE: Fairness-Driven Offline Multi-Objective Reinforcement Learning
- URL: http://arxiv.org/abs/2506.08062v1
- Date: Mon, 09 Jun 2025 09:40:11 GMT
- Title: FairDICE: Fairness-Driven Offline Multi-Objective Reinforcement Learning
- Authors: Woosung Kim, Jinho Lee, Jongmin Lee, Byung-Jun Lee,
- Abstract summary: We present FairDICE, the first offline MORL framework that directly optimizes a nonlinear welfare objective. Across multiple offline benchmarks, FairDICE demonstrates strong fairness-aware performance compared to existing baselines.
- Score: 13.825782649016851
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-objective reinforcement learning (MORL) aims to optimize policies in the presence of conflicting objectives, where linear scalarization is commonly used to reduce vector-valued returns into scalar signals. While effective for certain preferences, this approach cannot capture fairness-oriented goals such as Nash social welfare or max-min fairness, which require nonlinear and non-additive trade-offs. Although several online algorithms have been proposed for specific fairness objectives, a unified approach for optimizing nonlinear welfare criteria in the offline setting, where learning must proceed from a fixed dataset, remains unexplored. In this work, we present FairDICE, the first offline MORL framework that directly optimizes a nonlinear welfare objective. FairDICE leverages distribution correction estimation to jointly account for welfare maximization and distributional regularization, enabling stable and sample-efficient learning without requiring explicit preference weights or an exhaustive weight search. Across multiple offline benchmarks, FairDICE demonstrates strong fairness-aware performance compared to existing baselines.
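The abstract's core claim — that linear scalarization cannot capture fairness-oriented goals such as Nash social welfare or max-min fairness — can be illustrated with a minimal sketch. The numbers below are hypothetical vector returns for two policies, not from the paper; they show how a fixed-weight linear combination can prefer a lopsided policy that every fairness criterion rejects.

```python
import numpy as np

# Hypothetical vector returns of two policies over two objectives.
returns_a = np.array([10.0, 0.1])   # lopsided: strong on objective 1, near-zero on 2
returns_b = np.array([4.0, 4.0])    # balanced across both objectives

def linear_scalarization(r, w):
    """Reduce a vector return to a scalar with a fixed preference weight w."""
    return float(np.dot(w, r))

def nash_social_welfare(r):
    """Nash social welfare: product of returns, computed as a sum of logs."""
    return float(np.sum(np.log(r)))

def max_min_fairness(r):
    """Max-min fairness: the return of the worst-off objective."""
    return float(np.min(r))

w = np.array([0.5, 0.5])
# With equal weights, linear scalarization prefers the lopsided policy...
assert linear_scalarization(returns_a, w) > linear_scalarization(returns_b, w)
# ...while both fairness-oriented criteria prefer the balanced one.
assert nash_social_welfare(returns_b) > nash_social_welfare(returns_a)
assert max_min_fairness(returns_b) > max_min_fairness(returns_a)
```

Because Nash social welfare is a product (non-additive) and max-min is a minimum (non-smooth), no fixed weight vector makes a linear scalarization reproduce these preference orderings in general, which motivates optimizing the nonlinear welfare objective directly as FairDICE does.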
Related papers
- Generalized Linear Bandits: Almost Optimal Regret with One-Pass Update [60.414548453838506]
We study the generalized linear bandit (GLB) problem, a contextual multi-armed bandit framework that extends the classical linear model by incorporating a non-linear link function. GLBs are widely applicable to real-world scenarios, but their non-linear nature introduces significant challenges in achieving both computational and statistical efficiency. We propose a jointly efficient algorithm that attains a nearly optimal regret bound with $\mathcal{O}(1)$ time and space complexities per round.
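As a small sketch of the GLB reward model described above: the expected reward of an arm with feature vector $x$ is $\mu(\theta^\top x)$ for a non-linear link $\mu$. The parameter values and arms below are hypothetical, using the logistic link as a common example.

```python
import numpy as np

def sigmoid(z):
    """Logistic link function, a standard non-linear link in GLBs."""
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -0.25])          # hypothetical unknown parameter
arms = np.array([[1.0, 0.0],            # candidate arm feature vectors
                 [0.0, 1.0],
                 [1.0, 1.0]])

# Expected reward of each arm: mu(theta^T x) with the logistic link.
expected_rewards = sigmoid(arms @ theta)
best_arm = int(np.argmax(expected_rewards))
```

Because the link is monotone, ranking arms by $\theta^\top x$ or by $\mu(\theta^\top x)$ gives the same best arm; the statistical difficulty the summary refers to comes from estimating $\theta$ through the non-linearity.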
arXiv Detail & Related papers (2025-07-16T02:24:21Z) - Generalized Linear Markov Decision Process [9.219628236765933]
The Generalized Linear MDP (GLMDP) framework models rewards using generalized linear models (GLMs). We develop two offline RL algorithms: Generalized Pessimistic Value Iteration (GPEVI) and a semi-supervised variant (SS-GPEVI). Our algorithms achieve theoretical guarantees on policy suboptimality and demonstrate improved sample efficiency in settings where reward labels are expensive or limited.
arXiv Detail & Related papers (2025-06-01T03:50:41Z) - LEASE: Offline Preference-based Reinforcement Learning with High Sample Efficiency [11.295036269748731]
This paper proposes an offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm to generate unlabeled preference data. Considering that the pretrained reward model may generate incorrect labels for unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of the reward model.
arXiv Detail & Related papers (2024-12-30T15:10:57Z) - Correcting the Mythos of KL-Regularization: Direct Alignment without Overoptimization via Chi-Squared Preference Optimization [78.82586283794886]
$\chi^2$-Preference Optimization ($\chi$PO) is an efficient offline alignment algorithm provably robust to overoptimization. $\chi$PO implements the principle of pessimism in the face of uncertainty via regularization. $\chi$PO's simplicity and strong guarantees make it the first practical and general-purpose offline alignment algorithm provably robust to overoptimization.
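The contrast the title draws between KL and $\chi^2$ regularization can be sketched numerically. The distributions below are hypothetical; the point is a standard fact about the two divergences: $\chi^2(p\,\|\,q) = \sum_i (p_i - q_i)^2 / q_i$ penalizes mass placed on low-probability (out-of-distribution) actions quadratically in the ratio $p_i/q_i$, so it grows faster than KL for the same deviation.

```python
import numpy as np

# Hypothetical reference (behavior) distribution q and learned policy p.
q = np.array([0.50, 0.45, 0.05])
p = np.array([0.30, 0.30, 0.40])   # p puts 8x more mass than q on the rare action

# KL divergence: sum_i p_i * log(p_i / q_i)
kl = float(np.sum(p * np.log(p / q)))

# Chi-squared divergence: sum_i (p_i - q_i)^2 / q_i
chi2 = float(np.sum((p - q) ** 2 / q))

# The rare third action dominates the chi^2 penalty, making chi^2
# a heavier regularizer against out-of-distribution mass than KL.
```

Under these numbers the $\chi^2$ penalty exceeds the KL penalty several-fold, which is the intuition behind using it as a stronger form of pessimism; the specific algorithmic use in $\chi$PO is as described in the paper, not in this sketch.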
arXiv Detail & Related papers (2024-07-18T11:08:40Z) - Self-Play with Adversarial Critic: Provable and Scalable Offline Alignment for Language Models [44.38073745307387]
This work studies the challenge of aligning large language models (LLMs) with offline preference data.
We propose SPAC, a new offline preference optimization method with self-play, inspired by the on-average pessimism technique from the offline RL literature.
arXiv Detail & Related papers (2024-06-06T17:23:49Z) - Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [80.32171988565999]
We introduce a unified approach to online and offline RLHF: value-incentivized preference optimization (VPO). VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function. Experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
arXiv Detail & Related papers (2024-05-29T17:51:42Z) - Optimal Baseline Corrections for Off-Policy Contextual Bandits [61.740094604552475]
We aim to learn decision policies that optimize an unbiased offline estimate of an online reward metric.
We propose a single framework built on their equivalence in learning scenarios.
Our framework enables us to characterize the variance-optimal unbiased estimator and provide a closed-form solution for it.
arXiv Detail & Related papers (2024-05-09T12:52:22Z) - Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data [102.16105233826917]
Learning from preference labels plays a crucial role in fine-tuning large language models.
There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning.
arXiv Detail & Related papers (2024-04-22T17:20:18Z) - Goal-conditioned Offline Reinforcement Learning through State Space Partitioning [9.38848713730931]
Offline reinforcement learning (RL) aims to infer sequential decision policies using only offline datasets.
We argue that, despite its benefits, this approach is still insufficient to fully address the distribution shift and multi-modality problems.
We propose a complementary advantage-based weighting scheme that introduces an additional source of inductive bias.
arXiv Detail & Related papers (2023-03-16T14:52:53Z) - Offline Reinforcement Learning with Adaptive Behavior Regularization [1.491109220586182]
Offline reinforcement learning (RL) defines a sample-efficient learning paradigm, where a policy is learned from static and previously collected datasets.
We propose a novel approach, which we refer to as adaptive behavior regularization (ABR).
ABR enables the policy to adaptively adjust its optimization objective between cloning and improving over the policy used to generate the dataset.
arXiv Detail & Related papers (2022-11-15T15:59:11Z) - False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z) - Where is the Grass Greener? Revisiting Generalized Policy Iteration for Offline Reinforcement Learning [81.15016852963676]
We re-implement state-of-the-art baselines in the offline RL regime under a fair, unified, and highly factorized framework.
We show that when a given baseline outperforms its competing counterparts on one end of the spectrum, it never does on the other end.
arXiv Detail & Related papers (2021-07-03T11:00:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the listed information and is not responsible for any consequences.