Related papers: In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning

In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning

URL: http://arxiv.org/abs/2412.09104v1
Date: Thu, 12 Dec 2024 09:35:47 GMT
Title: In-Dataset Trajectory Return Regularization for Offline Preference-based Reinforcement Learning
Authors: Songjun Tu, Jingbo Sun, Qichao Zhang, Yaocheng Zhang, Jia Liu, Ke Chen, Dongbin Zhao,
Abstract summary: We propose In-Dataset Trajectory Return Regularization (DTR) for offline preference-based reinforcement learning.<n>DTR mitigates the risk of learning inaccurate trajectory stitching under reward bias.<n>We also introduce an ensemble normalization technique that effectively integrates multiple reward models.
Score: 15.369324784520538
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Offline preference-based reinforcement learning (PbRL) typically operates in two phases: first, use human preferences to learn a reward model and annotate rewards for a reward-free offline dataset; second, learn a policy by optimizing the learned reward via offline RL. However, accurately modeling step-wise rewards from trajectory-level preference feedback presents inherent challenges. The reward bias introduced, particularly the overestimation of predicted rewards, leads to optimistic trajectory stitching, which undermines the pessimism mechanism critical to the offline RL phase. To address this challenge, we propose In-Dataset Trajectory Return Regularization (DTR) for offline PbRL, which leverages conditional sequence modeling to mitigate the risk of learning inaccurate trajectory stitching under reward bias. Specifically, DTR employs Decision Transformer and TD-Learning to strike a balance between maintaining fidelity to the behavior policy with high in-dataset trajectory returns and selecting optimal actions based on high reward labels. Additionally, we introduce an ensemble normalization technique that effectively integrates multiple reward models, balancing the tradeoff between reward differentiation and accuracy. Empirical evaluations on various benchmarks demonstrate the superiority of DTR over other state-of-the-art baselines

Related papers

Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers [55.33468902405567]
We propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback.<n>ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
arXiv Detail & Related papers (2026-02-09T03:42:16Z)
In-Context Reinforcement Learning From Suboptimal Historical Data [56.60512975858003]
Transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities.<n>We propose the Decision Importance Transformer framework, which emulates the actor-critic algorithm in an in-context manner.<n>Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.
arXiv Detail & Related papers (2026-01-27T23:13:06Z)
Double Check My Desired Return: Transformer with Target Alignment for Offline Reinforcement Learning [64.6334337560557]
Reinforcement learning via supervised learning (RvS) frames offline RL as a sequence modeling task.<n>Decision Transformer (DT) struggles to reliably align the actual achieved returns with specified target returns.<n>We propose Doctor, a novel approach that Double Checks the Transformer with target alignment for Offline RL.
arXiv Detail & Related papers (2025-08-22T14:30:53Z)
Your Offline Policy is Not Trustworthy: Bilevel Reinforcement Learning for Sequential Portfolio Optimization [82.03139922490796]
Reinforcement learning (RL) has shown significant promise for sequential portfolio optimization tasks, such as stock trading, where the objective is to maximize cumulative returns while minimizing risks using historical data.<n>Traditional RL approaches often produce policies that merely memorize the optimal yet impractical buying and selling behaviors within the fixed dataset.<n>Our approach frames portfolio optimization as a new type of partial-offline RL problem and makes two technical contributions.
arXiv Detail & Related papers (2025-05-19T06:37:25Z)
Diffusion Classifier-Driven Reward for Offline Preference-based Reinforcement Learning [45.95668702930697]
We propose a novel preference-based reward acquisition method: Diffusion Preference-based Reward (DPR)<n>DPR directly treats step-wise preference-based reward acquisition as a binary classification and utilizes the robustness of diffusion classifiers to infer step-wise rewards discriminatively.<n>We also propose Diffusion Preference-based Reward (C-DPR), which conditions on trajectory-wise preference labels to enhance reward inference.
arXiv Detail & Related papers (2025-03-03T03:49:38Z)
Best Policy Learning from Trajectory Preference Feedback [15.799929216215672]
We address the problem of best policy identification in preference-based reinforcement learning (PbRL) We propose Posterior Sampling for Preference Learning ($mathsfPSPL$), a novel algorithm inspired by Top-Two Thompson Sampling. We provide the first theoretical guarantees for PbRL in this setting, establishing an upper bound on the simple Bayesian regret.
arXiv Detail & Related papers (2025-01-31T03:55:10Z)
LEASE: Offline Preference-based Reinforcement Learning with High Sample Efficiency [11.295036269748731]
This paper proposes a offLine prEference-bAsed RL with high Sample Efficiency (LEASE) algorithm to generate unlabeled preference data. Considering the pretrained reward model may generate incorrect labels for unlabeled data, we design an uncertainty-aware mechanism to ensure the performance of reward model.
arXiv Detail & Related papers (2024-12-30T15:10:57Z)
Solving the Inverse Alignment Problem for Efficient RLHF [0.0]
We define the 'inverse alignment problem' in language model training. We investigate whether repeatedly fine-tuning a reward model on subsets of the offline preference dataset aligned with a periodically frozen policy improves upon vanilla RLHF.
arXiv Detail & Related papers (2024-12-13T19:47:38Z)
Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning [5.398202201395825]
Decision Transformer (DT) has demonstrated exceptional capabilities in offline reinforcement learning. Decision ConvFormer (DC) is easier to understand in the context of modeling RL trajectories within a Markov Decision Process. We propose the Q-value Regularized Decision ConvFormer (QDC), which combines the understanding of RL trajectories by DC and incorporates a term that maximizes action values.
arXiv Detail & Related papers (2024-09-12T14:10:22Z)
Listwise Reward Estimation for Offline Preference-based Reinforcement Learning [20.151932308777553]
Listwise Reward Estimation (LiRE) is a novel approach for offline Preference-based Reinforcement Learning (PbRL) LiRE builds on existing PbRL methods by constructing a Ranked List of Trajectories (RLT) Our experiments demonstrate the superiority of LiRE, even with modest feedback budgets and enjoying robustness with respect to the number of feedbacks and feedback noise.
arXiv Detail & Related papers (2024-08-08T03:18:42Z)
ROLeR: Effective Reward Shaping in Offline Reinforcement Learning for Recommender Systems [14.74207332728742]
offline reinforcement learning (RL) is an effective tool for real-world recommender systems. This paper proposes a novel model-based Reward Shaping in Offline Reinforcement Learning for Recommender Systems, ROLeR, for reward and uncertainty estimation.
arXiv Detail & Related papers (2024-07-18T05:07:11Z)
Hindsight Preference Learning for Offline Preference-based Reinforcement Learning [22.870967604847458]
Offline preference-based reinforcement learning (RL) focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset. We propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments. Our proposed method, Hindsight Preference Learning (HPL), can facilitate credit assignment by taking full advantage of vast trajectory data available in massive unlabeled datasets.
arXiv Detail & Related papers (2024-07-05T12:05:37Z)
Preference Elicitation for Offline Reinforcement Learning [59.136381500967744]
We propose Sim-OPRL, an offline preference-based reinforcement learning algorithm. Our algorithm employs a pessimistic approach for out-of-distribution data, and an optimistic approach for acquiring informative preferences about the optimal policy.
arXiv Detail & Related papers (2024-06-26T15:59:13Z)
Value-Incentivized Preference Optimization: A Unified Approach to Online and Offline RLHF [80.32171988565999]
We introduce a unified approach to online and offline RLHF -- value-incentivized preference optimization (VPO) VPO regularizes the maximum-likelihood estimate of the reward function with the corresponding value function. Experiments on text summarization and dialog verify the practicality and effectiveness of VPO.
arXiv Detail & Related papers (2024-05-29T17:51:42Z)
Robust Preference Optimization through Reward Model Distillation [68.65844394615702]
Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data. We analyze this phenomenon and use distillation to get a better proxy for the true preference distribution over generation pairs. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations.
arXiv Detail & Related papers (2024-05-29T17:39:48Z)
Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state-of-the-art in offline reinforcement learning (RL) QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM) Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
arXiv Detail & Related papers (2024-05-27T12:12:39Z)
Benchmarks and Algorithms for Offline Preference-Based Reward Learning [41.676208473752425]
We propose an approach that uses an offline dataset to craft preference queries via pool-based active learning. Our proposed approach does not require actual physical rollouts or an accurate simulator for either the reward learning or policy optimization steps.
arXiv Detail & Related papers (2023-01-03T23:52:16Z)
Offline Meta-Reinforcement Learning with Online Self-Supervision [66.42016534065276]
We propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy. Our method uses the offline data to learn the distribution of reward functions, which is then sampled to self-supervise reward labels for the additional online data. We find that using additional data and self-generated rewards significantly improves an agent's ability to generalize.
arXiv Detail & Related papers (2021-07-08T17:01:32Z)
Where is the Grass Greener? Revisiting Generalized Policy Iteration for Offline Reinforcement Learning [81.15016852963676]
We re-implement state-of-the-art baselines in the offline RL regime under a fair, unified, and highly factorized framework. We show that when a given baseline outperforms its competing counterparts on one end of the spectrum, it never does on the other end.
arXiv Detail & Related papers (2021-07-03T11:00:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.