Related papers: Deep Reinforcement Learning from Hierarchical Preference Design

Deep Reinforcement Learning from Hierarchical Preference Design

URL: http://arxiv.org/abs/2309.02632v3
Date: Mon, 10 Jun 2024 13:22:42 GMT
Title: Deep Reinforcement Learning from Hierarchical Preference Design
Authors: Alexander Bukharin, Yixiao Li, Pengcheng He, Tuo Zhao,
Abstract summary: This paper shows by exploiting certain structures, one can ease the reward design process. We propose a hierarchical reward modeling framework -- HERON for scenarios: (I) The feedback signals naturally present hierarchy; (II) The reward is sparse, but with less important surrogate feedback to help policy learning.
Score: 99.46415116087259
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Reward design is a fundamental, yet challenging aspect of reinforcement learning (RL). Researchers typically utilize feedback signals from the environment to handcraft a reward function, but this process is not always effective due to the varying scale and intricate dependencies of the feedback signals. This paper shows by exploiting certain structures, one can ease the reward design process. Specifically, we propose a hierarchical reward modeling framework -- HERON for scenarios: (I) The feedback signals naturally present hierarchy; (II) The reward is sparse, but with less important surrogate feedback to help policy learning. Both scenarios allow us to design a hierarchical decision tree induced by the importance ranking of the feedback signals to compare RL trajectories. With such preference data, we can then train a reward model for policy learning. We apply HERON to several RL applications, and we find that our framework can not only train high performing agents on a variety of difficult tasks, but also provide additional benefits such as improved sample efficiency and robustness. Our code is available at \url{https://github.com/abukharin3/HERON}.

Related papers

Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains [8.143110220871614]
We introduce RaR, a framework that uses structured, checklist-style rubrics as interpretable reward signals.<n>By treating rubrics as structured reward signals, we show that RaR enables smaller-scale judge models to better align with human preferences.
arXiv Detail & Related papers (2025-07-23T17:57:55Z)
Intra-Trajectory Consistency for Reward Modeling [67.84522106537274]
We develop an intra-trajectory consistency regularization to enforce that adjacent processes with higher next-token generation probability maintain more consistent rewards.<n>We show that the reward model trained with the proposed regularization induces better DPO-aligned policies and achieves better best-of-N (BON) inference-time verification results.
arXiv Detail & Related papers (2025-06-10T12:59:14Z)
ToolRL: Reward is All Tool Learning Needs [54.16305891389931]
Large Language Models (LLMs) often undergo supervised fine-tuning (SFT) to acquire tool use capabilities. Recent advancements in reinforcement learning (RL) have demonstrated promising reasoning and generalization abilities. We present the first comprehensive study on reward design for tool selection and application tasks within the RL paradigm.
arXiv Detail & Related papers (2025-04-16T21:45:32Z)
Offline Reinforcement Learning with Imputed Rewards [8.856568375969848]
We propose a Reward Model that can estimate the reward signal from a very limited sample of environment transitions annotated with rewards. Our results show that, using only 1% of reward-labeled transitions from the original datasets, our learned reward model is able to impute rewards for the remaining 99% of the transitions.
arXiv Detail & Related papers (2024-07-15T15:53:13Z)
RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback [24.759613248409167]
Reward engineering has long been a challenge in Reinforcement Learning research. We propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks. We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains.
arXiv Detail & Related papers (2024-02-06T04:06:06Z)
Dense Reward for Free in Reinforcement Learning from Human Feedback [64.92448888346125]
We leverage the fact that the reward model contains more information than just its scalar output. We use these attention weights to redistribute the reward along the whole completion. Empirically, we show that it stabilises training, accelerates the rate of learning, and, in practical cases, may lead to better local optima.
arXiv Detail & Related papers (2024-02-01T17:10:35Z)
Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms. Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward. Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
Interpretable Preference-based Reinforcement Learning with Tree-Structured Reward Functions [2.741266294612776]
We propose an online, active preference learning algorithm that constructs reward functions with the intrinsically interpretable, compositional structure of a tree. We demonstrate sample-efficient learning of tree-structured reward functions in several environments, then harness the enhanced interpretability to explore and debug for alignment.
arXiv Detail & Related papers (2021-12-20T09:53:23Z)
Generative Adversarial Reward Learning for Generalized Behavior Tendency Inference [71.11416263370823]
We propose a generative inverse reinforcement learning for user behavioral preference modelling. Our model can automatically learn the rewards from user's actions based on discriminative actor-critic network and Wasserstein GAN.
arXiv Detail & Related papers (2021-05-03T13:14:25Z)
Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible. In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types. We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)
Self-Supervised Reinforcement Learning for Recommender Systems [77.38665506495553]
We propose self-supervised reinforcement learning for sequential recommendation tasks. Our approach augments standard recommendation models with two output layers: one for self-supervised learning and the other for RL. Based on such an approach, we propose two frameworks namely Self-Supervised Q-learning(SQN) and Self-Supervised Actor-Critic(SAC)
arXiv Detail & Related papers (2020-06-10T11:18:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.