Performance Optimization of Ratings-Based Reinforcement Learning
- URL: http://arxiv.org/abs/2501.07755v1
- Date: Mon, 13 Jan 2025 23:56:24 GMT
- Title: Performance Optimization of Ratings-Based Reinforcement Learning
- Authors: Evelyn Rose, Devin White, Mingkang Wu, Vernon Lawhern, Nicholas R. Waytowich, Yongcan Cao
- Abstract summary: This paper explores multiple optimization methods to improve the performance of rating-based reinforcement learning (RbRL).
RbRL has been developed to infer reward functions in reward-free environments for subsequent policy learning via standard reinforcement learning.
- Score: 1.6133809033337525
- Abstract: This paper explores multiple optimization methods to improve the performance of rating-based reinforcement learning (RbRL). RbRL, a method built on human ratings, infers reward functions in reward-free environments so that standard reinforcement learning, which requires a reward function, can then be used for policy learning. Specifically, RbRL minimizes a cross-entropy loss that quantifies the differences between human ratings and the estimated ratings derived from the inferred reward; a low loss therefore means a high degree of consistency between human and estimated ratings. Despite its simple form, RbRL has various hyperparameters and can be sensitive to them, so comprehensive experiments are needed to understand their impact on performance. This paper is a work in progress that provides users with general guidelines on how to select hyperparameters in RbRL.
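To make the loss concrete, here is a minimal sketch of a rating-based reward-learning objective in the spirit of the abstract. It is an illustration under stated assumptions, not the paper's exact formulation: the network shapes, NUM_RATINGS, and the linear head mapping segment returns to rating-class logits are all placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_RATINGS = 4          # assumed number of discrete human rating classes
OBS_DIM, ACT_DIM = 8, 2  # assumed environment dimensions

class RewardNet(nn.Module):
    """Maps (state, action) pairs to scalar per-step rewards."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(OBS_DIM + ACT_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )
        # Assumed head: maps a segment's cumulative inferred reward to
        # logits over the rating classes (the paper's construction differs).
        self.rating_head = nn.Linear(1, NUM_RATINGS)

    def rating_logits(self, obs, act):
        # obs: (batch, T, OBS_DIM), act: (batch, T, ACT_DIM)
        r = self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)  # (batch, T)
        segment_return = r.sum(dim=1, keepdim=True)              # (batch, 1)
        return self.rating_head(segment_return)                  # (batch, K)

def rating_loss(model, obs, act, human_ratings):
    """Cross entropy between human ratings and estimated ratings; a low
    loss means high consistency between the two."""
    return F.cross_entropy(model.rating_logits(obs, act), human_ratings)

# Toy usage: 16 segments of length 50 with random ratings.
model = RewardNet()
obs, act = torch.randn(16, 50, OBS_DIM), torch.randn(16, 50, ACT_DIM)
loss = rating_loss(model, obs, act, torch.randint(0, NUM_RATINGS, (16,)))
loss.backward()
```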
Related papers
- RbRL2.0: Integrated Reward and Policy Learning for Rating-based Reinforcement Learning [1.7095639309883044]
Standard reinforcement learning (RL) learns policies from collected experiences based on the associated cumulative returns/rewards, without treating those experiences differently.
This paper proposes a novel RL method that mimics humans' decision-making process by differentiating among collected experiences to achieve effective policy learning.
arXiv Detail & Related papers (2025-01-13T17:19:34Z)
- Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [0.9549646359252346]
In deep Reinforcement Learning (RL) models trained using gradient-based techniques, the choice of gradient-based optimizer and its learning rate is crucial to achieving good performance.
We propose dynamic Learning Rate for deep Reinforcement Learning (LRRL), a meta-learning approach that selects the learning rate based on the agent's performance during training.
arXiv Detail & Related papers (2024-10-16T14:15:28Z)
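As a rough illustration of the bandit idea in the LRRL entry above, here is an epsilon-greedy sketch. The candidate rates, the epsilon-greedy rule (LRRL's actual bandit algorithm may differ), and the return-improvement reward are assumptions; train_phase and evaluate are hypothetical stubs.

```python
import random

# Candidate learning rates (assumed arm set).
LEARNING_RATES = [1e-4, 3e-4, 1e-3]
counts = [0] * len(LEARNING_RATES)
values = [0.0] * len(LEARNING_RATES)  # running mean of observed improvement

def select_arm(eps=0.1):
    """Epsilon-greedy choice over the candidate learning rates."""
    if random.random() < eps:
        return random.randrange(len(LEARNING_RATES))
    return max(range(len(LEARNING_RATES)), key=lambda i: values[i])

def update_arm(i, improvement):
    """Credit arm i with the observed change in evaluation return."""
    counts[i] += 1
    values[i] += (improvement - values[i]) / counts[i]

# Skeleton of the outer loop; train_phase and evaluate are hypothetical stubs.
# for phase in range(num_phases):
#     arm = select_arm()
#     train_phase(agent, lr=LEARNING_RATES[arm])
#     new_return = evaluate(agent)
#     update_arm(arm, new_return - prev_return)
#     prev_return = new_return
```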
- Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation [37.36913210031282]
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering.
We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
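Of the two techniques the SEER entry mentions, label smoothing is easy to illustrate. Below is a minimal sketch of a smoothed preference cross-entropy; the eps value and the two-class pairwise setup are assumptions, and SEER's policy-regularization term is omitted.

```python
import torch
import torch.nn.functional as F

def smoothed_preference_loss(logits, prefs, eps=0.1):
    """logits: (N, 2) predicted segment-pair scores; prefs: (N,) in {0, 1}.
    The true class gets target 1 - eps/2 and the other eps/2, which
    dampens the effect of noisy preference labels."""
    num_classes = logits.size(-1)
    one_hot = F.one_hot(prefs, num_classes).float()
    target = (1 - eps) * one_hot + eps / num_classes
    return -(target * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()

# Toy usage on random pairwise scores and labels.
loss = smoothed_preference_loss(torch.randn(32, 2), torch.randint(0, 2, (32,)))
```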
- HRLAIF: Improvements in Helpfulness and Harmlessness in Open-domain Reinforcement Learning From AI Feedback [47.12549302721597]
We propose Hybrid Reinforcement Learning from AI Feedback (HRLAIF).
This method enhances the accuracy of AI annotations for responses, making the model's helpfulness more robust during training.
HRLAIF inherits the ability of RLAIF to enhance human preference for outcomes at a low cost while also improving the satisfaction rate of responses.
arXiv Detail & Related papers (2024-03-13T07:38:20Z)
- RIME: Robust Preference-based Reinforcement Learning with Noisy Preferences [23.414135977983953]
Preference-based Reinforcement Learning (PbRL) circumvents the need for reward engineering by harnessing human preferences as the reward signal.
We present RIME, a robust PbRL algorithm for effective reward learning from noisy preferences.
arXiv Detail & Related papers (2024-02-27T07:03:25Z)
- REBEL: Reward Regularization-Based Approach for Robotic Reinforcement Learning from Human Feedback [61.54791065013767]
A misalignment between the reward function and human preferences can lead to catastrophic outcomes in the real world.
Recent methods aim to mitigate misalignment by learning reward functions from human preferences.
We propose a novel concept of reward regularization within the robotic RLHF framework.
arXiv Detail & Related papers (2023-12-22T04:56:37Z)
- Provable Reward-Agnostic Preference-Based Reinforcement Learning [61.39541986848391]
Preference-based Reinforcement Learning (PbRL) is a paradigm in which an RL agent learns to optimize a task using pair-wise preference-based feedback over trajectories.
We propose a theoretical reward-agnostic PbRL framework where exploratory trajectories that enable accurate learning of hidden reward functions are acquired.
arXiv Detail & Related papers (2023-05-29T15:00:09Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward that measures novelty based on the learned reward.
Our experiments show that an exploration bonus derived from uncertainty in the learned reward improves both the feedback- and sample-efficiency of preference-based RL algorithms.
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
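The idea in the entry above can be sketched with a reward-model ensemble whose disagreement serves as the exploration bonus. The ensemble size, feature dimension, and std-based bonus here are assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn

# Assumed: an ensemble of 5 learned reward models over 10-d state-action features.
ensemble = [nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
            for _ in range(5)]

def intrinsic_reward(sa):
    """sa: (batch, 10) features -> (batch,) bonus; high ensemble
    disagreement marks regions the learned reward is uncertain about."""
    with torch.no_grad():
        preds = torch.stack([m(sa).squeeze(-1) for m in ensemble])  # (5, batch)
    return preds.std(dim=0)

# The agent would then optimize learned_reward(sa) + beta * intrinsic_reward(sa).
bonus = intrinsic_reward(torch.randn(32, 10))
```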
- Self-supervised Representation Learning with Relative Predictive Coding [102.93854542031396]
Relative Predictive Coding (RPC) is a new contrastive representation learning objective.
RPC maintains a good balance among training stability, minibatch size sensitivity, and downstream task performance.
We empirically verify the effectiveness of RPC on benchmark vision and speech self-supervised learning tasks.
arXiv Detail & Related papers (2021-03-21T01:04:24Z)
- Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We learn a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)
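IDRL scores candidate queries by how informative the expert's answer would be. A heavily simplified stand-in (not the paper's information-gain criterion) picks the query on which a reward-model ensemble disagrees most:

```python
import torch
import torch.nn as nn

# Assumed: a small ensemble of reward models over 10-d query features.
reward_models = [nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
                 for _ in range(5)]

def select_query(candidates):
    """candidates: (num_queries, 10). Return the index of the candidate
    whose predicted reward the ensemble disagrees on most."""
    with torch.no_grad():
        preds = torch.stack([m(candidates).squeeze(-1) for m in reward_models])
    return preds.std(dim=0).argmax().item()

# The selected query goes to the expert; the answer updates every model.
idx = select_query(torch.randn(100, 10))
```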
- Kalman meets Bellman: Improving Policy Evaluation through Value Tracking [59.691919635037216]
Policy evaluation is a key process in Reinforcement Learning (RL).
We devise an optimization method called Kalman Optimization for Value Approximation (KOVA).
KOVA minimizes a regularized objective function that accounts for both parameter uncertainty and noisy-return uncertainty.
arXiv Detail & Related papers (2020-02-17T13:30:43Z)
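A minimal sketch of the value-tracking idea behind the KOVA entry, reduced to linear value features and a textbook Kalman filter; the noise settings Q and R and the linear observation model are assumptions, not the paper's formulation.

```python
import numpy as np

d = 8                  # assumed feature dimension
w = np.zeros(d)        # value-function weights (the tracked state)
P = np.eye(d)          # parameter-uncertainty covariance
Q = 1e-4 * np.eye(d)   # process noise: lets the value drift as the policy changes
R = 1.0                # observation noise on sampled returns

def kova_style_update(phi, g):
    """phi: (d,) state features; g: observed noisy return from that state."""
    global w, P
    P = P + Q                     # predict: inflate parameter uncertainty
    S = phi @ P @ phi + R         # innovation variance (return uncertainty)
    K = P @ phi / S               # Kalman gain
    w = w + K * (g - phi @ w)     # correct the value estimate toward the return
    P = P - np.outer(K, phi) @ P  # shrink uncertainty along phi

# Toy update with a random feature vector and an observed return of 1.0.
kova_style_update(np.random.randn(d), 1.0)
```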
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the generated content (including all information above) and is not responsible for any consequences of its use.