Model-Augmented Q-learning
- URL: http://arxiv.org/abs/2102.03866v1
- Date: Sun, 7 Feb 2021 17:56:50 GMT
- Title: Model-Augmented Q-learning
- Authors: Youngmin Oh, Jinwoo Shin, Eunho Yang, Sung Ju Hwang
- Abstract summary: We propose an MFRL framework that is augmented with the components of model-based RL.
Specifically, we propose to estimate not only the $Q$-values but also both the transition and the reward with a shared network.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL), obtains a policy-invariant solution that is identical to the solution obtained by learning with the true reward.
- Score: 112.86795579978802
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, $Q$-learning has become indispensable for model-free
reinforcement learning (MFRL). However, it suffers from well-known problems
such as under- and overestimation bias of the value, which may adversely affect
policy learning. To resolve this issue, we propose an MFRL framework that is
augmented with the components of model-based RL. Specifically, we propose to
estimate not only the $Q$-values but also both the transition and the reward
with a shared network. We further utilize the estimated reward from the model
estimators for $Q$-learning, which promotes interaction between the estimators.
We show that the proposed scheme, called Model-augmented $Q$-learning (MQL),
obtains a policy-invariant solution which is identical to the solution obtained
by learning with the true reward. Finally, we also provide a trick to prioritize
past experiences in the replay buffer by utilizing model-estimation errors. We
experimentally validate MQL built upon state-of-the-art off-policy MFRL
methods, and show that MQL largely improves their performance and convergence.
The proposed scheme is simple to implement and incurs no additional training
cost.
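
The abstract describes three components that would need to be wired together in practice: a shared network with $Q$-value, reward, and transition heads, a $Q$-learning target that also uses the estimated reward, and replay prioritization by model-estimation error. The following is a minimal PyTorch sketch of that wiring, assuming a discrete-action setting; the network layout (`SharedQModelNet`), the `reward_mix` coefficient, and the priority formula are illustrative assumptions rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedQModelNet(nn.Module):
    """Shared trunk with Q-value, reward, and transition heads (hypothetical layout)."""

    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.q_head = nn.Linear(hidden, action_dim)                       # Q(s, .)
        self.reward_head = nn.Linear(hidden + action_dim, 1)              # r_hat(s, a)
        self.transition_head = nn.Linear(hidden + action_dim, state_dim)  # s'_hat(s, a)

    def forward(self, state, action_onehot):
        h = self.trunk(state)
        q = self.q_head(h)
        ha = torch.cat([h, action_onehot], dim=-1)
        return q, self.reward_head(ha).squeeze(-1), self.transition_head(ha)


def mql_losses(net, target_net, batch, gamma=0.99, reward_mix=0.5):
    """One training step's losses and replay priorities (illustrative only)."""
    # s, s_next: float [B, state_dim]; a: int64 [B]; r, done: float [B], done in {0., 1.}
    s, a, r, s_next, done = batch
    a_onehot = F.one_hot(a, net.q_head.out_features).float()

    q_all, r_hat, s_next_hat = net(s, a_onehot)
    q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)

    with torch.no_grad():
        # The action input is ignored by the Q head, so any one-hot works here.
        q_next, _, _ = target_net(s_next, a_onehot)
        # One plausible way to "utilize the estimated reward for Q-learning":
        # mix the observed and estimated reward in the bootstrapped target.
        r_target = reward_mix * r + (1.0 - reward_mix) * r_hat
        td_target = r_target + gamma * (1.0 - done) * q_next.max(dim=1).values

    q_loss = F.mse_loss(q_sa, td_target)
    reward_loss = F.mse_loss(r_hat, r)
    transition_loss = F.mse_loss(s_next_hat, s_next)

    # Per-sample model-estimation error, usable as a replay priority
    # (the "trick to prioritize past experiences" mentioned in the abstract).
    with torch.no_grad():
        priority = (r_hat - r).abs() + (s_next_hat - s_next).abs().mean(dim=-1)

    return q_loss + reward_loss + transition_loss, priority
```

In a full agent, the returned per-sample priorities would be written back into a prioritized replay buffer, and the summed loss would be minimized with a standard optimizer alongside a periodically updated target network.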
Related papers
- Rethinking Bradley-Terry Models in Preference-Based Reward Modeling: Foundations, Theory, and Alternatives [14.401557416713315]
We revisit the foundations of using Bradley-Terry (BT) models in reward modeling.
We argue that the BT model is not a necessary choice from the perspective of downstream optimization.
We propose a simple and straightforward upper-bound algorithm, compatible with off-the-shelf binary classifiers.
arXiv Detail & Related papers (2024-11-07T18:57:03Z)
- MetaTrading: An Immersion-Aware Model Trading Framework for Vehicular Metaverse Services [94.61039892220037]
We present a novel immersion-aware model trading framework that incentivizes metaverse users (MUs) to contribute learning models for augmented reality (AR) services in the vehicular metaverse.
Considering dynamic network conditions and privacy concerns, we formulate the reward decisions of MSPs as a multi-agent Markov decision process.
Experimental results demonstrate that the proposed framework can effectively provide higher-value models for object detection and classification in AR services on real AR-related vehicle datasets.
arXiv Detail & Related papers (2024-10-25T16:20:46Z)
- Getting More Juice Out of the SFT Data: Reward Learning from Human Demonstration Improves SFT for LLM Alignment [65.15914284008973]
We propose to leverage an Inverse Reinforcement Learning (IRL) technique to simultaneously build a reward model and a policy model.
We show that the proposed algorithms converge to the stationary solutions of the IRL problem.
Our results indicate that it is beneficial to leverage reward learning throughout the entire alignment process.
arXiv Detail & Related papers (2024-05-28T07:11:05Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and code-base for evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
- Finite-Time Error Analysis of Online Model-Based Q-Learning with a Relaxed Sampling Model [6.663174194579773]
$Q$-learning has proven to be a powerful algorithm in model-free settings.
The extension of $Q$-learning to a model-based framework remains relatively unexplored.
arXiv Detail & Related papers (2024-02-19T06:33:51Z)
- Model Sparsity Can Simplify Machine Unlearning [33.18951938708467]
In response to recent data regulation requirements, machine unlearning (MU) has emerged as a critical process.
Our study introduces a novel model-based perspective: model sparsification via weight pruning.
We show in both theory and practice that model sparsity can boost the multi-criteria unlearning performance of an approximate unlearner.
arXiv Detail & Related papers (2023-04-11T02:12:02Z)
- Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons [79.98542868281473]
We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF).
We show that when training a policy based on the learned reward model, MLE fails while a pessimistic MLE provides policies with improved performance under certain coverage assumptions.
arXiv Detail & Related papers (2023-01-26T18:07:21Z)
- Value Gradient weighted Model-Based Reinforcement Learning [28.366157882991565]
Model-based reinforcement learning (MBRL) is a sample efficient technique to obtain control policies.
VaGraM is a novel method for value-aware model learning.
arXiv Detail & Related papers (2022-04-04T13:28:31Z)
- Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning [72.28520951105207]
Overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning.
We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline.
We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2021-03-22T14:18:39Z)
- Model Embedding Model-Based Reinforcement Learning [4.566180616886624]
Model-based reinforcement learning (MBRL) has shown advantages in sample efficiency over model-free reinforcement learning (MFRL).
Despite the impressive results it achieves, it still faces a trade-off between the ease of data generation and model bias.
We propose a simple and elegant model-embedding model-based reinforcement learning (MEMB) algorithm in the framework of probabilistic reinforcement learning.
arXiv Detail & Related papers (2020-06-16T15:10:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.