Pessimistic Auxiliary Policy for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2602.23974v2
- Date: Thu, 05 Mar 2026 09:03:30 GMT
- Title: Pessimistic Auxiliary Policy for Offline Reinforcement Learning
- Authors: Fan Zhang, Baoru Huang, Xin Zhang,
- Abstract summary: We construct a new pessimistic auxiliary policy for sampling reliable actions. The pessimistic auxiliary strategy exhibits a relatively high value and low uncertainty in the vicinity of the learned policy. Experiments on offline reinforcement learning benchmarks reveal that utilizing the pessimistic auxiliary strategy can effectively improve the efficacy of other offline RL approaches.
- Score: 9.466490274149955
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning aims to learn an agent from pre-collected datasets, avoiding unsafe and inefficient real-time interaction. However, inevitable access to out-of-distribution actions during the learning process introduces approximation errors, causing error accumulation and considerable overestimation. In this paper, we construct a new pessimistic auxiliary policy for sampling reliable actions. Specifically, we develop the pessimistic auxiliary strategy by maximizing the lower confidence bound of the Q-function. The pessimistic auxiliary strategy exhibits relatively high value and low uncertainty in the vicinity of the learned policy, preventing the learned policy from sampling high-value actions with potentially large errors during training. Because actions sampled from the pessimistic auxiliary strategy introduce less approximation error, error accumulation is alleviated. Extensive experiments on offline reinforcement learning benchmarks reveal that utilizing the pessimistic auxiliary strategy can effectively improve the efficacy of other offline RL approaches.
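The core construction described in the abstract (an auxiliary actor trained on a lower confidence bound of the Q-function, whose actions are then used in place of the learned policy's actions) lends itself to a short sketch. The snippet below is a minimal illustration under the assumption that uncertainty is quantified with a Q-ensemble; the names `lcb`, `aux_policy`, the coefficient `beta`, and the ensemble-based LCB are illustrative choices, not details taken from the paper.

```python
# Minimal sketch (not the authors' code): a pessimistic auxiliary policy trained to
# maximize a lower confidence bound (LCB) of the Q-function, here estimated with a
# Q-ensemble. `beta`, the ensemble-based LCB, and all names are illustrative assumptions.
import torch


def lcb(q_ensemble, state, action, beta=1.0):
    """Lower confidence bound: ensemble mean minus beta times ensemble std."""
    qs = torch.stack([q(state, action) for q in q_ensemble], dim=0)  # (K, B, 1)
    return qs.mean(dim=0) - beta * qs.std(dim=0)


def auxiliary_policy_loss(aux_policy, q_ensemble, state, beta=1.0):
    """Update the auxiliary actor to maximize the LCB of its own actions, so it
    prefers actions that are both high-value and low-uncertainty."""
    action = aux_policy(state)  # deterministic actor for simplicity
    return -lcb(q_ensemble, state, action, beta).mean()


def pessimistic_target(aux_policy, target_q_ensemble, reward, next_state,
                       gamma=0.99, beta=1.0):
    """Bootstrap the critic with actions sampled from the auxiliary policy instead of
    the learned policy, so the target Q-value carries less approximation error.
    (Termination masking omitted for brevity.)"""
    with torch.no_grad():
        next_action = aux_policy(next_state)
        next_q = lcb(target_q_ensemble, next_state, next_action, beta)
    return reward + gamma * next_q
```

In this sketch the critic's bootstrap target uses auxiliary-policy actions, which is one plausible way to realize "sampling reliable actions"; the paper may combine the auxiliary policy with the base offline RL algorithm differently.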
Related papers
- Why Do Unlearnable Examples Work: A Novel Perspective of Mutual Information [55.75102049412629]
We show that effective unlearnable examples always decrease mutual information between clean features and poisoned features. We propose a novel unlearnable method called Mutual Information Unlearnable Examples (MI-UE). Our approach significantly outperforms the previous methods, even under defense mechanisms.
arXiv Detail & Related papers (2026-03-04T04:53:29Z) - Hindsight-DICE: Stable Credit Assignment for Deep Reinforcement Learning [11.084321518414226]
We adapt existing importance-sampling ratio estimation techniques for off-policy evaluation to drastically improve the stability and efficiency of so-called hindsight policy methods.
Our hindsight distribution correction facilitates stable, efficient learning across a broad range of environments where credit assignment plagues baseline methods.
arXiv Detail & Related papers (2023-07-21T20:54:52Z) - Assessor-Guided Learning for Continual Environments [17.181933166255448]
This paper proposes an assessor-guided learning strategy for continual learning.
An assessor guides the learning process of a base learner by controlling the direction and pace of the learning process.
The assessor is trained in a meta-learning manner with a meta-objective to boost the learning process of the base learner.
arXiv Detail & Related papers (2023-03-21T06:45:14Z) - Uncertainty-Aware Instance Reweighting for Off-Policy Learning [63.31923483172859]
We propose an Uncertainty-aware Inverse Propensity Score estimator (UIPS) for improved off-policy learning.
Experiment results on synthetic and three real-world recommendation datasets demonstrate the advantageous sample efficiency of the proposed UIPS estimator.
arXiv Detail & Related papers (2023-03-11T11:42:26Z) - Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z) - Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z) - Imitation Learning by State-Only Distribution Matching [2.580765958706854]
Imitation Learning from observation describes policy learning in a similar way to human learning.
We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
arXiv Detail & Related papers (2022-02-09T08:38:50Z) - False Correlation Reduction for Offline Reinforcement Learning [115.11954432080749]
We propose falSe COrrelation REduction (SCORE) for offline RL, a practically effective and theoretically provable algorithm.
We empirically show that SCORE achieves SoTA performance with 3.1x acceleration on various tasks in a standard benchmark (D4RL).
arXiv Detail & Related papers (2021-10-24T15:34:03Z) - Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z) - Learning Pessimism for Robust and Efficient Off-Policy Reinforcement Learning [0.0]
Off-policy deep reinforcement learning algorithms compensate for overestimation bias during temporal-difference learning.
In this work, we propose a novel learnable penalty to enact such pessimism.
We also propose to learn the penalty alongside the critic with dual TD-learning, a strategy to estimate and minimize the bias magnitude in the target returns.
arXiv Detail & Related papers (2021-10-07T12:13:19Z) - Reducing Conservativeness Oriented Offline Reinforcement Learning [29.895142928565228]
In offline reinforcement learning, a policy learns to maximize cumulative rewards with a fixed collection of data.
We propose a method oriented toward reducing conservativeness in offline reinforcement learning.
Our proposed method is able to tackle the skewed distribution of the provided dataset and derive a value function closer to the expected value function.
arXiv Detail & Related papers (2021-02-27T01:21:01Z) - Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)