DIRECT: Learning from Sparse and Shifting Rewards using Discriminative
Reward Co-Training
- URL: http://arxiv.org/abs/2301.07421v1
- Date: Wed, 18 Jan 2023 10:42:00 GMT
- Title: DIRECT: Learning from Sparse and Shifting Rewards using Discriminative
Reward Co-Training
- Authors: Philipp Altmann, Thomy Phan, Fabian Ritz, Thomas Gabor and Claudia
Linnhoff-Popien
- Abstract summary: We propose discriminative reward co-training as an extension to deep reinforcement learning algorithms.
A discriminator network is trained concurrently with the policy to distinguish between trajectories generated by the current policy and beneficial trajectories generated by previous policies.
Our results show that DIRECT outperforms state-of-the-art algorithms in sparse- and shifting-reward environments.
- Score: 13.866486498822228
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose discriminative reward co-training (DIRECT) as an extension to deep
reinforcement learning algorithms. Building upon the concept of self-imitation
learning (SIL), we introduce an imitation buffer that stores beneficial
trajectories generated by the policy, selected according to their return. A
discriminator network is trained concurrently with the policy to distinguish
between trajectories generated by the current policy and beneficial
trajectories generated by previous policies. The discriminator's verdict is
used to construct a reward signal for optimizing the policy. By interpolating
prior experience, DIRECT can act as a surrogate, steering policy
optimization towards more valuable regions of the reward landscape and thus
learning an optimal policy. Our results show that DIRECT outperforms
state-of-the-art algorithms in sparse- and shifting-reward environments, as it
provides a surrogate reward to the policy and directs the optimization
towards valuable areas.
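To make the co-training loop concrete, here is a minimal PyTorch sketch of the mechanism the abstract describes: a discriminator is trained to separate buffered high-return trajectories from fresh policy rollouts, and its verdict is converted into a surrogate reward (here a GAIL-style log-odds form). All names and the exact reward shape are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only; the paper's architecture, reward shape,
# and buffer policy may differ.
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """High logit = sample looks like a buffered beneficial trajectory;
    low logit = sample looks like the current policy."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def surrogate_reward(disc: Discriminator, obs, act) -> torch.Tensor:
    # GAIL-style reward: large where the discriminator believes the
    # sample could have come from the imitation buffer.
    with torch.no_grad():
        p = torch.sigmoid(disc(obs, act))
    return torch.log(p + 1e-8) - torch.log(1.0 - p + 1e-8)

def discriminator_step(disc, optimizer, buffer_batch, policy_batch):
    # Co-training step: buffered beneficial samples are labeled 1,
    # fresh samples from the current policy are labeled 0.
    bce = nn.BCEWithLogitsLoss()
    buf_logits = disc(*buffer_batch)
    pol_logits = disc(*policy_batch)
    loss = (bce(buf_logits, torch.ones_like(buf_logits))
            + bce(pol_logits, torch.zeros_like(pol_logits)))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The imitation buffer itself could be as simple as a fixed-size priority queue keyed by episode return, so only the highest-return trajectories survive; the policy would then be optimized (e.g., with PPO) on the surrogate reward instead of, or in addition to, the sparse environment reward.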
Related papers
- Off-Dynamics Reinforcement Learning via Domain Adaptation and Reward Augmented Imitation [19.37193250533054]
We propose to utilize imitation learning to transfer the policy learned under the modified reward to the target domain.
Our approach, Domain Adaptation and Reward Augmented Imitation Learning (DARAIL), utilizes the reward modification for domain adaptation.
arXiv Detail & Related papers (2024-11-15T02:35:20Z)
Forward KL Regularized Preference Optimization for Aligning Diffusion Policies [8.958830452149789]
A central problem for learning diffusion policies is to align the policy output with human intents in various tasks.
We propose a novel framework, Forward KL regularized Preference optimization, to align the diffusion policy with preferences directly.
The results show our method exhibits superior alignment with preferences and outperforms previous state-of-the-art algorithms; a rough sketch of such an objective follows this entry.
arXiv Detail & Related papers (2024-09-09T13:56:03Z)
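To give the forward-KL idea above some shape, here is a rough sketch of what such a preference objective could look like: a Bradley-Terry preference term plus a forward KL(pi_ref || pi) penalty. This composition and the function `fkl_preference_loss` are hypothetical illustrations, not the paper's actual loss.

```python
import torch
import torch.nn.functional as F

def fkl_preference_loss(logp_preferred: torch.Tensor,
                        logp_rejected: torch.Tensor,
                        logp_ref: torch.Tensor,
                        logp_pi: torch.Tensor,
                        beta: float = 1.0,
                        lam: float = 0.1) -> torch.Tensor:
    # Bradley-Terry preference term: raise the policy's log-likelihood
    # of preferred samples above that of rejected ones.
    pref = -F.logsigmoid(beta * (logp_preferred - logp_rejected)).mean()
    # Forward KL(pi_ref || pi), estimated on samples drawn from pi_ref:
    # E_{x ~ pi_ref}[log pi_ref(x) - log pi(x)]. Unlike the reverse KL,
    # it penalizes the policy for dropping modes of the reference.
    fkl = (logp_ref - logp_pi).mean()
    return pref + lam * fkl
```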
PG-Rainbow: Using Distributional Reinforcement Learning in Policy Gradient Methods [0.0]
We introduce PG-Rainbow, a novel algorithm that combines a distributional reinforcement learning framework with a policy gradient algorithm.
We present empirical results showing that, by integrating reward-distribution information into the policy network, the policy agent acquires enhanced capabilities.
arXiv Detail & Related papers (2024-07-18T04:18:52Z)
IOB: Integrating Optimization Transfer and Behavior Transfer for Multi-Policy Reuse [50.90781542323258]
Reinforcement learning (RL) agents can transfer knowledge from source policies to a related target task.
Previous methods introduce additional components, such as hierarchical policies or estimates of the source policies' value functions.
We propose a novel transfer RL method that selects the source policy without training extra components.
arXiv Detail & Related papers (2023-08-14T09:22:35Z)
Acceleration in Policy Optimization [50.323182853069184]
We work towards a unifying paradigm for accelerating policy optimization methods in reinforcement learning (RL) by integrating foresight in the policy improvement step via optimistic and adaptive updates.
We define optimism as predictive modelling of the future behavior of a policy, and adaptivity as taking immediate and anticipatory corrective actions to mitigate errors from overshooting predictions or delayed responses to change.
We design an optimistic policy gradient algorithm, made adaptive via meta-gradient learning, and empirically highlight several design choices pertaining to acceleration in an illustrative task; a toy sketch of an optimistic update follows this entry.
arXiv Detail & Related papers (2023-06-18T15:50:57Z)
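As a toy illustration of the optimism idea referenced above (evaluate the gradient at a predicted future iterate rather than the current one), here is an extragradient-flavored update on a stand-in objective; the paper's actual algorithm and its meta-gradient adaptivity are not reproduced.

```python
import numpy as np

def grad_estimate(theta: np.ndarray) -> np.ndarray:
    # Stand-in for a policy-gradient estimate: gradient of the toy
    # concave objective -(theta - 1)^2, maximized at theta = 1.
    return -2.0 * (theta - 1.0)

theta = np.zeros(3)
prev_g = np.zeros(3)
lr, optimism = 0.1, 1.0
for _ in range(100):
    # Optimistic step: predict where the next iterate will land and
    # take the gradient there, anticipating the policy's future behavior.
    lookahead = theta + optimism * lr * prev_g
    g = grad_estimate(lookahead)
    theta = theta + lr * g
    prev_g = g
print(theta)  # close to the optimum [1. 1. 1.]
```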
Offline Reinforcement Learning with Closed-Form Policy Improvement Operators [88.54210578912554]
Behavior constrained policy optimization has been demonstrated to be a successful paradigm for tackling Offline Reinforcement Learning.
In this paper, we propose closed-form policy improvement operators.
We empirically demonstrate their effectiveness over state-of-the-art algorithms on the standard D4RL benchmark.
arXiv Detail & Related papers (2022-11-29T06:29:26Z)
Rewards Encoding Environment Dynamics Improves Preference-based Reinforcement Learning [4.969254618158096]
We show that encoding environment dynamics in the reward function (REED) dramatically reduces the number of preference labels required in state-of-the-art preference-based RL frameworks.
For some domains, REED-based reward functions result in policies that outperform policies trained on the ground truth reward.
arXiv Detail & Related papers (2022-11-12T00:34:41Z)
Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training; a sketch of the UCB exploration objective follows this entry.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
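A minimal sketch of the upper-confidence-bound objective mentioned above, using a two-critic ensemble; the mean-plus-spread form and all names are assumptions, and the DICE distribution-correction ratio is omitted.

```python
import torch

def ucb_exploration_objective(q1, q2, obs: torch.Tensor,
                              act: torch.Tensor,
                              beta: float = 1.0) -> torch.Tensor:
    # Approximate UCB over two critics: ensemble mean plus a spread term
    # as a crude proxy for epistemic uncertainty. A separate exploration
    # policy is trained to maximize this instead of the plain critic value.
    q_a, q_b = q1(obs, act), q2(obs, act)
    mean = 0.5 * (q_a + q_b)
    spread = 0.5 * (q_a - q_b).abs()
    return (mean + beta * spread).mean()
```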
Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse; a sketch of the underlying BROIL objective follows this entry.
arXiv Detail & Related papers (2021-06-11T16:49:15Z)
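For context on the risk trade-off above, the BROIL family of objectives blends expected return with the conditional value at risk (CVaR) over a posterior of reward hypotheses; the NumPy sketch below is illustrative and omits the policy-gradient machinery PG-BROIL adds on top.

```python
import numpy as np

def broil_objective(returns: np.ndarray, posterior: np.ndarray,
                    lam: float = 0.5, alpha: float = 0.95) -> float:
    """returns[i]: the policy's expected return under reward hypothesis i;
    posterior[i]: posterior probability of hypothesis i (sums to 1)."""
    expected = float(posterior @ returns)
    # CVaR_alpha: average return over the worst (1 - alpha) posterior mass.
    tail = 1.0 - alpha
    order = np.argsort(returns)  # worst hypotheses first
    mass, acc = 0.0, 0.0
    for i in order:
        take = min(posterior[i], tail - mass)
        if take <= 0.0:
            break
        acc += take * returns[i]
        mass += take
    cvar = acc / max(mass, 1e-12)
    # lam = 1 recovers the risk-neutral objective; lam = 0 is fully risk-averse.
    return lam * expected + (1.0 - lam) * cvar
```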
Efficient Deep Reinforcement Learning via Adaptive Policy Transfer [50.51637231309424]
A Policy Transfer Framework (PTF) is proposed to accelerate Reinforcement Learning (RL).
Our framework learns when and which source policy is the best to reuse for the target policy and when to terminate it.
Experimental results show it significantly accelerates the learning process and surpasses state-of-the-art policy transfer methods.
arXiv Detail & Related papers (2020-02-19T07:30:57Z)
Population-Guided Parallel Policy Search for Reinforcement Learning [17.360163137926]
A new population-guided parallel learning scheme is proposed to enhance the performance of off-policy reinforcement learning (RL).
In the proposed scheme, multiple identical learners with their own value functions and policies share a common experience replay buffer and search for a good policy collaboratively, guided by the best policy's information; a sketch of the guidance term follows this entry.
arXiv Detail & Related papers (2020-01-09T10:13:57Z)
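A minimal sketch of the guidance term described above: each learner's usual actor loss is augmented with a penalty pulling its actions toward those of the current best-performing peer on the same states, while the shared replay buffer keeps all learners training on common experience. The squared-distance form and all names are assumptions.

```python
import torch

def guided_actor_loss(base_actor_loss: torch.Tensor,
                      learner_actions: torch.Tensor,
                      best_peer_actions: torch.Tensor,
                      guide_coef: float = 0.1) -> torch.Tensor:
    # Guidance penalty: distance between this learner's actions and the
    # best peer's actions on the same sampled states. detach() stops
    # gradients from flowing into the best peer's policy.
    guidance = ((learner_actions - best_peer_actions.detach()) ** 2).mean()
    return base_actor_loss + guide_coef * guidance
```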
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.