Effects of sparse rewards of different magnitudes in the speed of
learning of model-based actor critic methods
- URL: http://arxiv.org/abs/2001.06725v1
- Date: Sat, 18 Jan 2020 20:52:05 GMT
- Title: Effects of sparse rewards of different magnitudes in the speed of
learning of model-based actor critic methods
- Authors: Juan Vargas, Lazar Andjelic, Amir Barati Farimani
- Abstract summary: We show that we can influence an agent to learn faster by applying an external environmental pressure during training.
Results have been shown to be valid for Deep Deterministic Policy Gradients using Hindsight Experience Replay in a well-known MuJoCo environment.
- Score: 0.4640835690336653
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Actor critic methods with sparse rewards in model-based deep reinforcement
learning typically require a deterministic binary reward function that reflects
only two possible outcomes: whether, at each step, the goal has been achieved or
not. Our hypothesis is that we can influence an agent to learn faster by
applying an external environmental pressure during training, which adversely
impacts its ability to get higher rewards. As such, we deviate from the
classical paradigm of sparse rewards and add a uniformly sampled reward value
to the baseline reward to show that (1) sample efficiency of the training
process can be correlated to the adversity experienced during training, (2) it
is possible to achieve higher performance in less time and with fewer resources,
(3) we can reduce the seed-to-seed variability in performance, (4)
there is a maximum point after which more pressure will not generate better
results, and (5) random positive incentives have an adverse effect when
using a negative reward strategy, making an agent under those conditions learn
poorly and more slowly. These results have been shown to be valid for Deep
Deterministic Policy Gradients using Hindsight Experience Replay in a well-known
MuJoCo environment, but we argue that they could be generalized to other
methods and environments as well.
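The reward modification described above can be sketched in a few lines. The snippet below is a minimal illustration, assuming a Gym-style goal-conditioned task of the kind commonly paired with DDPG and Hindsight Experience Replay; the distance threshold, the pressure interval, and the function names are illustrative assumptions rather than the authors' exact settings.

```python
import numpy as np

def sparse_reward(achieved_goal, desired_goal, threshold=0.05):
    # Classic binary sparse reward: 0 when the goal is reached, -1 otherwise.
    distance = np.linalg.norm(np.asarray(achieved_goal) - np.asarray(desired_goal))
    return 0.0 if distance < threshold else -1.0

def pressured_reward(achieved_goal, desired_goal, rng, low=-1.0, high=0.0):
    # Baseline sparse reward plus a uniformly sampled "environmental pressure".
    # The [low, high] interval is an assumption; the abstract only states that
    # pressure helps up to a point and that positive incentives under a
    # negative-reward strategy slow learning down.
    return sparse_reward(achieved_goal, desired_goal) + rng.uniform(low, high)

# Example: near-miss of the goal under a fixed random seed.
rng = np.random.default_rng(seed=0)
print(pressured_reward([0.10, 0.00, 0.20], [0.00, 0.00, 0.20], rng))
```

In such a setup the pressured reward would replace the plain binary reward only during training (including HER-relabeled transitions), with evaluation still scored on the unmodified sparse reward.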
Related papers
- DreamSmooth: Improving Model-based Reinforcement Learning via Reward
Smoothing [60.21269454707625]
DreamSmooth learns to predict a temporally-smoothed reward instead of the exact reward at the given timestep (a minimal smoothing sketch appears after this list).
We show that DreamSmooth achieves state-of-the-art performance on long-horizon sparse-reward tasks.
arXiv Detail & Related papers (2023-11-02T17:57:38Z)
- The Enemy of My Enemy is My Friend: Exploring Inverse Adversaries for
Improving Adversarial Training [72.39526433794707]
Adversarial training and its variants have been shown to be the most effective approaches to defend against adversarial examples.
We propose a novel adversarial training scheme that encourages the model to produce similar outputs for an adversarial example and its "inverse adversarial" counterpart.
Our training method achieves state-of-the-art robustness as well as natural accuracy.
arXiv Detail & Related papers (2022-11-01T15:24:26Z)
- Distributional Reward Estimation for Effective Multi-Agent Deep
Reinforcement Learning [19.788336796981685]
We propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL).
Our main idea is to design multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training.
The superiority of DRE-MARL is demonstrated on benchmark multi-agent scenarios, where it is compared with SOTA baselines in terms of both effectiveness and robustness.
arXiv Detail & Related papers (2022-10-14T08:31:45Z)
- Self-Supervised Exploration via Temporal Inconsistency in Reinforcement
Learning [17.360622968442982]
We present a novel intrinsic reward inspired by human learning, as humans evaluate curiosity by comparing current observations with historical knowledge.
Our method involves training a self-supervised prediction model, saving snapshots of the model parameters, and using nuclear norm to evaluate the temporal inconsistency between the predictions of different snapshots as intrinsic rewards.
arXiv Detail & Related papers (2022-08-24T08:19:41Z)
- Imitating Past Successes can be Very Suboptimal [145.70788608016755]
We show that existing outcome-conditioned imitation learning methods do not necessarily improve the policy.
We show that a simple modification results in a method that does guarantee policy improvement.
Our aim is not to develop an entirely new method, but rather to explain how a variant of outcome-conditioned imitation learning can be used to maximize rewards.
arXiv Detail & Related papers (2022-06-07T15:13:43Z)
- Causal Confusion and Reward Misidentification in Preference-Based Reward
Learning [33.944367978407904]
We study causal confusion and reward misidentification when learning from preferences.
We find that the presence of non-causal distractor features, noise in the stated preferences, and partial state observability can all exacerbate reward misidentification.
arXiv Detail & Related papers (2022-04-13T18:41:41Z)
- Imitating, Fast and Slow: Robust learning from demonstrations via
decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for
Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
- Imitation Learning by State-Only Distribution Matching [2.580765958706854]
Imitation Learning from observation describes policy learning in a similar way to human learning.
We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
arXiv Detail & Related papers (2022-02-09T08:38:50Z)
- Combating False Negatives in Adversarial Imitation Learning [67.99941805086154]
In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the desired behavior.
As the trained policy learns to be more successful, the negative examples become increasingly similar to expert ones.
We propose a method to alleviate the impact of false negatives and test it on the BabyAI environment.
arXiv Detail & Related papers (2020-02-02T14:56:39Z)
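For the DreamSmooth entry above, the following is a minimal sketch of one way a sparse per-episode reward sequence could be temporally smoothed before being used as a reward-model training target in model-based RL; the Gaussian kernel and the sigma value are assumptions for illustration, not the paper's exact smoothing function.

```python
import numpy as np

def smooth_rewards(rewards, sigma=2.0):
    # Spread a sparse reward spike over neighbouring timesteps with a
    # normalized Gaussian kernel over time indices.
    rewards = np.asarray(rewards, dtype=float)
    t = np.arange(len(rewards))
    kernel = np.exp(-0.5 * ((t[:, None] - t[None, :]) / sigma) ** 2)
    kernel /= kernel.sum(axis=1, keepdims=True)
    return kernel @ rewards

# Example: a single success reward at the last step of an 8-step episode.
print(smooth_rewards([0, 0, 0, 0, 0, 0, 0, 1]))
```

The smoothed sequence is easier for a reward model to predict than an isolated spike, which is consistent with the intuition behind predicting a smoothed reward.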
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.