Episodic Reinforcement Learning with Expanded State-reward Space
- URL: http://arxiv.org/abs/2401.10516v1
- Date: Fri, 19 Jan 2024 06:14:36 GMT
- Title: Episodic Reinforcement Learning with Expanded State-reward Space
- Authors: Dayang Liang, Yaru Zhang and Yunlong Liu
- Abstract summary: We introduce an efficient EC-based DRL framework with an expanded state-reward space, where both the expanded states used as input and the expanded rewards used in training contain historical and current information.
Our method simultaneously achieves full utilization of the retrieved information and better evaluation of state values through a Temporal Difference (TD) loss.
- Score: 1.479675621064679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Empowered by deep neural networks, deep reinforcement learning (DRL) has
demonstrated tremendous empirical successes in various domains, including
games, health care, and autonomous driving. Despite these advancements, DRL is
still identified as data-inefficient, as effective policies demand vast numbers
of environmental samples. Recently, episodic control (EC)-based model-free DRL
methods have improved sample efficiency by recalling past experiences from
episodic memory. However, existing EC-based methods suffer from a potential
misalignment between the state and reward spaces because they neglect the
informative (past) retrieved states, which can cause inaccurate value
estimation and degraded policy performance. To
tackle this issue, we introduce an efficient EC-based DRL framework with
an expanded state-reward space, where both the expanded states used as input
and the expanded rewards used in training contain historical and current
information. To be specific, we reuse the historical states retrieved by EC as
part of the input states and integrate the retrieved MC-returns into the
immediate reward of each interactive transition. As a result, our method
simultaneously achieves full utilization of the retrieved information and
better evaluation of state values through a Temporal Difference (TD) loss.
Empirical results on challenging Box2D and MuJoCo tasks demonstrate the
superiority of our method over a recent sibling method and common baselines.
Further, additional Q-value comparison experiments verify our method's
effectiveness in alleviating Q-value overestimation.
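The abstract describes the mechanism only at a high level. The following is a minimal, hypothetical PyTorch sketch of the two ideas it names: concatenating a state retrieved from episodic memory with the current state to form the expanded input, and mixing the retrieved MC-return into the immediate reward before applying a TD loss. All class and function names, the discrete-action Q-learning setup, and the mixing weight `lam` are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EpisodicMemory:
    """Hypothetical episodic memory: maps a (discretized) state key to the
    best Monte-Carlo return seen so far and the state that produced it."""

    def __init__(self):
        self.table = {}  # key -> (stored_state, mc_return)

    def write(self, key, state, mc_return):
        if key not in self.table or mc_return > self.table[key][1]:
            self.table[key] = (state.detach(), float(mc_return))

    def retrieve(self, key, default_state, default_return=0.0):
        # Fall back to the current state and a zero return when nothing is stored.
        return self.table.get(key, (default_state, default_return))


class ExpandedQNetwork(nn.Module):
    """Q-network over the expanded state [current state ; retrieved state]."""

    def __init__(self, state_dim, n_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, retrieved_state):
        return self.net(torch.cat([state, retrieved_state], dim=-1))


def expanded_td_loss(q_net, target_net, batch, lam=0.1, gamma=0.99):
    """One TD step on expanded states and expanded rewards.

    `batch` holds current/next states, the states retrieved from episodic
    memory for each of them, actions, immediate rewards, retrieved MC-returns,
    and done flags. `lam` (an assumed hyperparameter) controls how strongly
    the retrieved MC-return is mixed into the immediate reward.
    """
    s, s_ret, a, r, mc_ret, s2, s2_ret, done = batch
    r_expanded = r + lam * mc_ret                       # expanded reward
    q = q_net(s, s_ret).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(s2, s2_ret).max(dim=1).values
        target = r_expanded + gamma * (1.0 - done) * q_next
    return F.mse_loss(q, target)
```

The paper's Box2D and MuJoCo benchmarks are continuous-control tasks, so an actor-critic variant would be needed in practice; the discrete Q-learning form is used here only to keep the sketch short.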
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm for extending RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy to estimate and OOD data that is hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- SAFE-RL: Saliency-Aware Counterfactual Explainer for Deep Reinforcement Learning Policies [13.26174103650211]
A lack of explainability of learned policies impedes their uptake in safety-critical applications, such as automated driving systems.
Counterfactual (CF) explanations have recently gained prominence for their ability to interpret black-box Deep Learning (DL) models.
We propose using a saliency map to identify the most influential input pixels across the sequence of past observed states by the agent.
We evaluate the effectiveness of our framework in diverse domains, including ADS and the Atari games Pong, Pac-Man, and Space Invaders.
arXiv Detail & Related papers (2024-04-28T21:47:34Z)
- Efficient Deep Reinforcement Learning Requires Regulating Overfitting [91.88004732618381]
We show that high temporal-difference (TD) error on the validation set of transitions is the main culprit that severely affects the performance of deep RL algorithms.
We show that a simple online model selection method that targets the validation TD error is effective across state-based DMC and Gym tasks.
arXiv Detail & Related papers (2023-04-20T17:11:05Z)
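The entry above states only the selection criterion (low TD error on held-out validation transitions). A minimal sketch of how such an online model-selection step might look is given below; the function names, the discrete-action form, and the candidate dictionary are assumptions for illustration, not the paper's actual procedure.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def validation_td_error(q_net, target_net, val_batch, gamma=0.99):
    """Mean squared TD error on a held-out batch of transitions
    (discrete-action form, for brevity)."""
    s, a, r, s2, done = val_batch
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    return F.mse_loss(q, target).item()


def select_agent(candidates, val_batch):
    """Pick the candidate (e.g., a hyperparameter setting) with the lowest
    validation TD error. `candidates` maps a name to a (q_net, target_net)
    pair; the structure is purely illustrative."""
    scores = {name: validation_td_error(q, t, val_batch)
              for name, (q, t) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores
```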
- Detecting Out-of-distribution Examples via Class-conditional Impressions Reappearing [30.938412222724608]
Out-of-distribution (OOD) detection aims at enhancing standard deep neural networks to distinguish anomalous inputs from original training data.
Due to privacy and security concerns, auxiliary data tends to be impractical in real-world scenarios.
We propose a data-free method that requires no training on natural data, called Class-Conditional Impressions Reappearing (C2IR).
arXiv Detail & Related papers (2023-03-17T02:55:08Z)
- Neural Episodic Control with State Abstraction [38.95199070504417]
Existing Deep Reinforcement Learning (DRL) algorithms suffer from sample inefficiency.
This work introduces Neural Episodic Control with State Abstraction (NECSA).
We evaluate our approach on MuJoCo and Atari tasks in OpenAI Gym domains.
arXiv Detail & Related papers (2023-01-27T01:55:05Z)
- Age of Semantics in Cooperative Communications: To Expedite Simulation Towards Real via Offline Reinforcement Learning [53.18060442931179]
We propose the age of semantics (AoS) for measuring the semantic freshness of status updates in a cooperative relay communication system.
We derive an online deep actor-critic (DAC) learning scheme under the on-policy temporal difference learning framework.
We then put forward a novel offline DAC scheme, which estimates the optimal control policy from a previously collected dataset.
arXiv Detail & Related papers (2022-09-19T11:55:28Z)
- Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning [105.70602423944148]
We propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making.
Instead of aligning an imagined state with the real state returned by the environment, VCR applies a $Q$-value head to both states and obtains two distributions of action values.
Our method is demonstrated to achieve new state-of-the-art performance among search-free RL algorithms.
arXiv Detail & Related papers (2022-06-25T03:02:25Z)
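As a rough illustration of the alignment step described in the VCR entry above: apply the same Q-value head to an imagined (model-predicted) latent state and to the encoded real state, and align the two resulting action-value distributions. The KL form, the softmax temperature, and the stop-gradient on the real-state side are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def value_consistency_loss(q_head, imagined_latent, real_latent, tau=1.0):
    """Align action-value distributions computed from an imagined latent
    state and from the encoded real state, instead of aligning the latent
    states themselves. Temperature `tau` and the KL form are illustrative."""
    q_imagined = q_head(imagined_latent)          # [batch, n_actions]
    with torch.no_grad():
        q_real = q_head(real_latent)              # target side, no gradient
    p_real = F.softmax(q_real / tau, dim=-1)
    log_p_imagined = F.log_softmax(q_imagined / tau, dim=-1)
    return F.kl_div(log_p_imagined, p_real, reduction="batchmean")
```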
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
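The UWAC entry above describes the key idea: down-weight uncertain, likely OOD, state-action pairs in the training objective. A generic sketch of such a weighting is shown below; the specific weighting rule and the source of the uncertainty estimate are assumptions, not UWAC's exact formulation.

```python
import torch


def uncertainty_weighted_critic_loss(q_values, td_targets, target_std, beta=1.0):
    """Down-weight transitions whose bootstrapped target is uncertain
    (a proxy for OOD state-action pairs).

    `target_std` is a per-sample uncertainty estimate for the target value,
    e.g. the standard deviation across dropout samples or an ensemble.
    The weighting rule below is a generic illustration.
    """
    weights = beta / (target_std + beta)      # in (0, 1]; smaller when more uncertain
    weights = weights / weights.mean()        # keep the overall loss scale comparable
    td_error = q_values - td_targets
    return (weights.detach() * td_error.pow(2)).mean()
```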
- Regularized Behavior Value Estimation [31.332929202377]
We introduce Regularized Behavior Value Estimation (R-BVE).
R-BVE estimates the value of the behavior policy during training and only performs policy improvement at deployment time.
We provide ample empirical evidence of R-BVE's effectiveness, including state-of-the-art performance on the RL Unplugged ATARI dataset.
arXiv Detail & Related papers (2021-03-17T11:34:54Z)
- Continuous Doubly Constrained Batch Reinforcement Learning [93.23842221189658]
We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment.
The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data.
We propose to mitigate this issue via two straightforward penalties: a policy constraint that reduces divergence from the behavior policy and a value constraint that discourages overly optimistic estimates.
arXiv Detail & Related papers (2021-02-18T08:54:14Z)
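The last entry names two penalties, a policy constraint and a value constraint. The sketch below illustrates one generic way such penalties can enter an actor-critic objective; the behavior-cloning proxy for the policy constraint and the spread-based pessimism term are assumptions, not the paper's exact formulation.

```python
import torch


def constrained_actor_loss(critic, actor, states, dataset_actions, alpha=1.0):
    """Policy constraint: maximize the critic value while staying close to the
    dataset (behavior) actions. A behavior-cloning MSE stands in for the
    divergence penalty here."""
    policy_actions = actor(states)
    q = critic(states, policy_actions).squeeze(-1)              # assume critic -> [batch, 1]
    bc_penalty = ((policy_actions - dataset_actions) ** 2).sum(dim=-1)
    return (-q + alpha * bc_penalty).mean()


def pessimistic_target(rewards, q_next, q_next_std, gamma=0.99, kappa=1.0, done=None):
    """Value constraint: discourage over-optimistic targets by subtracting a
    penalty proportional to the spread of the bootstrapped estimate
    (e.g., the std across an ensemble of critics). Illustrative only."""
    not_done = 1.0 if done is None else (1.0 - done)
    return rewards + gamma * not_done * (q_next - kappa * q_next_std)
```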
This list is automatically generated from the titles and abstracts of the papers on this site.