Episodic Reinforcement Learning with Expanded State-reward Space
- URL: http://arxiv.org/abs/2401.10516v1
- Date: Fri, 19 Jan 2024 06:14:36 GMT
- Title: Episodic Reinforcement Learning with Expanded State-reward Space
- Authors: Dayang Liang, Yaru Zhang and Yunlong Liu
- Abstract summary: We introduce an efficient EC-based DRL framework with an expanded state-reward space, where both the expanded states used as input and the expanded rewards used in training contain historical and current information.
Our method simultaneously achieves full utilization of the retrieved information and better evaluation of state values through a Temporal Difference (TD) loss.
- Score: 1.479675621064679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Empowered by deep neural networks, deep reinforcement learning (DRL) has
demonstrated tremendous empirical successes in various domains, including
games, health care, and autonomous driving. Despite these advancements, DRL is
still identified as data-inefficient, as effective policies demand vast numbers
of environmental samples. Recently, episodic control (EC)-based model-free DRL
methods have improved sample efficiency by recalling past experiences from
episodic memory. However, existing EC-based methods suffer from a potential
misalignment between the state and reward spaces because they neglect the
informative (past) retrieved states, which can cause inaccurate value
estimation and degraded policy performance. To
tackle this issue, we introduce an efficient EC-based DRL framework with
an expanded state-reward space, where both the expanded states used as input
and the expanded rewards used in training contain historical and current
information. To be specific, we reuse the historical states retrieved by EC as
part of the input states and integrate the retrieved MC-returns into the
immediate reward of each interactive transition. As a result, our method
simultaneously achieves full utilization of the retrieved information and
better evaluation of state values through a Temporal Difference (TD) loss.
Empirical results on challenging Box2D and MuJoCo tasks demonstrate the
superiority of our method over a recent sibling method and common baselines.
Further, additional Q-value comparison experiments verify our method's
effectiveness in alleviating Q-value overestimation.
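The abstract describes the mechanism only at a high level. The following is a minimal, hypothetical PyTorch sketch of the two ideas it names: concatenating a state retrieved from episodic memory with the current state to form the expanded input, and mixing the retrieved MC-return into the immediate reward before applying a TD loss. All class and function names, the discrete-action Q-learning setup, and the mixing weight `lam` are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EpisodicMemory:
    """Hypothetical episodic memory: maps a (discretized) state key to the
    best Monte-Carlo return seen so far and the state that produced it."""

    def __init__(self):
        self.table = {}  # key -> (stored_state, mc_return)

    def write(self, key, state, mc_return):
        if key not in self.table or mc_return > self.table[key][1]:
            self.table[key] = (state.detach(), float(mc_return))

    def retrieve(self, key, default_state, default_return=0.0):
        # Fall back to the current state and a zero return when nothing is stored.
        return self.table.get(key, (default_state, default_return))


class ExpandedQNetwork(nn.Module):
    """Q-network over the expanded state [current state ; retrieved state]."""

    def __init__(self, state_dim, n_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, retrieved_state):
        return self.net(torch.cat([state, retrieved_state], dim=-1))


def expanded_td_loss(q_net, target_net, batch, lam=0.1, gamma=0.99):
    """One TD step on expanded states and expanded rewards.

    `batch` holds current/next states, the states retrieved from episodic
    memory for each of them, actions, immediate rewards, retrieved MC-returns,
    and done flags. `lam` (an assumed hyperparameter) controls how strongly
    the retrieved MC-return is mixed into the immediate reward.
    """
    s, s_ret, a, r, mc_ret, s2, s2_ret, done = batch
    r_expanded = r + lam * mc_ret                       # expanded reward
    q = q_net(s, s_ret).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(s2, s2_ret).max(dim=1).values
        target = r_expanded + gamma * (1.0 - done) * q_next
    return F.mse_loss(q, target)
```

The paper's Box2D and MuJoCo benchmarks are continuous-control tasks, so an actor-critic variant would be needed in practice; the discrete Q-learning form is used here only to keep the sketch short.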
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm for extending RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy to estimate and OOD data that is hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- SAFE-RL: Saliency-Aware Counterfactual Explainer for Deep Reinforcement Learning Policies [13.26174103650211]
A lack of explainability of learned policies impedes their uptake in safety-critical applications, such as automated driving systems.
Counterfactual (CF) explanations have recently gained prominence for their ability to interpret black-box Deep Learning (DL) models.
We propose using a saliency map to identify the most influential input pixels across the sequence of past observed states by the agent.
We evaluate the effectiveness of our framework in diverse domains, including ADS and the Atari games Pong, Pac-Man, and Space Invaders.
arXiv Detail & Related papers (2024-04-28T21:47:34Z)
- Efficient Deep Reinforcement Learning Requires Regulating Overfitting [91.88004732618381]
We show that high temporal-difference (TD) error on the validation set of transitions is the main culprit that severely affects the performance of deep RL algorithms.
We show that a simple online model selection method that targets the validation TD error is effective across state-based DMC and Gym tasks.
arXiv Detail & Related papers (2023-04-20T17:11:05Z)
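The entry above states only the selection criterion (low TD error on held-out validation transitions). A minimal sketch of how such an online model-selection step might look is given below; the function names, the discrete-action form, and the candidate dictionary are assumptions for illustration, not the paper's actual procedure.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def validation_td_error(q_net, target_net, val_batch, gamma=0.99):
    """Mean squared TD error on a held-out batch of transitions
    (discrete-action form, for brevity)."""
    s, a, r, s2, done = val_batch
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    target = r + gamma * (1.0 - done) * target_net(s2).max(dim=1).values
    return F.mse_loss(q, target).item()


def select_agent(candidates, val_batch):
    """Pick the candidate (e.g., a hyperparameter setting) with the lowest
    validation TD error. `candidates` maps a name to a (q_net, target_net)
    pair; the structure is purely illustrative."""
    scores = {name: validation_td_error(q, t, val_batch)
              for name, (q, t) in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores
```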
- Detecting Out-of-distribution Examples via Class-conditional Impressions Reappearing [30.938412222724608]
Out-of-distribution (OOD) detection aims at enhancing standard deep neural networks to distinguish anomalous inputs from original training data.
Due to privacy and security concerns, auxiliary data tends to be impractical in real-world scenarios.
We propose a data-free method that requires no training on natural data, called Class-Conditional Impressions Reappearing (C2IR).
arXiv Detail & Related papers (2023-03-17T02:55:08Z)
- Neural Episodic Control with State Abstraction [38.95199070504417]
Existing Deep Reinforcement Learning (DRL) algorithms suffer from sample inefficiency.
This work introduces Neural Episodic Control with State Abstraction (NECSA).
We evaluate our approach on MuJoCo and Atari tasks in OpenAI Gym domains.
arXiv Detail & Related papers (2023-01-27T01:55:05Z)
- Age of Semantics in Cooperative Communications: To Expedite Simulation Towards Real via Offline Reinforcement Learning [53.18060442931179]
We propose the age of semantics (AoS) for measuring the semantic freshness of status updates in a cooperative relay communication system.
We derive an online deep actor-critic (DAC) learning scheme under the on-policy temporal difference learning framework.
We then put forward a novel offline DAC scheme, which estimates the optimal control policy from a previously collected dataset.
arXiv Detail & Related papers (2022-09-19T11:55:28Z)
- Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning [105.70602423944148]
We propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making.
Instead of aligning an imagined state with the real state returned by the environment, VCR applies a $Q$-value head to both states and obtains two distributions of action values.
Our method is demonstrated to achieve new state-of-the-art performance among search-free RL algorithms.
arXiv Detail & Related papers (2022-06-25T03:02:25Z)
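As a rough illustration of the alignment step described in the VCR entry above: apply the same Q-value head to an imagined (model-predicted) latent state and to the encoded real state, and align the two resulting action-value distributions. The KL form, the softmax temperature, and the stop-gradient on the real-state side are assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F


def value_consistency_loss(q_head, imagined_latent, real_latent, tau=1.0):
    """Align action-value distributions computed from an imagined latent
    state and from the encoded real state, instead of aligning the latent
    states themselves. Temperature `tau` and the KL form are illustrative."""
    q_imagined = q_head(imagined_latent)          # [batch, n_actions]
    with torch.no_grad():
        q_real = q_head(real_latent)              # target side, no gradient
    p_real = F.softmax(q_real / tau, dim=-1)
    log_p_imagined = F.log_softmax(q_imagined / tau, dim=-1)
    return F.kl_div(log_p_imagined, p_real, reduction="batchmean")
```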
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
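The UWAC entry above describes the key idea: down-weight uncertain, likely OOD, state-action pairs in the training objective. A generic sketch of such a weighting is shown below; the specific weighting rule and the source of the uncertainty estimate are assumptions, not UWAC's exact formulation.

```python
import torch


def uncertainty_weighted_critic_loss(q_values, td_targets, target_std, beta=1.0):
    """Down-weight transitions whose bootstrapped target is uncertain
    (a proxy for OOD state-action pairs).

    `target_std` is a per-sample uncertainty estimate for the target value,
    e.g. the standard deviation across dropout samples or an ensemble.
    The weighting rule below is a generic illustration.
    """
    weights = beta / (target_std + beta)      # in (0, 1]; smaller when more uncertain
    weights = weights / weights.mean()        # keep the overall loss scale comparable
    td_error = q_values - td_targets
    return (weights.detach() * td_error.pow(2)).mean()
```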
- Regularized Behavior Value Estimation [31.332929202377]
We introduce Regularized Behavior Value Estimation (R-BVE).
R-BVE estimates the value of the behavior policy during training and only performs policy improvement at deployment time.
We provide ample empirical evidence of R-BVE's effectiveness, including state-of-the-art performance on the RL Unplugged ATARI dataset.
arXiv Detail & Related papers (2021-03-17T11:34:54Z)
- Continuous Doubly Constrained Batch Reinforcement Learning [93.23842221189658]
We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment.
The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data.
We propose to mitigate this issue via two straightforward penalties: a policy constraint that reduces divergence from the behavior policy and a value constraint that discourages overly optimistic estimates.
arXiv Detail & Related papers (2021-02-18T08:54:14Z)
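The last entry names two penalties, a policy constraint and a value constraint. The sketch below illustrates one generic way such penalties can enter an actor-critic objective; the behavior-cloning proxy for the policy constraint and the spread-based pessimism term are assumptions, not the paper's exact formulation.

```python
import torch


def constrained_actor_loss(critic, actor, states, dataset_actions, alpha=1.0):
    """Policy constraint: maximize the critic value while staying close to the
    dataset (behavior) actions. A behavior-cloning MSE stands in for the
    divergence penalty here."""
    policy_actions = actor(states)
    q = critic(states, policy_actions).squeeze(-1)              # assume critic -> [batch, 1]
    bc_penalty = ((policy_actions - dataset_actions) ** 2).sum(dim=-1)
    return (-q + alpha * bc_penalty).mean()


def pessimistic_target(rewards, q_next, q_next_std, gamma=0.99, kappa=1.0, done=None):
    """Value constraint: discourage over-optimistic targets by subtracting a
    penalty proportional to the spread of the bootstrapped estimate
    (e.g., the std across an ensemble of critics). Illustrative only."""
    not_done = 1.0 if done is None else (1.0 - done)
    return rewards + gamma * not_done * (q_next - kappa * q_next_std)
```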
This list is automatically generated from the titles and abstracts of the papers on this site.