Multi-State TD Target for Model-Free Reinforcement Learning
- URL: http://arxiv.org/abs/2405.16522v4
- Date: Fri, 2 Aug 2024 00:21:41 GMT
- Title: Multi-State TD Target for Model-Free Reinforcement Learning
- Authors: Wuhao Wang, Zhiyong Chen, Lepeng Zhang
- Abstract summary: Temporal difference (TD) learning is a fundamental technique in reinforcement learning that updates value estimates for states or state-action pairs.
We propose an enhanced multi-state TD (MSTD) target that utilizes the estimated values of multiple subsequent states.
- Score: 3.9801926395657325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal difference (TD) learning is a fundamental technique in reinforcement learning that updates value estimates for states or state-action pairs using a TD target. This target represents an improved estimate of the true value by incorporating both immediate rewards and the estimated value of subsequent states. Traditionally, TD learning relies on the value of a single subsequent state. We propose an enhanced multi-state TD (MSTD) target that utilizes the estimated values of multiple subsequent states. Building on this new MSTD concept, we develop complete actor-critic algorithms that include management of replay buffers in two modes, and integrate with deep deterministic policy gradient (DDPG) and soft actor-critic (SAC). Experimental results demonstrate that algorithms employing the MSTD target significantly improve learning performance compared to traditional methods. The code is provided on GitHub.
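The abstract specifies only that the MSTD target bootstraps from the estimated values of multiple subsequent states, not its exact functional form. The sketch below is therefore a minimal illustration under one plausible reading: it mixes the k n-step bootstrapped estimates built from the critic's values of the next k states. The function name, the uniform default weighting, and the NumPy formulation are assumptions for illustration; the authors' GitHub code defines the actual target and replay-buffer handling.

```python
import numpy as np

def mstd_target(rewards, next_values, gamma=0.99, weights=None):
    """Illustrative multi-state TD (MSTD) style target (not the paper's exact formula).

    Combines the k bootstrapped estimates
        G_n = sum_{i<n} gamma^i * r_{t+i} + gamma^n * V(s_{t+n}),  n = 1..k,
    each of which looks ahead to a different subsequent state, into one target.

    rewards:     [r_t, ..., r_{t+k-1}]           (length k)
    next_values: [V(s_{t+1}), ..., V(s_{t+k})]   (length k, critic estimates)
    weights:     optional mixing weights over the k estimates (default: uniform)
    """
    k = len(rewards)
    assert len(next_values) == k
    weights = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, dtype=float)

    targets = np.empty(k)
    discounted_reward_sum = 0.0
    for n in range(1, k + 1):
        discounted_reward_sum += gamma ** (n - 1) * rewards[n - 1]
        targets[n - 1] = discounted_reward_sum + gamma ** n * next_values[n - 1]
    return float(np.dot(weights, targets))

# With k = 1 this reduces to the ordinary single-state TD target r_t + gamma * V(s_{t+1}).
y = mstd_target(rewards=[1.0, 0.5, 0.0], next_values=[2.0, 1.8, 1.5])
```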
Related papers
- Discerning Temporal Difference Learning [5.439020425819001]
Temporal difference learning (TD) is a foundational concept in reinforcement learning (RL).
We propose a novel TD algorithm named discerning TD learning (DTD).
arXiv Detail & Related papers (2023-10-12T07:38:10Z)
- Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning [105.70602423944148]
We propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making.
Instead of aligning this imagined state with a real state returned by the environment, VCR applies a $Q$-value head on both states and obtains two distributions of action values.
It has been demonstrated that our methods achieve new state-of-the-art performance for search-free RL algorithms.
arXiv Detail & Related papers (2022-06-25T03:02:25Z)
- Temporal Difference Learning for Model Predictive Control [29.217382374051347]
Data-driven model predictive control has two key advantages over model-free methods.
TD-MPC achieves superior sample efficiency and performance over prior work on both state and image-based continuous control tasks.
arXiv Detail & Related papers (2022-03-09T18:58:28Z)
- A Generalized Bootstrap Target for Value-Learning, Efficiently Combining Value and Feature Predictions [39.17511693008055]
Estimating value functions is a core component of reinforcement learning algorithms.
We focus on bootstrapping targets used when estimating value functions.
We propose a new backup target, the $\eta$-return mixture.
arXiv Detail & Related papers (2022-01-05T21:54:55Z)
- Centralizing State-Values in Dueling Networks for Multi-Robot Reinforcement Learning Mapless Navigation [87.85646257351212]
We study the problem of multi-robot mapless navigation in the popular Centralized Training and Decentralized Execution (CTDE) paradigm.
This problem is challenging when each robot considers its path without explicitly sharing observations with other robots.
We propose a novel architecture for CTDE that uses a centralized state-value network to compute a joint state-value.
arXiv Detail & Related papers (2021-12-16T16:47:00Z)
- Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates [110.92598350897192]
Q-Learning has proven effective at learning a policy to perform control tasks.
Estimation noise becomes a bias after the max operator in the policy improvement step.
We present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state Markov Decision Processes.
arXiv Detail & Related papers (2021-10-28T00:07:19Z)
- DRILL-- Deep Reinforcement Learning for Refinement Operators in $\mathcal{ALC}$ [1.9036571490366496]
We propose DRILL -- a novel class expression learning approach that uses a convolutional deep Q-learning model to steer its search.
By virtue of its architecture, DRILL is able to compute the expected discounted cumulated future reward of more than $10^3$ class expressions in a second on standard hardware.
arXiv Detail & Related papers (2021-06-29T12:57:45Z)
- Preferential Temporal Difference Learning [53.81943554808216]
We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update.
We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.
arXiv Detail & Related papers (2021-06-11T17:05:15Z)
- Experience Replay with Likelihood-free Importance Weights [123.52005591531194]
We propose to reweight experiences based on their likelihood under the stationary distribution of the current policy.
We apply the proposed approach empirically on two competitive methods, Soft Actor-Critic (SAC) and Twin Delayed Deep Deterministic policy gradient (TD3); see the sketch after this list.
arXiv Detail & Related papers (2020-06-23T17:17:44Z)
- Zeroth-Order Supervised Policy Improvement [94.0748002906652]
Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL).
We propose Zeroth-Order Supervised Policy Improvement (ZOSPI).
ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods.
arXiv Detail & Related papers (2020-06-11T16:49:23Z)
- Universal Value Density Estimation for Imitation Learning and Goal-Conditioned Reinforcement Learning [5.406386303264086]
In both imitation learning and goal-conditioned reinforcement learning, effective solutions require the agent to reliably reach a specified state.
This work introduces an approach which utilizes recent advances in density estimation to effectively learn to reach a given state.
As our first contribution, we use this approach for goal-conditioned reinforcement learning and show that it is efficient and does not suffer from hindsight bias.
As our second contribution, we extend the approach to imitation learning and show that it achieves state-of-the-art demonstration sample-efficiency on standard benchmark tasks.
arXiv Detail & Related papers (2020-02-15T23:46:29Z)
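For the "Experience Replay with Likelihood-free Importance Weights" entry above, replay transitions are reweighted toward the stationary distribution of the current policy. One common likelihood-free way to approximate the required density ratio is a binary classifier that separates fresh on-policy samples from buffer samples; the sketch below uses that classifier trick as an assumption, is not necessarily the estimator used in that paper, and the helper name replay_weights is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def replay_weights(buffer_sa, onpolicy_sa, eps=1e-6):
    """Classifier-based density-ratio sketch: w(s, a) ~ d_pi(s, a) / d_buffer(s, a).

    A logistic classifier is trained to distinguish fresh on-policy transitions
    (label 1) from replay-buffer transitions (label 0); the odds p / (1 - p) on
    buffer samples then estimate each stored transition's importance weight.
    """
    X = np.vstack([buffer_sa, onpolicy_sa])
    y = np.concatenate([np.zeros(len(buffer_sa)), np.ones(len(onpolicy_sa))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    p = clf.predict_proba(buffer_sa)[:, 1]      # P(on-policy | s, a)
    w = p / np.clip(1.0 - p, eps, None)         # likelihood-free ratio estimate
    return w / w.mean()                          # self-normalize for stable scaling

# Toy example with 4-dimensional (state, action) features.
rng = np.random.default_rng(0)
w = replay_weights(rng.normal(size=(256, 4)), rng.normal(size=(64, 4)) + 0.3)
```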
This list is automatically generated from the titles and abstracts of the papers on this site.