Preferential Temporal Difference Learning
- URL: http://arxiv.org/abs/2106.06508v1
- Date: Fri, 11 Jun 2021 17:05:15 GMT
- Title: Preferential Temporal Difference Learning
- Authors: Nishanth Anand, Doina Precup
- Abstract summary: We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update.
We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.
- Score: 53.81943554808216
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal-Difference (TD) learning is a general and very useful tool for
estimating the value function of a given policy, which in turn is required to
find good policies. Generally speaking, TD learning updates states whenever
they are visited. When the agent lands in a state, its value can be used to
compute the TD-error, which is then propagated to other states. However, it may
be interesting, when computing updates, to take into account other information
than whether a state is visited or not. For example, some states might be more
important than others (such as states which are frequently seen in a successful
trajectory). Or, some states might have unreliable value estimates (for
example, due to partial observability or lack of data), making their values
less desirable as targets. We propose an approach to re-weighting states used
in TD updates, both when they are the input and when they provide the target
for the update. We prove that our approach converges with linear function
approximation and illustrate its desirable empirical behaviour compared to
other TD-style methods.
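The abstract above describes weighting states both when they are updated and when they serve as bootstrap targets. Below is a minimal tabular sketch of that idea, assuming a user-supplied preference function beta(s) in [0, 1]; the names and the simplified update are illustrative and are not the paper's exact Preferential TD rule.

```python
def preferential_td_episode(start_state, policy_step, V, beta,
                            alpha=0.1, gamma=0.99):
    """TD(0)-style evaluation with a state-preference weighting.

    policy_step : callable s -> (next_state, reward, done); assumed to
                  sample one transition under the policy being evaluated
    V           : array or dict of value estimates indexed by state
    beta        : callable s -> weight in [0, 1]; higher means the state
                  is updated more strongly and trusted more as a target
    Hedged sketch of the re-weighting idea, not the paper's exact rule.
    """
    s, done = start_state, False
    while not done:
        s_next, r, done = policy_step(s)
        # Target weighting: trust the next state's value only to the
        # extent it is preferred. A full treatment would carry the
        # remaining credit further back (e.g. with traces); omitted here.
        bootstrap = 0.0 if done else gamma * beta(s_next) * V[s_next]
        td_error = r + bootstrap - V[s]
        # Input weighting: low-preference states receive small updates.
        V[s] += alpha * beta(s) * td_error
        s = s_next
    return V
```

With beta identically 1 this reduces to ordinary TD(0); lowering beta at, say, aliased or rarely visited states shrinks both how much they are updated and how much their estimates influence other states.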
Related papers
- Multi-State TD Target for Model-Free Reinforcement Learning [3.9801926395657325]
Temporal difference (TD) learning is a fundamental technique in reinforcement learning that updates value estimates for states or state-action pairs.
We propose an enhanced multi-state TD (MSTD) target that utilizes the estimated values of multiple subsequent states.
arXiv Detail & Related papers (2024-05-26T11:17:49Z)
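The MSTD entry above only states that the target uses the estimated values of multiple subsequent states. One plausible reading, shown purely as an illustration and not as the paper's exact construction, averages the n-step bootstrapped targets over the next k states:

```python
import numpy as np

def multi_state_td_target(rewards, values, gamma=0.99):
    """Uniform average of n-step bootstrapped targets for n = 1..k.

    rewards : rewards r_1..r_k observed after the state being updated
    values  : value estimates V(s_1)..V(s_k) of the k subsequent states
              (assumed non-terminal)
    Illustrative construction only; the MSTD weighting may differ.
    """
    targets, ret = [], 0.0
    for n in range(1, len(rewards) + 1):
        ret += gamma ** (n - 1) * rewards[n - 1]          # discounted rewards
        targets.append(ret + gamma ** n * values[n - 1])  # bootstrap at step n
    return float(np.mean(targets))
```

The updated state's value would then be moved toward this target in the usual way, e.g. V[s] += alpha * (target - V[s]).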
- Reinforcement Learning from Passive Data via Latent Intentions [86.4969514480008]
We show that passive data can still be used to learn features that accelerate downstream RL.
Our approach learns from passive data by modeling intentions.
Our experiments demonstrate the ability to learn from many forms of passive data, including cross-embodiment video data and YouTube videos.
arXiv Detail & Related papers (2023-04-10T17:59:05Z)
- Value-Consistent Representation Learning for Data-Efficient Reinforcement Learning [105.70602423944148]
We propose a novel method, called value-consistent representation learning (VCR), to learn representations that are directly related to decision-making.
Instead of aligning the model-predicted (imagined) state with the real state returned by the environment, VCR applies a Q-value head on both states and obtains two distributions of action values.
It has been demonstrated that our method achieves new state-of-the-art performance for search-free RL algorithms.
arXiv Detail & Related papers (2022-06-25T03:02:25Z)
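The VCR entry above says a Q-value head is applied to both the model-imagined next state and the real one, yielding two action-value distributions. A minimal sketch of one way to align them follows, assuming a discrete action space; `encoder`, `dynamics_model`, and `q_head` are placeholder modules, and the KL alignment is an assumption rather than the paper's exact loss.

```python
import torch.nn.functional as F

def value_consistency_loss(encoder, dynamics_model, q_head, obs, action, next_obs):
    """Align action-value distributions of an imagined and a real next state."""
    z = encoder(obs)                        # latent of the current observation
    z_imagined = dynamics_model(z, action)  # model-predicted next latent
    z_real = encoder(next_obs)              # latent of the observed next state

    q_imagined = q_head(z_imagined)         # (batch, num_actions) action values
    q_real = q_head(z_real).detach()        # no gradient through the real branch

    # Treat soft-maxed action values as distributions and match them.
    log_p = F.log_softmax(q_imagined, dim=-1)
    target = F.softmax(q_real, dim=-1)
    return F.kl_div(log_p, target, reduction="batchmean")
```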
- Topological Experience Replay [22.84244156916668]
Deep Q-learning methods update Q-values using state transitions sampled from the experience replay buffer.
We organize the agent's experience into a graph that explicitly tracks the dependency between Q-values of states.
We empirically show that our method is substantially more data-efficient than several baselines on a diverse range of goal-reaching tasks.
arXiv Detail & Related papers (2022-03-29T18:28:20Z)
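The Topological Experience Replay entry above describes organizing experience into a graph over Q-value dependencies. One common instantiation of that idea, sketched here with simplified scheduling that may differ from the paper's, performs Q-learning backups in reverse breadth-first order starting from terminal states:

```python
from collections import defaultdict, deque

def reverse_sweep_backup(transitions, Q, actions, alpha=0.5, gamma=0.99):
    """One backward breadth-first sweep of Q-learning backups.

    transitions : iterable of (s, a, r, s_next, done) tuples
    Q           : defaultdict(float) mapping (state, action) -> estimate
    actions     : shared discrete action set
    """
    incoming = defaultdict(list)   # state -> transitions that enter it
    frontier = deque()
    for t in transitions:
        incoming[t[3]].append(t)
        if t[4]:                   # terminal transition: start the sweep here
            frontier.append(t[3])

    visited = set()
    while frontier:
        state = frontier.popleft()
        if state in visited:
            continue
        visited.add(state)
        for s, a, r, s_next, done in incoming[state]:
            # Successor values are fresh because they were updated first.
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            frontier.append(s)     # push the sweep further from the goal
    return Q
```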
- Temporal-Difference Value Estimation via Uncertainty-Guided Soft Updates [110.92598350897192]
Q-Learning has proven effective at learning a policy to perform control tasks.
Estimation noise becomes a bias after the max operator in the policy improvement step.
We present Unbiased Soft Q-Learning (UQL), which extends the work of EQL from two-action, finite-state spaces to multi-action, infinite-state Markov Decision Processes.
arXiv Detail & Related papers (2021-10-28T00:07:19Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
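Implicit Q-Learning is widely documented to avoid evaluating out-of-dataset actions by fitting a state-value function with expectile regression against the Q-values of dataset actions, and then using that V as the TD target for Q. A compact sketch of those two losses follows; tensor shapes and batching are assumptions.

```python
import torch

def expectile_value_loss(q_values, v_values, tau=0.7):
    """Fit V(s) toward Q(s, a) for dataset actions via expectile regression.

    q_values, v_values : 1-D tensors for a sampled batch
    tau                : expectile in (0, 1); tau > 0.5 pushes V toward the
                         upper range of in-dataset action values
    """
    diff = q_values.detach() - v_values
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

def q_td_loss(q_values, rewards, next_v_values, dones, gamma=0.99):
    """TD loss for Q, bootstrapping from the learned V at the next state."""
    target = rewards + gamma * (1.0 - dones) * next_v_values.detach()
    return (q_values - target).pow(2).mean()
```

Policy extraction then typically uses advantage-weighted regression on dataset actions, again without querying unseen actions.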
- Approximate information state for approximate planning and reinforcement learning in partially observed systems [0.7646713951724009]
We show that if a function of the history (called approximate information state (AIS)) approximately satisfies the properties of the information state, then there is a corresponding approximate dynamic program.
We show that several approximations in state, observation and action spaces in the literature can be viewed as instances of AIS.
A salient feature of AIS is that it can be learnt from data.
arXiv Detail & Related papers (2020-10-17T18:30:30Z)
- Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm.
Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function.
We develop an approach for representation learning in RL that sits in between these two extremes.
This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
arXiv Detail & Related papers (2020-02-19T18:10:20Z)
- Minimax Value Interval for Off-Policy Evaluation and Policy Optimization [28.085288472120705]
We study minimax methods for off-policy evaluation using value functions and marginalized importance weights.
Although these methods hold promise for overcoming the exponential variance of traditional importance sampling, several key problems remain.
For the sake of trustworthy OPE, is there any way to quantify the biases?
arXiv Detail & Related papers (2020-02-06T02:54:11Z)
- Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings [0.0]
We construct confidence intervals (CIs) for a policy's value in infinite horizon settings where the number of decision points diverges to infinity.
We show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique.
We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patients' health status.
arXiv Detail & Related papers (2020-01-13T19:42:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.