Learning Value Functions from Undirected State-only Experience
- URL: http://arxiv.org/abs/2204.12458v1
- Date: Tue, 26 Apr 2022 17:24:36 GMT
- Title: Learning Value Functions from Undirected State-only Experience
- Authors: Matthew Chang, Arjun Gupta, Saurabh Gupta
- Abstract summary: We show that tabular Q-learning in discrete Markov decision processes (MDPs) learns the same value function under any arbitrary refinement of the action space.
This theoretical result motivates the design of Latent Action Q-learning or LAQ, an offline RL method that can learn effective value functions from state-only experience.
We show that LAQ can recover value functions that have high correlation with value functions learned using ground truth actions.
- Score: 17.76847333440422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper tackles the problem of learning value functions from undirected
state-only experience (state transitions without action labels, i.e., (s, s', r)
tuples). We first theoretically characterize the applicability of Q-learning in
this setting. We show that tabular Q-learning in discrete Markov decision
processes (MDPs) learns the same value function under any arbitrary refinement
of the action space. This theoretical result motivates the design of Latent
Action Q-learning or LAQ, an offline RL method that can learn effective value
functions from state-only experience. LAQ learns
value functions using Q-learning on discrete latent actions obtained through a
latent-variable future prediction model. We show that LAQ can recover value
functions that have high correlation with value functions learned using ground
truth actions. Value functions learned using LAQ lead to sample efficient
acquisition of goal-directed behavior, can be used with domain-specific
low-level controllers, and facilitate transfer across embodiments. Our
experiments in 5 environments ranging from 2D grid world to 3D visual
navigation in realistic environments demonstrate the benefits of LAQ over
simpler alternatives, imitation learning oracles, and competing methods.
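Because the method reduces to ordinary Q-learning once discrete latent actions are in hand, the recipe is easy to sketch. Below is a minimal tabular rendition on a toy 5-state chain; note that the hand-rolled latent_action bucketing of transitions by their state delta is a stand-in assumption for the paper's latent-variable future prediction model, and the toy environment is likewise illustrative.

```python
import numpy as np

# Minimal sketch of the LAQ recipe on a toy 5-state chain, assuming
# undirected (s, s', r) tuples with no action labels. The paper infers
# discrete latent actions with a latent-variable future prediction
# model; here a hand-rolled stand-in buckets transitions by their
# effect on the state, purely for illustration.

def latent_action(s, s_next):
    # Placeholder latent-action model: -1/0/+1 state delta -> {0, 1, 2}.
    return int(np.sign(s_next - s)) + 1

def laq_tabular(transitions, n_states, n_latent=3, gamma=0.9, alpha=0.5, iters=200):
    Q = np.zeros((n_states, n_latent))
    for _ in range(iters):
        for s, s_next, r in transitions:
            z = latent_action(s, s_next)          # assign a latent action
            target = r + gamma * Q[s_next].max()  # standard Q-learning target
            Q[s, z] += alpha * (target - Q[s, z])
    return Q.max(axis=1)                          # state-value estimate V(s)

# Undirected experience: random-walk transitions, reward on reaching state 4.
rng = np.random.default_rng(0)
transitions = []
for _ in range(500):
    s = int(rng.integers(0, 5))
    s_next = int(np.clip(s + rng.choice([-1, 0, 1]), 0, 4))
    transitions.append((s, s_next, float(s_next == 4)))

print(laq_tabular(transitions, n_states=5))  # values should rise toward state 4
```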
Related papers
- Towards Plastic and Stable Exemplar-Free Incremental Learning: A Dual-Learner Framework with Cumulative Parameter Averaging [12.168402195820649]
We propose a Dual-Learner framework with Cumulative Parameter Averaging (DLCPA).
We show that DLCPA outperforms several state-of-the-art exemplar-free baselines in both Task-IL and Class-IL settings.
arXiv Detail & Related papers (2023-10-28T08:48:44Z)
- Learning Reward for Physical Skills using Large Language Model [5.795405764196473]
Large Language Models contain valuable task-related knowledge that can aid in learning reward functions.
We aim to extract task knowledge from LLMs using environment feedback to create efficient reward functions for physical skills.
arXiv Detail & Related papers (2023-10-21T19:10:06Z)
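To make the loop in the summary above concrete, here is a hedged sketch of an LLM-in-the-loop reward design cycle: propose a reward function, train and evaluate, feed the resulting environment feedback into the next proposal. The llm_propose_reward stub, the feedback format, and the toy evaluation are hypothetical placeholders, not the paper's interface.

```python
# Hedged sketch of an LLM-in-the-loop reward design cycle. The
# `llm_propose_reward` stub stands in for a real LLM call; the
# feedback format is a hypothetical placeholder, not the paper's.

def llm_propose_reward(task_description, feedback_history):
    # Placeholder: a real system would prompt an LLM with the task
    # description plus past environment feedback and parse returned
    # reward code. Here we return a fixed distance-based shaping term.
    def reward(state, goal):
        return -abs(state - goal)
    return reward

def reward_design_loop(task_description, train_and_eval, n_rounds=3):
    feedback_history = []
    reward_fn = None
    for _ in range(n_rounds):
        reward_fn = llm_propose_reward(task_description, feedback_history)
        success_rate = train_and_eval(reward_fn)  # environment feedback
        feedback_history.append(f"success rate: {success_rate:.2f}")
    return reward_fn

# Toy stand-in for training: score the reward on a fixed rollout.
def toy_train_and_eval(reward_fn):
    rollout, goal = [0.0, 0.4, 0.8, 1.0], 1.0
    return 1.0 + sum(reward_fn(s, goal) for s in rollout) / len(rollout)

final_reward = reward_design_loop("reach the goal position", toy_train_and_eval)
```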
- Contrastive Example-Based Control [163.6482792040079]
We propose a method for offline, example-based control that learns an implicit model of multi-step transitions, rather than a reward function.
Across a range of state-based and image-based offline control tasks, our method outperforms baselines that use learned reward functions.
arXiv Detail & Related papers (2023-07-24T19:43:22Z)
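The implicit multi-step model described in the summary above can be sketched as a contrastive classifier: score whether a future state is reachable from a state-action pair, with other futures in the batch serving as negatives. The architecture, dimensions, and random batch below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Contrastive critic sketch: embed (s, a) and future states, then train
# with an InfoNCE objective where each (s_i, a_i) is paired with its own
# future s_future_i and the rest of the batch acts as negatives.

S_DIM, A_DIM, Z_DIM = 4, 2, 16

sa_encoder = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU(), nn.Linear(64, Z_DIM))
f_encoder = nn.Sequential(nn.Linear(S_DIM, 64), nn.ReLU(), nn.Linear(64, Z_DIM))
opt = torch.optim.Adam(list(sa_encoder.parameters()) + list(f_encoder.parameters()), lr=1e-3)

def infonce_step(s, a, s_future):
    z_sa = sa_encoder(torch.cat([s, a], dim=-1))  # (B, Z)
    z_f = f_encoder(s_future)                     # (B, Z)
    logits = z_sa @ z_f.T                         # (B, B) similarity matrix
    labels = torch.arange(len(s))                 # diagonal entries are positives
    loss = nn.functional.cross_entropy(logits, labels)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Fake batch of offline transitions, for illustration only.
B = 32
loss = infonce_step(torch.randn(B, S_DIM), torch.randn(B, A_DIM), torch.randn(B, S_DIM))
```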
- VA-learning as a more efficient alternative to Q-learning [49.526579981437315]
We introduce VA-learning, which directly learns an advantage function and a value function using bootstrapping.
VA-learning learns off-policy and enjoys theoretical guarantees similar to those of Q-learning.
Thanks to directly learning the advantage and value functions, VA-learning improves sample efficiency over Q-learning.
arXiv Detail & Related papers (2023-05-29T15:44:47Z)
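As a hedged tabular rendition of the idea named in the summary above: keep separate V(s) and A(s, a) tables, bootstrap a shared TD target, and read off Q(s, a) = V(s) + A(s, a). This is one plausible sketch of "directly learns the advantage and value functions via bootstrapping"; the paper's exact update rule differs.

```python
import numpy as np

# Decomposed-Q sketch: a shared TD error updates both the value table
# and the advantage table, with an identifiability shift that keeps
# max_a A(s, a) = 0 while preserving Q(s, a) = V(s) + A(s, a).

def va_update(V, A, s, a, r, s_next, gamma=0.99, alpha=0.1):
    q_next = V[s_next] + A[s_next].max()        # greedy bootstrapped target
    td = r + gamma * q_next - (V[s] + A[s, a])
    V[s] += alpha * td                          # shared signal updates the value
    A[s, a] += alpha * td                       # and credits the taken action
    shift = A[s].max()                          # identifiability: max_a A(s, a) = 0
    V[s] += shift
    A[s] -= shift

n_states, n_actions = 5, 2
V, A = np.zeros(n_states), np.zeros((n_states, n_actions))
va_update(V, A, s=0, a=1, r=1.0, s_next=1)
print(V[0] + A[0])  # implied Q-values for state 0
```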
- Reinforcement Learning from Passive Data via Latent Intentions [86.4969514480008]
We show that passive data can still be used to learn features that accelerate downstream RL.
Our approach learns from passive data by modeling intentions.
Our experiments demonstrate the ability to learn from many forms of passive data, including cross-embodiment video data and YouTube videos.
arXiv Detail & Related papers (2023-04-10T17:59:05Z)
- Goal-Conditioned Q-Learning as Knowledge Distillation [136.79415677706612]
We explore a connection between off-policy reinforcement learning in goal-conditioned settings and knowledge distillation.
We empirically show that this can improve the performance of goal-conditioned off-policy reinforcement learning when the space of goals is high-dimensional.
We also show that this technique can be adapted to allow for efficient learning in the case of multiple simultaneous sparse goals.
arXiv Detail & Related papers (2022-08-28T22:01:10Z)
- Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs [50.75812033462294]
We bridge the gap between the practical success of Q-learning and pessimistic theoretical results.
We present novel methods Q-Rex and Q-RexDaRe.
We show that Q-Rex efficiently finds the optimal policy for linear MDPs.
arXiv Detail & Related papers (2021-10-16T01:47:41Z)
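The two ingredients in the title above are mechanical enough to sketch: a target table that is synced only periodically ("online target Q-learning") and replaying each trajectory in reverse order ("reverse experience replay"), which propagates reward information backward faster. Q-Rex itself is analyzed with linear function approximation; this tabular toy only illustrates the mechanics.

```python
import numpy as np

# Toy tabular sketch: Q-learning updates replayed in reverse trajectory
# order, bootstrapping against a periodically synced target table.

def reverse_replay_update(Q, Q_target, trajectory, gamma=0.99, alpha=0.5):
    for s, a, r, s_next in reversed(trajectory):  # reverse experience replay
        target = r + gamma * Q_target[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])

n_states, n_actions = 5, 2
Q = np.zeros((n_states, n_actions))
Q_target = Q.copy()
# One toy trajectory ending in a rewarding transition into state 4.
trajectory = [(0, 1, 0.0, 1), (1, 1, 0.0, 2), (2, 1, 0.0, 3), (3, 1, 1.0, 4)]
for step in range(20):
    reverse_replay_update(Q, Q_target, trajectory)
    if step % 5 == 0:
        Q_target = Q.copy()  # periodic target sync
print(Q[:, 1])  # values propagate back from the rewarding transition
```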
- Visual Transformer for Task-aware Active Learning [49.903358393660724]
We present a novel pipeline for pool-based Active Learning.
Our method exploits accessible unlabelled examples during training to estimate their correlation with the labelled examples.
Visual Transformer models non-local visual concept dependency between labelled and unlabelled examples.
arXiv Detail & Related papers (2021-06-07T17:13:59Z)
- Pre-trained Word Embeddings for Goal-conditional Transfer Learning in Reinforcement Learning [0.0]
We show how a pre-trained task-independent language model can make a goal-conditional RL agent more sample efficient.
We do this by facilitating transfer learning between different related tasks.
arXiv Detail & Related papers (2020-07-10T06:42:00Z)
- Transfer Reinforcement Learning under Unobserved Contextual Information [16.895704973433382]
We study a transfer reinforcement learning problem where the state transitions and rewards are affected by the environmental context.
We develop a method to obtain causal bounds on the transition and reward functions using the demonstrator's data.
We propose new Q-learning and UCB-Q-learning algorithms that converge to the true value function without bias.
arXiv Detail & Related papers (2020-03-09T22:00:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.