ViVa: Video-Trained Value Functions for Guiding Online RL from Diverse Data
- URL: http://arxiv.org/abs/2503.18210v1
- Date: Sun, 23 Mar 2025 21:24:33 GMT
- Title: ViVa: Video-Trained Value Functions for Guiding Online RL from Diverse Data
- Authors: Nitish Dashora, Dibya Ghosh, Sergey Levine
- Abstract summary: We propose and analyze a data-driven methodology that automatically guides RL by learning from widely available video data. We use intent-conditioned value functions to learn from diverse videos and incorporate these goal-conditioned values into the reward. Our experiments show that video-trained value functions work well with a variety of data sources, exhibit positive transfer from human video pre-training, can generalize to unseen goals, and scale with dataset size.
- Score: 56.217490064597506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Online reinforcement learning (RL) with sparse rewards poses a challenge partly because of the lack of feedback on states leading to the goal. Furthermore, expert offline data with reward signal is rarely available to provide this feedback and bootstrap online learning. How can we guide online agents to the right solution without this on-task data? Reward shaping offers a solution by providing fine-grained signal to nudge the policy towards the optimal solution. However, reward shaping often requires domain knowledge to hand-engineer heuristics for a specific goal. To enable more general and inexpensive guidance, we propose and analyze a data-driven methodology that automatically guides RL by learning from widely available video data such as Internet recordings, off-task demonstrations, task failures, and undirected environment interaction. By learning a model of optimal goal-conditioned value from diverse passive data, we open the floor to scaling up and using various data sources to model general goal-reaching behaviors relevant to guiding online RL. Specifically, we use intent-conditioned value functions to learn from diverse videos and incorporate these goal-conditioned values into the reward. Our experiments show that video-trained value functions work well with a variety of data sources, exhibit positive transfer from human video pre-training, can generalize to unseen goals, and scale with dataset size.
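The abstract describes folding a video-trained, goal-conditioned value into the online reward. As a rough illustration of that idea (not the paper's exact formulation), the sketch below uses a potential-based shaping term derived from a hypothetical pre-trained value model `value_fn(state, goal)`; the interface names and coefficients are assumptions.

```python
def shaped_reward(sparse_reward, value_fn, state, next_state, goal,
                  gamma=0.99, shaping_coeff=1.0):
    """Combine a sparse task reward with guidance from a goal-conditioned value.

    `value_fn(s, g)` stands in for a value model pre-trained on passive video
    (e.g. an intent-conditioned value function queried with the task goal);
    the potential-based form below is one standard way to fold such a value
    into the reward and may differ from the paper's exact formulation.
    """
    # Potential-based term gamma * V(s', g) - V(s, g): dense signal toward the
    # goal that leaves the optimal policy of the sparse-reward MDP unchanged.
    shaping = gamma * value_fn(next_state, goal) - value_fn(state, goal)
    return sparse_reward + shaping_coeff * shaping
```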
Related papers
- Real-World Offline Reinforcement Learning from Vision Language Model Feedback [19.494335952082466]
Offline reinforcement learning can enable policy learning from pre-collected, sub-optimal datasets without online interactions.
Most existing offline RL works assume the dataset is already labeled with the task rewards.
We propose a novel system that automatically generates reward labels for offline datasets.
arXiv Detail & Related papers (2024-11-08T02:12:34Z) - Hindsight Preference Learning for Offline Preference-based Reinforcement Learning [22.870967604847458]
Offline preference-based reinforcement learning (RL) focuses on optimizing policies using human preferences between pairs of trajectory segments selected from an offline dataset.
We propose to model human preferences using rewards conditioned on future outcomes of the trajectory segments.
Our proposed method, Hindsight Preference Learning (HPL), can facilitate credit assignment by taking full advantage of vast trajectory data available in massive unlabeled datasets.
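The HPL summary mentions rewards conditioned on the future outcomes of trajectory segments. The generic building block such preference-based methods rely on is a Bradley-Terry likelihood over segment returns; the sketch below shows that recipe with a hypothetical hindsight-conditioned reward model `reward_fn(obs, act, future)`, and is an illustration of the general pattern rather than HPL's actual objective.

```python
import torch
import torch.nn.functional as F

def hindsight_preference_loss(reward_fn, seg_a, seg_b, pref_a):
    """Bradley-Terry preference loss over two trajectory segments.

    `reward_fn(obs, act, future)` is a hypothetical reward model conditioned on
    an embedding of the segment's future outcome (the hindsight conditioning
    mentioned in the summary); `pref_a` is 1.0 if segment A was preferred.
    """
    ret_a = reward_fn(seg_a["obs"], seg_a["act"], seg_a["future"]).sum()
    ret_b = reward_fn(seg_b["obs"], seg_b["act"], seg_b["future"]).sum()
    # P(A preferred) = sigmoid(return_A - return_B) under the Bradley-Terry model.
    return F.binary_cross_entropy_with_logits(ret_a - ret_b, torch.tensor(pref_a))
```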
arXiv Detail & Related papers (2024-07-05T12:05:37Z) - Towards Generalist Robot Learning from Internet Video: A Survey [56.621902345314645]
We present an overview of the emerging field of Learning from Videos (LfV).
LfV aims to address the robotics data bottleneck by augmenting traditional robot data with large-scale internet video data.
We review current methods for extracting knowledge from large-scale internet video, for addressing key challenges in LfV, and for boosting downstream robot and reinforcement learning via the use of video data.
arXiv Detail & Related papers (2024-04-30T15:57:41Z) - Robotic Offline RL from Internet Videos via Value-Function Pre-Training [67.44673316943475]
We develop a system for leveraging large-scale human video datasets in robotic offline RL.
We show that value learning on video datasets learns representations more conducive to downstream robotic offline RL than other approaches.
arXiv Detail & Related papers (2023-09-22T17:59:14Z) - ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP)
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
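The ALP summary pairs an RL objective with inverse dynamics prediction. A minimal sketch of the inverse-dynamics part is below: given two consecutive observations, predict the action taken between them. The module and parameter names are illustrative, not ALP's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InverseDynamicsHead(nn.Module):
    """Predicts the action taken between two consecutive observations.

    A generic sketch of an inverse-dynamics auxiliary objective; the encoder
    and head design here are assumptions, not ALP's published architecture.
    """

    def __init__(self, encoder, embed_dim, action_dim):
        super().__init__()
        self.encoder = encoder  # shared visual encoder shaped by this auxiliary loss
        self.predictor = nn.Sequential(
            nn.Linear(2 * embed_dim, 256), nn.ReLU(), nn.Linear(256, action_dim)
        )

    def loss(self, obs, next_obs, action):
        z = torch.cat([self.encoder(obs), self.encoder(next_obs)], dim=-1)
        # Regressing the action that connects obs -> next_obs pushes the encoder
        # to represent the controllable parts of the scene.
        return F.mse_loss(self.predictor(z), action)
```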
arXiv Detail & Related papers (2023-06-16T21:51:04Z) - Model-Based Reinforcement Learning with Multi-Task Offline Pretraining [59.82457030180094]
We present a model-based RL method that learns to transfer potentially useful dynamics and action demonstrations from offline data to a novel task.
The main idea is to use the world models not only as simulators for behavior learning but also as tools to measure the task relevance.
We demonstrate the advantages of our approach compared with the state-of-the-art methods in Meta-World and DeepMind Control Suite.
arXiv Detail & Related papers (2023-06-06T02:24:41Z) - Reinforcement Learning from Passive Data via Latent Intentions [86.4969514480008]
We show that passive data can still be used to learn features that accelerate downstream RL.
Our approach learns from passive data by modeling intentions.
Our experiments demonstrate the ability to learn from many forms of passive data, including cross-embodiment video data and YouTube videos.
arXiv Detail & Related papers (2023-04-10T17:59:05Z) - Few-Shot Preference Learning for Human-in-the-Loop RL [13.773589150740898]
Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries.
We reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20$\times$, and demonstrate the effectiveness of our method on a real Franka Panda Robot.
arXiv Detail & Related papers (2022-12-06T23:12:26Z) - PsiPhi-Learning: Reinforcement Learning with Demonstrations using Successor Features and Inverse Temporal Difference Learning [102.36450942613091]
We propose an inverse reinforcement learning algorithm, called inverse temporal difference learning (ITD).
We show how to seamlessly integrate ITD with learning from online environment interactions, arriving at a novel algorithm for reinforcement learning with demonstrations, called $\Psi\Phi$-learning.
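This entry builds on successor features. As a reminder of that building block only (not the paper's full algorithm), successor features decompose a Q-value into task-agnostic feature expectations and a task weight vector; the function names below are illustrative.

```python
import numpy as np

def q_from_successor_features(psi, w):
    """Q(s, a) = psi(s, a) . w, where psi are the successor features (discounted
    expected cumulants phi) and w is a task vector with r(s, a) = phi(s, a) . w."""
    return np.dot(psi, w)

def successor_feature_td_target(phi, psi_next, gamma=0.99):
    """One-step TD target for psi: phi(s, a) + gamma * psi(s', a')."""
    return phi + gamma * psi_next
```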
arXiv Detail & Related papers (2021-02-24T21:12:09Z)