Pre-trained Visual Dynamics Representations for Efficient Policy Learning
- URL: http://arxiv.org/abs/2411.03169v1
- Date: Tue, 05 Nov 2024 15:18:02 GMT
- Title: Pre-trained Visual Dynamics Representations for Efficient Policy Learning
- Authors: Hao Luo, Bohan Zhou, Zongqing Lu,
- Abstract summary: We propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning.
The pre-trained visual dynamics representations capture the visual dynamics prior knowledge in the videos.
This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation.
- Score: 33.62440075940917
- License:
- Abstract: Pre-training for Reinforcement Learning (RL) with purely video data is a valuable yet challenging problem. Although in-the-wild videos are readily available and inhere a vast amount of prior world knowledge, the absence of action annotations and the common domain gap with downstream tasks hinder utilizing videos for RL pre-training. To address the challenge of pre-training with videos, we propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning. By adopting video prediction as a pre-training task, we use a Transformer-based Conditional Variational Autoencoder (CVAE) to learn visual dynamics representations. The pre-trained visual dynamics representations capture the visual dynamics prior knowledge in the videos. This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation. We conduct experiments on a series of robotics visual control tasks and verify that PVDR is an effective form for pre-training with videos to promote policy learning.
Related papers
- ActPrompt: In-Domain Feature Adaptation via Action Cues for Video Temporal Grounding [40.60371529725805]
We propose an efficient preliminary in-domain fine-tuning paradigm for feature adaptation.
We introduce Action-Cue-Injected Temporal Prompt Learning (ActPrompt), which injects action cues into the image encoder of VLM for better discovering action-sensitive patterns.
arXiv Detail & Related papers (2024-08-13T04:18:32Z) - Any-point Trajectory Modeling for Policy Learning [64.23861308947852]
We introduce Any-point Trajectory Modeling (ATM) to predict future trajectories of arbitrary points within a video frame.
ATM outperforms strong video pre-training baselines by 80% on average.
We show effective transfer learning of manipulation skills from human videos and videos from a different robot morphology.
arXiv Detail & Related papers (2023-12-28T23:34:43Z) - Value Explicit Pretraining for Learning Transferable Representations [11.069853883599102]
We propose a method that learns generalizable representations for transfer reinforcement learning.
We learn new tasks that share similar objectives as previously learned tasks, by learning an encoder for objective-conditioned representations.
Experiments using a realistic navigation simulator and Atari benchmark show that the pretrained encoder produced by our method outperforms current SoTA pretraining methods.
arXiv Detail & Related papers (2023-12-19T17:12:35Z) - Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z) - PIVOT: Prompting for Video Continual Learning [50.80141083993668]
We introduce PIVOT, a novel method that leverages extensive knowledge in pre-trained models from the image domain.
Our experiments show that PIVOT improves state-of-the-art methods by a significant 27% on the 20-task ActivityNet setup.
arXiv Detail & Related papers (2022-12-09T13:22:27Z) - Action-Conditioned Contrastive Policy Pretraining [39.13710045468429]
Deep visuomotor policy learning achieves promising results in control tasks such as robotic manipulation and autonomous driving.
It requires a huge number of online interactions with the training environment, which limits its real-world application.
In this work, we aim to pretrain policy representations for driving tasks using hours-long uncurated YouTube videos.
arXiv Detail & Related papers (2022-04-05T17:58:22Z) - Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both final performances and sample-efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z) - PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z) - CUPID: Adaptive Curation of Pre-training Data for Video-and-Language
Representation Learning [49.18591896085498]
We propose CUPID to bridge the domain gap between source and target data.
CUPID yields new state-of-the-art performance across multiple video-language and video tasks.
arXiv Detail & Related papers (2021-04-01T06:42:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.