Learning from Visual Observation via Offline Pretrained State-to-Go Transformer
- URL: http://arxiv.org/abs/2306.12860v1
- Date: Thu, 22 Jun 2023 13:14:59 GMT
- Title: Learning from Visual Observation via Offline Pretrained State-to-Go Transformer
- Authors: Bohan Zhou, Ke Li, Jiechuan Jiang, Zongqing Lu
- Abstract summary: We propose a two-stage framework for learning from visual observation.
In the first stage, we pretrain the State-to-Go (STG) Transformer offline to predict and differentiate latent transitions of demonstrations.
In the second stage, the STG Transformer provides intrinsic rewards for downstream reinforcement learning tasks.
- Score: 29.548242447584194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning from visual observation (LfVO), which aims to recover policies from
visual observation data alone, is a promising yet challenging problem. Existing
LfVO approaches either adopt only inefficient online learning schemes or
require additional task-specific information such as goal states, making them
ill-suited for open-ended tasks. To address these issues, we propose a two-stage
framework for learning from visual observation. In the first stage, we
introduce and pretrain the State-to-Go (STG) Transformer offline to predict and
differentiate latent transitions of demonstrations. Subsequently, in the second
stage, the STG Transformer provides intrinsic rewards for downstream
reinforcement learning tasks where an agent learns merely from intrinsic
rewards. Empirical results on Atari and Minecraft show that our proposed method
outperforms baselines and in some tasks even achieves performance comparable to
the policy learned from environmental rewards. These results shed light on the
potential of utilizing video-only data to solve difficult visual reinforcement
learning tasks rather than relying on complete offline datasets containing
states, actions, and rewards. The project's website and code can be found at
https://sites.google.com/view/stgtransformer.
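The abstract describes the two-stage recipe only in prose. As a rough illustration, below is a minimal, hypothetical PyTorch sketch (not the authors' released code) of how such a pipeline could be wired: a CNN encoder maps frames to latents, a Transformer over the latent history predicts the next latent, and a discriminator over latent transitions yields a score that a downstream RL agent could use as its intrinsic reward. All class, layer, and method names here are illustrative assumptions.

```python
# Hypothetical sketch, not the paper's implementation: an STG-style module that
# (1) encodes frames into latents, (2) predicts the next latent with a Transformer,
# and (3) discriminates expert-like latent transitions. The discriminator output
# serves as an intrinsic reward for a downstream RL agent.
import torch
import torch.nn as nn


class STGSketch(nn.Module):
    def __init__(self, latent_dim: int = 128, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        # Placeholder visual encoder (frame -> latent vector).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(latent_dim),
        )
        # Transformer over a short history of latent states.
        layer = nn.TransformerEncoderLayer(latent_dim, n_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)
        self.predict_next = nn.Linear(latent_dim, latent_dim)
        # Discriminator over (current latent, next latent) pairs.
        self.discriminator = nn.Sequential(
            nn.Linear(2 * latent_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    def forward(self, obs_seq: torch.Tensor, next_obs: torch.Tensor):
        # obs_seq: (B, T, C, H, W); next_obs: (B, C, H, W)
        b, t = obs_seq.shape[:2]
        z_seq = self.encoder(obs_seq.flatten(0, 1)).view(b, t, -1)
        z_next = self.encoder(next_obs)
        z_pred = self.predict_next(self.temporal(z_seq)[:, -1])  # predicted next latent
        logits = self.discriminator(torch.cat([z_seq[:, -1], z_next], dim=-1))
        return z_pred, z_next, logits

    @torch.no_grad()
    def intrinsic_reward(self, obs_seq: torch.Tensor, next_obs: torch.Tensor) -> torch.Tensor:
        # Higher score => the observed transition looks more like the demonstrations.
        _, _, logits = self(obs_seq, next_obs)
        return torch.sigmoid(logits).squeeze(-1)
```

Under this sketch, the offline stage would train the predicted latent against the encoded next frame (e.g., a regression or contrastive objective) and the discriminator to separate demonstration transitions from agent transitions, while the online stage would feed intrinsic_reward to the RL learner in place of the environment reward. The paper's actual losses, architectures, and hyperparameters are not reproduced here.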
Related papers
- Pre-trained Visual Dynamics Representations for Efficient Policy Learning [33.62440075940917]
We propose Pre-trained Visual Dynamics Representations (PVDR) to bridge the domain gap between videos and downstream tasks for efficient policy learning.
The pre-trained representations capture prior knowledge of visual dynamics from the videos.
This abstract prior knowledge can be readily adapted to downstream tasks and aligned with executable actions through online adaptation.
arXiv Detail & Related papers (2024-11-05T15:18:02Z)
- Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration [54.8229698058649]
We study how unlabeled prior trajectory data can be leveraged to learn efficient exploration strategies.
Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits.
We empirically show that SUPE reliably outperforms prior strategies, successfully solving a suite of long-horizon, sparse-reward tasks.
arXiv Detail & Related papers (2024-10-23T17:58:45Z)
- Do Vision-Language Transformers Exhibit Visual Commonsense? An Empirical Study of VCR [51.72751335574947]
Visual Commonsense Reasoning (VCR) calls for explanatory reasoning behind question answering over visual scenes.
Progress on the benchmark dataset stems largely from recent advances in Vision-Language Transformers (VL Transformers).
This paper posits that the VL Transformers do not exhibit visual commonsense, which is the key to VCR.
arXiv Detail & Related papers (2024-05-27T08:26:58Z)
- Value Explicit Pretraining for Learning Transferable Representations [11.069853883599102]
We propose a method that learns generalizable representations for transfer reinforcement learning.
We learn new tasks that share objectives similar to those of previously learned tasks by learning an encoder for objective-conditioned representations.
Experiments using a realistic navigation simulator and Atari benchmark show that the pretrained encoder produced by our method outperforms current SoTA pretraining methods.
arXiv Detail & Related papers (2023-12-19T17:12:35Z)
- Goal-Guided Transformer-Enabled Reinforcement Learning for Efficient Autonomous Navigation [15.501449762687148]
We present a Goal-guided Transformer-enabled reinforcement learning (GTRL) approach for goal-driven navigation.
Our approach encourages the scene representation to concentrate on goal-relevant features, which substantially improves the data efficiency of the DRL learning process.
Both simulation and real-world experiments demonstrate the superiority of our approach in terms of data efficiency, performance, robustness, and sim-to-real generalization.
arXiv Detail & Related papers (2023-01-01T07:14:30Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- Pushing the Limits of Simple Pipelines for Few-Shot Learning: External Data and Fine-Tuning Make a Difference [74.80730361332711]
Few-shot learning is an important and topical problem in computer vision.
We show that a simple transformer-based pipeline yields surprisingly good performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-15T02:55:58Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is an unsupervised SSL framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
- Off-policy Imitation Learning from Visual Inputs [83.22342811160114]
We propose OPIfVI, which combines off-policy learning, data augmentation, and encoder techniques.
We show that OPIfVI is able to achieve expert-level performance and outperform existing baselines.
arXiv Detail & Related papers (2021-11-08T09:06:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.