SpawnNet: Learning Generalizable Visuomotor Skills from Pre-trained Networks
- URL: http://arxiv.org/abs/2307.03567v2
- Date: Sun, 22 Oct 2023 03:16:41 GMT
- Title: SpawnNet: Learning Generalizable Visuomotor Skills from Pre-trained Networks
- Authors: Xingyu Lin, John So, Sashwat Mahalingam, Fangchen Liu, Pieter Abbeel
- Abstract summary: We present a study of the generalization capabilities of the pre-trained visual representations at the categorical level.
We propose SpawnNet, a novel two-stream architecture that learns to fuse pre-trained multi-layer representations into a separate network to learn a robust policy.
- Score: 52.766795949716986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The existing internet-scale image and video datasets cover a wide range of
everyday objects and tasks, bringing the potential of learning policies that
generalize in diverse scenarios. Prior works have explored visual pre-training
with different self-supervised objectives. However, the generalization
capabilities of the learned policies, and their advantages over well-tuned
baselines, remain unclear. In this work, we present a focused
study of the generalization capabilities of the pre-trained visual
representations at the categorical level. We identify the key bottleneck in
using a frozen pre-trained visual backbone for policy learning and then propose
SpawnNet, a novel two-stream architecture that learns to fuse pre-trained
multi-layer representations into a separate network to learn a robust policy.
Through extensive simulated and real experiments, we show significantly better
categorical generalization compared to prior approaches in imitation learning
settings. Open-sourced code and videos can be found on our website:
https://xingyu-lin.github.io/spawnnet.
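To make the two-stream fusion idea from the abstract concrete, here is a minimal PyTorch sketch. It is not the authors' released implementation: the number of fusion points, the 1x1-convolution adapters with additive fusion, the trainable CNN stream, and the policy head are all illustrative assumptions, and the frozen pre-trained backbone is stood in for by a list of intermediate feature maps passed to the forward call.

```python
# Minimal sketch (assumptions noted above) of fusing frozen multi-layer
# features from a pre-trained backbone into a separately trained stream.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamPolicy(nn.Module):
    def __init__(self, feat_dims=(384, 384, 384), action_dim=7):
        super().__init__()
        stream_channels = (64, 128, 256)
        blocks, in_ch = [], 3
        for out_ch in stream_channels:
            blocks.append(nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1), nn.ReLU()))
            in_ch = out_ch
        self.stream = nn.ModuleList(blocks)  # trainable stream, learned from scratch
        # 1x1 adapters project frozen features into the stream's channel space.
        self.adapters = nn.ModuleList([
            nn.Conv2d(d, c, kernel_size=1)
            for d, c in zip(feat_dims, stream_channels)])
        self.policy_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(stream_channels[-1], 256), nn.ReLU(),
            nn.Linear(256, action_dim))

    def forward(self, image, pretrained_feats):
        # pretrained_feats: list of frozen feature maps (B, C_i, H_i, W_i),
        # one per fusion point, computed by a frozen pre-trained backbone.
        x = image
        for block, adapter, feat in zip(self.stream, self.adapters, pretrained_feats):
            x = block(x)
            feat = F.interpolate(feat, size=x.shape[-2:],
                                 mode="bilinear", align_corners=False)
            x = x + adapter(feat)  # additive fusion (an assumption)
        return self.policy_head(x)

# Usage with dummy tensors standing in for a real frozen backbone's outputs:
image = torch.randn(2, 3, 224, 224)
frozen_feats = [torch.randn(2, 384, s, s) for s in (28, 14, 7)]
actions = TwoStreamPolicy()(image, frozen_feats)  # -> shape (2, 7)
```

In the actual system the feature maps would come from a frozen pre-trained vision backbone, and the whole policy would be trained with imitation learning on demonstrations.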
Related papers
- Vision-Language Models Provide Promptable Representations for Reinforcement Learning [67.40524195671479]
We propose a novel approach that uses the vast amounts of general and indexable world knowledge encoded in vision-language models (VLMs) pre-trained on Internet-scale data for embodied reinforcement learning (RL).
We show that our approach can use chain-of-thought prompting to produce representations of common-sense semantic reasoning, improving policy performance in novel scenes by 1.5 times.
arXiv Detail & Related papers (2024-02-05T00:48:56Z)
- What Makes Pre-Trained Visual Representations Successful for Robust Manipulation? [57.92924256181857]
We find that visual representations designed for manipulation and control tasks do not necessarily generalize under subtle changes in lighting and scene texture.
We find that emergent segmentation ability is a strong predictor of out-of-distribution generalization among ViT models.
arXiv Detail & Related papers (2023-11-03T18:09:08Z)
- Diffused Redundancy in Pre-trained Representations [98.55546694886819]
We take a closer look at how features are encoded in pre-trained representations.
We find that learned representations in a given layer exhibit a degree of diffuse redundancy.
Our findings shed light on the nature of representations learned by pre-trained deep neural networks.
arXiv Detail & Related papers (2023-05-31T21:00:50Z)
- A Shapelet-based Framework for Unsupervised Multivariate Time Series Representation Learning [29.511632089649552]
We propose a novel unsupervised representation learning (URL) framework for multivariate time series by learning a time-series-specific shapelet-based representation.
To the best of our knowledge, this is the first work to explore shapelet-based embeddings for unsupervised general-purpose representation learning.
A unified shapelet-based encoder and a novel learning objective with multi-grained contrasting and multi-scale alignment are particularly designed to achieve our goal.
arXiv Detail & Related papers (2023-05-30T09:31:57Z)
- CoDo: Contrastive Learning with Downstream Background Invariance for Detection [10.608660802917214]
We propose a novel object-level self-supervised learning method, called Contrastive learning with Downstream background invariance (CoDo).
The pretext task is converted to focus on instance location modeling for various backgrounds, especially for downstream datasets.
Experiments on MSCOCO demonstrate that the proposed CoDo with common backbones, ResNet50-FPN, yields strong transfer learning results for object detection.
arXiv Detail & Related papers (2022-05-10T01:26:15Z)
- The Unsurprising Effectiveness of Pre-Trained Vision Models for Control [33.30717429522186]
We study the role of pre-trained visual representations for control, and in particular representations trained on large-scale computer vision datasets.
We find that pre-trained visual representations can be competitive with, or even better than, ground-truth state representations for training control policies.
arXiv Detail & Related papers (2022-03-07T18:26:14Z)
- Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
By exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery, this is done in a completely label-free manner (a generic contrastive objective of this form is sketched after this list).
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z)
- Reasoning-Modulated Representations [85.08205744191078]
We study a common setting where the task is not purely opaque, i.e., where side information about the underlying system is available.
Our approach paves the way for a new class of data-efficient representation learning.
arXiv Detail & Related papers (2021-07-19T13:57:13Z)
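Several of the related papers above (CoDo and the audiovisual remote-sensing work) build on a contrastive objective between paired views or modalities. The following is a generic, minimal InfoNCE sketch, not code from any of the listed papers; the batch pairing, temperature, and embedding dimensions are illustrative assumptions.

```python
# Generic symmetric InfoNCE loss between paired embeddings (illustrative only).
import torch
import torch.nn.functional as F

def info_nce(z_a, z_b, temperature=0.07):
    """z_a, z_b: (B, D) embeddings of corresponding pairs, e.g. an image crop
    and its augmented view, or an image and its geo-tagged audio clip."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Matching pairs lie on the diagonal; all other entries act as negatives.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random embeddings standing in for two encoders' outputs:
loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```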
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information (including all listed details) and is not responsible for any consequences arising from its use.