Recasting Generic Pretrained Vision Transformers As Object-Centric Scene Encoders For Manipulation Policies
- URL: http://arxiv.org/abs/2405.15916v1
- Date: Fri, 24 May 2024 20:20:15 GMT
- Title: Recasting Generic Pretrained Vision Transformers As Object-Centric Scene Encoders For Manipulation Policies
- Authors: Jianing Qian, Anastasios Panagopoulos, Dinesh Jayaraman
- Abstract summary: SOFT is a wrapper around pre-trained vision transformer (PVT) models.
Rather than construct representations out of only the final layer activations, SOFT individuates and locates object-like entities from PVT attentions.
Across standard choices of pre-trained vision transformer, we demonstrate that policies trained on SOFT far outstrip standard PVT representations for manipulation tasks in simulated and real settings.
- Score: 23.378072284295005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generic re-usable pre-trained image representation encoders have become a standard component of methods for many computer vision tasks. As visual representations for robots however, their utility has been limited, leading to a recent wave of efforts to pre-train robotics-specific image encoders that are better suited to robotic tasks than their generic counterparts. We propose Scene Objects From Transformers, abbreviated as SOFT, a wrapper around pre-trained vision transformer (PVT) models that bridges this gap without any further training. Rather than construct representations out of only the final layer activations, SOFT individuates and locates object-like entities from PVT attentions, and describes them with PVT activations, producing an object-centric embedding. Across standard choices of generic pre-trained vision transformers PVT, we demonstrate in each case that policies trained on SOFT(PVT) far outstrip standard PVT representations for manipulation tasks in simulated and real settings, approaching the state-of-the-art robotics-aware representations. Code, appendix and videos: https://sites.google.com/view/robot-soft/
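A minimal sketch of the idea as described in the abstract (not the authors' released code): object-like entities are read off a frozen PVT's attentions and described with its activations. The grouping rule, tensor shapes, and the `object_centric_embedding` helper below are illustrative assumptions.

```python
import torch

def object_centric_embedding(patch_feats, cls_attn, num_slots=None):
    """patch_feats: (P, D) final-layer patch activations from a frozen PVT.
    cls_attn: (H, P) attention from the CLS token to patches, one row per head.
    Returns (K, D+2): per-entity pooled features plus a coarse 2-D location."""
    H, P = cls_attn.shape
    K = num_slots or H
    assign = cls_attn[:K].argmax(dim=0)          # crude "which entity owns this patch"
    side = int(P ** 0.5)
    ys, xs = torch.meshgrid(torch.arange(side), torch.arange(side), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()   # (P, 2)
    slots = []
    for k in range(K):
        mask = assign == k
        if mask.any():
            feat = patch_feats[mask].mean(dim=0)          # "describe" with activations
            loc = coords[mask].mean(dim=0) / side         # "locate" with a centroid
        else:
            feat = patch_feats.new_zeros(patch_feats.shape[1])
            loc = coords.new_zeros(2)
        slots.append(torch.cat([feat, loc]))
    return torch.stack(slots)                             # (K, D+2) object-centric state

# Toy usage with random tensors standing in for a real pretrained ViT's outputs.
emb = object_centric_embedding(torch.randn(196, 768), torch.rand(12, 196).softmax(dim=-1))
print(emb.shape)  # torch.Size([12, 770]), fed to a downstream manipulation policy
```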
Related papers
- VAT: Vision Action Transformer by Unlocking Full Representation of ViT [10.192713461564606]
Vision Transformers (ViTs) are standard for visual perception, yet most methods discard valuable information by using only the final layer's features. We argue this provides an insufficient representation and propose the Vision Action Transformer (VAT). VAT processes specialized action tokens with visual features across all transformer layers, enabling a deep and progressive fusion of perception and action generation.
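A hedged sketch of the mechanism summarised above, under the assumption that "action tokens" are learnable tokens appended to the visual sequence and carried through every layer; the class name, sizes, and readout head are illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ActionTokenViT(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8, num_action_tokens=7):
        super().__init__()
        self.action_tokens = nn.Parameter(torch.randn(1, num_action_tokens, dim) * 0.02)
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim, batch_first=True)
            for _ in range(depth)
        )
        self.head = nn.Linear(dim, 1)   # e.g. one scalar target per action token

    def forward(self, patch_tokens):                      # (B, P, dim) visual tokens
        B = patch_tokens.shape[0]
        x = torch.cat([self.action_tokens.expand(B, -1, -1), patch_tokens], dim=1)
        for layer in self.layers:                         # perception/action fused at every layer
            x = layer(x)
        n = self.action_tokens.shape[1]
        return self.head(x[:, :n]).squeeze(-1)            # (B, num_action_tokens)

actions = ActionTokenViT()(torch.randn(2, 196, 256))
print(actions.shape)  # torch.Size([2, 7])
```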
arXiv Detail & Related papers (2025-12-03T10:50:40Z)
- Composing Pre-Trained Object-Centric Representations for Robotics From "What" and "Where" Foundation Models [27.381128884213812]
We propose a new framework for building pre-trained object-centric representations for robotic control.
We use segmentations from a pre-trained model to stably locate various entities in the scene across timesteps, capturing "where" information.
On various simulated and real robotic tasks, we show that imitation policies for robotic manipulators trained on POCR achieve better performance.
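A rough sketch of the "what"/"where" composition this summary describes, with random tensors standing in for the actual foundation models; the masked-pooling rule and the compose_what_where helper are assumptions, not the paper's exact construction.

```python
import torch

def compose_what_where(feature_map, masks):
    """feature_map: (D, H, W) map from a frozen "what" encoder (stand-in here).
    masks: (K, H, W) binary entity masks from a frozen "where" segmenter.
    Returns (K, D+2): per-entity pooled features plus normalised mask centroids."""
    D, H, W = feature_map.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    entities = []
    for m in masks.float():
        area = m.sum().clamp(min=1.0)
        what = (feature_map * m).sum(dim=(1, 2)) / area                  # masked average pooling
        where = torch.stack([(ys * m).sum(), (xs * m).sum()]) / area     # mask centroid
        entities.append(torch.cat([what, where / max(H, W)]))
    return torch.stack(entities)

# Toy call: random features and masks stand in for the pretrained models' outputs.
state = compose_what_where(torch.randn(384, 16, 16), torch.rand(5, 16, 16) > 0.7)
print(state.shape)  # torch.Size([5, 386]), one row per tracked entity per timestep
```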
arXiv Detail & Related papers (2024-04-20T21:51:15Z)
- PriViT: Vision Transformers for Fast Private Inference [55.36478271911595]
The Vision Transformer (ViT) architecture has emerged as the backbone of choice for state-of-the-art deep models in computer vision applications.
ViTs are ill-suited for private inference using secure multi-party protocols, due to their large number of non-polynomial operations.
We propose PriViT, an algorithm to selectively "Taylorize" nonlinearities in ViTs while maintaining their prediction accuracy.
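As an illustration of what "Taylorizing" a nonlinearity can look like (a generic sketch, not PriViT's learned selection procedure), the snippet below swaps GELU for its low-degree Taylor polynomial around 0, leaving only additions and multiplications for a secure multi-party protocol to evaluate.

```python
import math
import torch
import torch.nn as nn

class TaylorGELU(nn.Module):
    """GELU(x) = x * Phi(x) expanded around 0: 0.5*x + x**2/sqrt(2*pi) - x**4/(6*sqrt(2*pi))."""
    def forward(self, x):
        c = 1.0 / math.sqrt(2.0 * math.pi)
        return 0.5 * x + c * x ** 2 - (c / 6.0) * x ** 4

x = torch.linspace(-1.5, 1.5, 7)
print(torch.nn.functional.gelu(x))  # exact, non-polynomial activation
print(TaylorGELU()(x))              # polynomial stand-in: close near 0, drifts for large |x|
```

In a PriViT-style pipeline one would presumably swap such polynomial modules into selected positions of a pretrained ViT and fine-tune to recover accuracy; the selection criterion itself is the paper's contribution.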
arXiv Detail & Related papers (2023-10-06T21:45:05Z)
- Dynamic Grained Encoder for Vision Transformers [150.02797954201424]
This paper introduces sparse queries for vision transformers to exploit the intrinsic spatial redundancy of natural images.
We propose a Dynamic Grained Encoder for vision transformers, which can adaptively assign a suitable number of queries to each spatial region.
Our encoder allows state-of-the-art vision transformers to reduce computational complexity by 40%-60% while maintaining comparable performance on image classification.
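A hand-rolled illustration of the dynamic-granularity idea in the summary above: flat regions contribute one pooled query, busy regions keep all their patch queries. The paper learns this gate end-to-end; the variance threshold below is only a stand-in.

```python
import torch

def dynamic_queries(feat, region=4, thresh=1.0):
    """feat: (H, W, D) patch features. Returns a variable-length (N, D) query set."""
    H, W, D = feat.shape
    queries = []
    for i in range(0, H, region):
        for j in range(0, W, region):
            block = feat[i:i + region, j:j + region].reshape(-1, D)
            if block.var(dim=0).mean() > thresh:           # "busy" region: keep fine queries
                queries.append(block)
            else:                                           # "flat" region: one coarse query
                queries.append(block.mean(dim=0, keepdim=True))
    return torch.cat(queries)

q = dynamic_queries(torch.randn(16, 16, 256))
print(q.shape)  # between (16, 256) and (256, 256) queries, depending on the input
```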
arXiv Detail & Related papers (2023-01-10T07:55:29Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT).
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9x speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Aggregated Pyramid Vision Transformer: Split-transform-merge Strategy for Image Recognition without Convolutions [1.1032962642000486]
This work builds on the Vision Transformer, combining it with a pyramid architecture and a split-transform-merge strategy to propose a group encoder, and names the resulting network the Aggregated Pyramid Vision Transformer (APVT).
We perform image classification tasks on the CIFAR-10 dataset and object detection tasks on the COCO 2017 dataset.
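A loose sketch of a split-transform-merge group encoder in the spirit of the summary above: tokens are split along channels into groups, each group runs through its own small encoder, and the outputs are merged. Group count, sizes, and the omitted pyramid stages are simplifications, not the APVT configuration.

```python
import torch
import torch.nn as nn

class GroupEncoder(nn.Module):
    def __init__(self, dim=256, groups=4, heads=2):
        super().__init__()
        assert dim % groups == 0
        self.groups = groups
        self.branches = nn.ModuleList(
            nn.TransformerEncoderLayer(dim // groups, heads,
                                       dim_feedforward=dim, batch_first=True)
            for _ in range(groups)
        )

    def forward(self, x):                                     # (B, N, dim) tokens
        chunks = x.chunk(self.groups, dim=-1)                 # split
        outs = [b(c) for b, c in zip(self.branches, chunks)]  # transform per group
        return torch.cat(outs, dim=-1)                        # merge

y = GroupEncoder()(torch.randn(2, 64, 256))
print(y.shape)  # torch.Size([2, 64, 256])
```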
arXiv Detail & Related papers (2022-03-02T09:14:28Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
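For readers unfamiliar with the fusion terminology, the toy snippet below contrasts the two styles under assumed, simplified definitions: "merged attention" self-attends over the concatenated text and image tokens, while "co-attention" keeps two streams that cross-attend.

```python
import torch
import torch.nn as nn

dim, heads = 256, 4
txt, img = torch.randn(2, 20, dim), torch.randn(2, 196, dim)

# Merged attention: a single self-attention over the concatenated sequence.
merged_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
seq = torch.cat([txt, img], dim=1)
merged, _ = merged_attn(seq, seq, seq)

# Co-attention: each modality queries the other in a separate cross-attention.
cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
txt2img, _ = cross_attn(txt, img, img)   # text queries, image keys/values
img2txt, _ = cross_attn(img, txt, txt)   # image queries, text keys/values
print(merged.shape, txt2img.shape, img2txt.shape)
```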
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
- Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
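A simplified sketch of the progressive sampling loop described above: sampling points start on a regular grid and are repeatedly shifted by predicted offsets so they can drift toward discriminative regions. The untrained offset predictor and step size are placeholders; PS-ViT interleaves this with transformer layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

feat = torch.randn(1, 64, 32, 32)                   # feature map (B, C, H, W)
pts = torch.stack(torch.meshgrid(torch.linspace(-1, 1, 7),
                                 torch.linspace(-1, 1, 7), indexing="ij"), dim=-1)
pts = pts.reshape(1, 1, -1, 2)                      # (B, 1, N, 2) in [-1, 1] grid coords
offset_net = nn.Linear(64, 2)                       # predicts how far to move each point

for _ in range(4):                                  # progressive refinement iterations
    sampled = F.grid_sample(feat, pts, align_corners=True)    # (B, C, 1, N)
    tokens = sampled.squeeze(2).transpose(1, 2)                # (B, N, C)
    pts = (pts + 0.1 * torch.tanh(offset_net(tokens)).unsqueeze(1)).clamp(-1, 1)

print(tokens.shape)  # (1, 49, 64) tokens gathered at the final sampling locations
```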
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
- Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z)