A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning
- URL: http://arxiv.org/abs/2503.06960v2
- Date: Sun, 23 Mar 2025 08:34:06 GMT
- Title: A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning
- Authors: Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, Xiaojuan Qi
- Abstract summary: Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck. Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
- Score: 67.72413262980272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. Through systematic evaluation, we find that while DINO and iBOT outperform MAE across visuomotor control and perception tasks, they struggle when trained on non-(single-)object-centric (NOC) data--a limitation strongly correlated with their diminished ability to learn object-centric representations. This investigation indicates that the ability to form object-centric representations from non-object-centric robotics datasets is key to the success of PVMs. Motivated by this finding, we design SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck that reduces the number of prototypes to encourage the emergence of objectness, together with cross-view consistency regularization that encourages multi-view invariance. Our experiments encompass pre-training on object-centric, scene-centric, web-crawled, and ego-centric data. Across all settings, our approach learns transferable representations and achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations. When scaled up to million-scale datasets, our method also demonstrates superior data efficiency and scalability. Our code and models are publicly available at https://github.com/CVMI-Lab/SlotMIM.
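The abstract names two ingredients: a prototype-based semantic bottleneck and a cross-view consistency regularizer. The minimal PyTorch-style sketch below illustrates that combination only; it is not the authors' implementation (see the linked repository for that), and the module names, feature sizes, and the simple histogram-matching loss are illustrative assumptions.

```python
# Minimal sketch (not the SlotMIM code): a small set of learnable prototypes acts as a
# semantic bottleneck over patch features, and two augmented views are regularized to
# produce consistent prototype assignments. All names and sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeBottleneck(nn.Module):
    """Assign patch features to a deliberately small set of prototypes."""
    def __init__(self, feat_dim=256, num_prototypes=64, temperature=0.1):
        super().__init__()
        # Keeping num_prototypes small forces coarse, semantic grouping of patches.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))
        self.temperature = temperature

    def forward(self, patch_feats):
        # patch_feats: (B, N, D) patch tokens from some ViT backbone (not shown here).
        z = F.normalize(patch_feats, dim=-1)
        c = F.normalize(self.prototypes, dim=-1)
        logits = z @ c.t() / self.temperature   # (B, N, K) patch-to-prototype similarity
        return logits.softmax(dim=-1)           # soft assignments

def cross_view_consistency(assign_v1, assign_v2):
    # Encourage two augmented views of the same image to agree on which prototypes
    # are present. Matching image-level prototype histograms is a simplified
    # stand-in for the paper's regularizer.
    return F.mse_loss(assign_v1.mean(dim=1), assign_v2.mean(dim=1))

if __name__ == "__main__":
    bottleneck = PrototypeBottleneck()
    feats_v1 = torch.randn(4, 196, 256)  # placeholder patch features, view 1
    feats_v2 = torch.randn(4, 196, 256)  # placeholder patch features, view 2
    loss = cross_view_consistency(bottleneck(feats_v1), bottleneck(feats_v2))
    loss.backward()
```

In this reading, fewer prototypes mean each one must cover a broader semantic region (encouraging objectness), while the consistency term discourages assignments that depend on the particular crop or augmentation.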
Related papers
- ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration [10.558622685760346]
We present a simple yet effective approach for achieving object generalization through Vision-Language-Action models. Our method provides a lightweight and scalable way to inject knowledge about the target object. We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across 100 novel objects with a 64% success rate.
arXiv Detail & Related papers (2025-02-26T15:56:36Z)
- Escaping The Big Data Paradigm in Self-Supervised Representation Learning [2.10796947080293]
We introduce SCOTT, a shallow tokenization architecture compatible with Masked Image Modeling tasks. SCOTT injects convolutional inductive biases into Vision Transformers (ViTs), enhancing their efficacy in small-scale data regimes. We validate our method on three small-size, standard-resolution, fine-grained datasets: Oxford Flowers-102, Oxford IIIT Pets-37, and ImageNet-100.
arXiv Detail & Related papers (2025-02-25T10:21:49Z)
- Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way of capturing essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Heuristic Vision Pre-Training with Self-Supervised and Supervised Multi-Task Learning [0.0]
We propose a novel pre-training framework that adopts both self-supervised and supervised visual pretext tasks in a multi-task manner.
Results show that our pre-trained models can deliver results on par with or better than state-of-the-art (SOTA) results on multiple visual tasks.
arXiv Detail & Related papers (2023-10-11T14:06:04Z)
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP).
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
- Bridging the Gap to Real-World Object-Centric Learning [66.55867830853803]
We show that reconstructing features from models trained in a self-supervised manner is a sufficient training signal for object-centric representations to arise in a fully unsupervised way.
Our approach, DINOSAUR, significantly outperforms existing object-centric learning models on simulated data.
arXiv Detail & Related papers (2022-09-29T15:24:47Z)
- MetaGraspNet: A Large-Scale Benchmark Dataset for Vision-driven Robotic Grasping via Physics-based Metaverse Synthesis [78.26022688167133]
We present a large-scale benchmark dataset for vision-driven robotic grasping via physics-based metaverse synthesis.
The proposed dataset contains 100,000 images and 25 different object types.
We also propose a new layout-weighted performance metric alongside the dataset for evaluating object detection and segmentation performance.
arXiv Detail & Related papers (2021-12-29T17:23:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.