ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration
- URL: http://arxiv.org/abs/2502.19250v2
- Date: Fri, 28 Feb 2025 15:17:11 GMT
- Title: ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration
- Authors: Minjie Zhu, Yichen Zhu, Jinming Li, Zhongyi Zhou, Junjie Wen, Xiaoyu Liu, Chaomin Shen, Yaxin Peng, Feifei Feng
- Abstract summary: We present a simple yet effective approach for achieving object generalization through Vision-Language-Action models. Our method provides a lightweight and scalable way to inject knowledge about the target object. We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across 100 novel objects with a 64% success rate.
- Score: 10.558622685760346
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Imitation learning has proven to be highly effective in teaching robots dexterous manipulation skills. However, it typically relies on large amounts of human demonstration data, which limits its scalability and applicability in dynamic, real-world environments. One key challenge in this context is object generalization, where a robot trained to perform a task with one object, such as "hand over the apple," struggles to transfer its skills to a semantically similar but visually different object, such as "hand over the peach." This gap in generalization to new objects beyond those in the same category has yet to be adequately addressed in previous work on end-to-end visuomotor policy learning. In this paper, we present a simple yet effective approach for achieving object generalization through Vision-Language-Action (VLA) models, referred to as ObjectVLA. Our model enables robots to generalize learned skills to novel objects without requiring explicit human demonstrations for each new target object. By leveraging vision-language pair data, our method provides a lightweight and scalable way to inject knowledge about the target object, establishing an implicit link between the object and the desired action. We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across 100 novel objects with a 64% success rate in selecting objects not seen during training. Furthermore, we propose a more accessible method for enhancing object generalization in VLA models, using a smartphone to capture a few images and fine-tune the pre-trained model. These results highlight the effectiveness of our approach in enabling object-level generalization and reducing the need for extensive human demonstrations, paving the way for more flexible and scalable robotic learning systems.
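The abstract describes co-training the policy on robot demonstrations together with cheap vision-language pairs, so that knowledge about a new object is injected without collecting new demonstrations for it. The sketch below is a minimal, illustrative rendering of that recipe, not the paper's implementation: `TinyVLAPolicy`, its localization head, and the random tensors are hypothetical stand-ins, and using a bounding-box objective for the vision-language data is an assumption about one plausible way to form the implicit object-action link.

```python
# Illustrative co-training sketch (hypothetical, not the paper's code):
# one shared vision-language backbone feeds both an action head (trained on
# robot demonstrations) and a localization head (trained on vision-language
# pairs for novel objects), so grounding new object names also shapes the
# features the action head consumes.
import torch
import torch.nn as nn

class TinyVLAPolicy(nn.Module):
    """Toy stand-in for a VLA model with an action head and a localization head."""
    def __init__(self, dim=256, action_dim=7, vocab=1000):
        super().__init__()
        self.vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, dim), nn.ReLU())
        self.text = nn.Embedding(vocab, dim)           # toy instruction encoder
        self.action_head = nn.Linear(dim, action_dim)  # robot action
        self.box_head = nn.Linear(dim, 4)              # referred-object box (x, y, w, h)

    def forward(self, image, tokens):
        h = self.vision(image) + self.text(tokens).mean(dim=1)  # shared fused feature
        return self.action_head(h), self.box_head(h)

policy = TinyVLAPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)

for step in range(100):
    # Robot-demonstration batch: (image, instruction) -> expert action.
    img_r, txt_r = torch.randn(8, 3, 64, 64), torch.randint(0, 1000, (8, 12))
    act_r = torch.randn(8, 7)
    # Vision-language batch for novel objects: (image, instruction) -> bounding box.
    img_v, txt_v = torch.randn(8, 3, 64, 64), torch.randint(0, 1000, (8, 12))
    box_v = torch.rand(8, 4)

    pred_act, _ = policy(img_r, txt_r)   # imitation loss on demonstration data
    _, pred_box = policy(img_v, txt_v)   # grounding loss on vision-language data
    loss = (nn.functional.mse_loss(pred_act, act_r)
            + nn.functional.smooth_l1_loss(pred_box, box_v))

    opt.zero_grad()
    loss.backward()
    opt.step()
```

The point of the mixed objective is that both heads share the same backbone: supervising object grounding on inexpensive image-text data moves the representation that the action head reads, which is one way to realize the "implicit link between the object and the desired action" without demonstrations of that object.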
Related papers
- FLEX: A Framework for Learning Robot-Agnostic Force-based Skills Involving Sustained Contact Object Manipulation [9.292150395779332]
We propose a novel framework for learning object-centric manipulation policies in force space.
Our method simplifies the action space, reduces unnecessary exploration, and decreases simulation overhead.
Our evaluations demonstrate that the method significantly outperforms baselines.
arXiv Detail & Related papers (2025-03-17T17:49:47Z) - Disentangled Object-Centric Image Representation for Robotic Manipulation [6.775909411692767]
We propose DOCIR, an object-centric framework that introduces a disentangled representation for objects of interest, obstacles, and robot embodiment.
We show that this approach leads to state-of-the-art performance for learning pick and place skills from visual inputs in multi-object environments.
arXiv Detail & Related papers (2025-03-14T16:33:48Z) - A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear.
We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck.
Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z) - Keypoint Abstraction using Large Models for Object-Relative Imitation Learning [78.92043196054071]
Generalization to novel object configurations and instances across diverse tasks and environments is a critical challenge in robotics.
Keypoint-based representations have proven effective as a succinct way to capture essential object features.
We propose KALM, a framework that leverages large pre-trained vision-language models to automatically generate task-relevant and cross-instance consistent keypoints.
arXiv Detail & Related papers (2024-10-30T17:37:31Z) - Latent Action Pretraining from Videos [156.88613023078778]
We introduce Latent Action Pretraining for general Action models (LAPA).
LAPA is an unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.
We propose a method to learn from internet-scale videos that do not have robot action labels.
arXiv Detail & Related papers (2024-10-15T16:28:09Z) - Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs [53.66070434419739]
Generalizable articulated object manipulation is essential for home-assistant robots.
We propose a kinematic-aware prompting framework that prompts Large Language Models with kinematic knowledge of objects to generate low-level motion waypoints.
Our framework outperforms traditional methods on 8 seen categories and shows powerful zero-shot capability on 8 unseen articulated object categories.
arXiv Detail & Related papers (2023-11-06T03:26:41Z) - Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z) - Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition [6.282068591820947]
We present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem.
To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly.
We have performed extensive sets of experiments to assess the performance of the proposed approach in both offline and open-ended scenarios.
arXiv Detail & Related papers (2022-05-04T10:29:10Z) - Generalization in Dexterous Manipulation via Geometry-Aware Multi-Task Learning [108.08083976908195]
We show that policies learned by existing reinforcement learning algorithms can in fact be generalist.
We show that a single generalist policy can perform in-hand manipulation of over 100 geometrically-diverse real-world objects.
Interestingly, we find that multi-task learning with object point cloud representations not only generalizes better but even outperforms single-object specialist policies.
arXiv Detail & Related papers (2021-11-04T17:59:56Z) - ManiSkill: Learning-from-Demonstrations Benchmark for Generalizable Manipulation Skills [27.214053107733186]
We propose SAPIEN Manipulation Skill Benchmark (abbreviated as ManiSkill) for learning generalizable object manipulation skills.
ManiSkill supports object-level variations by utilizing a rich and diverse set of articulated objects.
ManiSkill can encourage the robot learning community to further explore learning generalizable object manipulation skills.
arXiv Detail & Related papers (2021-07-30T08:20:22Z) - Attribute-Based Robotic Grasping with One-Grasp Adaptation [9.255994599301712]
We introduce an end-to-end learning method of attribute-based robotic grasping with one-grasp adaptation capability.
Our approach fuses the embeddings of a workspace image and a query text using a gated-attention mechanism and learns to predict instance grasping affordances (see the sketch after this list).
Experimental results in both simulation and the real world demonstrate that our approach achieves over 80% instance grasping success rate on unknown objects.
arXiv Detail & Related papers (2021-04-06T03:40:46Z)
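The gated-attention fusion mentioned in the attribute-based grasping entry above can be sketched compactly. The module below is illustrative only: the text embedding, channel sizes, and per-pixel affordance head are assumptions for the sake of a runnable example, not the authors' architecture.

```python
# Illustrative gated-attention fusion (hypothetical architecture): a query-text
# embedding produces per-channel gates that modulate the convolutional features
# of the workspace image before a per-pixel grasp-affordance prediction.
import torch
import torch.nn as nn

class GatedAttentionFusion(nn.Module):
    def __init__(self, text_dim=128, channels=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.gate = nn.Sequential(nn.Linear(text_dim, channels), nn.Sigmoid())
        self.affordance_head = nn.Conv2d(channels, 1, kernel_size=1)  # per-pixel grasp score

    def forward(self, image, text_emb):
        feat = self.conv(image)                               # (B, C, H, W)
        g = self.gate(text_emb).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1) channel gates
        return self.affordance_head(feat * g)                 # gated features -> affordance map

fusion = GatedAttentionFusion()
image = torch.randn(2, 3, 128, 128)   # workspace images
text_emb = torch.randn(2, 128)        # embedding of a query such as "the red mug"
affordance = fusion(image, text_emb)
print(affordance.shape)               # torch.Size([2, 1, 64, 64])
```

The sigmoid gates let the language query suppress feature channels irrelevant to the described attributes, which is the basic mechanism behind gated-attention conditioning.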
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.