Disentangled Object-Centric Image Representation for Robotic Manipulation
- URL: http://arxiv.org/abs/2503.11565v1
- Date: Fri, 14 Mar 2025 16:33:48 GMT
- Title: Disentangled Object-Centric Image Representation for Robotic Manipulation
- Authors: David Emukpere, Romain Deffayet, Bingbing Wu, Romain Brégier, Michael Niemaz, Jean-Luc Meunier, Denys Proux, Jean-Michel Renders, Seungsu Kim
- Abstract summary: We propose DOCIR, an object-centric framework that introduces a disentangled representation for objects of interest, obstacles, and robot embodiment. We show that this approach leads to state-of-the-art performance for learning pick and place skills from visual inputs in multi-object environments.
- Score: 6.775909411692767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning robotic manipulation skills from vision is a promising approach for developing robotics applications that can generalize broadly to real-world scenarios. As such, many approaches to enable this vision have been explored with fruitful results. Particularly, object-centric representation methods have been shown to provide better inductive biases for skill learning, leading to improved performance and generalization. Nonetheless, we show that object-centric methods can struggle to learn simple manipulation skills in multi-object environments. Thus, we propose DOCIR, an object-centric framework that introduces a disentangled representation for objects of interest, obstacles, and robot embodiment. We show that this approach leads to state-of-the-art performance for learning pick and place skills from visual inputs in multi-object environments and generalizes at test time to changing objects of interest and distractors in the scene. Furthermore, we show its efficacy both in simulation and zero-shot transfer to the real world.
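The abstract describes the representation only at a high level: separate encodings for the objects of interest, the obstacles, and the robot embodiment, consumed by a pick-and-place policy. The snippet below is a minimal sketch of how such a disentangled observation encoder might be wired up, assuming per-entity segmentation masks are available from upstream perception; all class, layer, and argument names are hypothetical illustrations rather than the authors' code.

```python
# Hypothetical sketch of a disentangled object-centric observation encoder.
# Assumes per-entity segmentation masks come from an upstream perception module;
# none of these names are taken from the DOCIR paper.
import torch
import torch.nn as nn


class DisentangledObsEncoder(nn.Module):
    """Encodes the object of interest, obstacles, and robot embodiment separately."""

    def __init__(self, feat_dim: int = 64):
        super().__init__()

        def make_cnn():
            # One small CNN per entity stream (target / obstacles / robot).
            return nn.Sequential(
                nn.Conv2d(4, 32, 5, stride=2), nn.ReLU(),   # RGB + mask channel
                nn.Conv2d(32, 64, 3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim),
            )

        self.target_enc = make_cnn()
        self.obstacle_enc = make_cnn()
        self.robot_enc = make_cnn()

    def forward(self, rgb, target_mask, obstacle_mask, robot_mask):
        # Each stream only sees the image region of its own entity, keeping the
        # three factors (what to grasp, what to avoid, the embodiment) separate.
        def stream(mask):
            return torch.cat([rgb * mask, mask], dim=1)

        z_target = self.target_enc(stream(target_mask))
        z_obstacle = self.obstacle_enc(stream(obstacle_mask))
        z_robot = self.robot_enc(stream(robot_mask))
        return torch.cat([z_target, z_obstacle, z_robot], dim=-1)  # policy input
```

The concatenated embedding would then feed a standard policy head (RL or behavior cloning). The design choice the abstract emphasizes is keeping the three entity streams disentangled rather than pooling the whole scene into a single vector, which plausibly underlies the reported test-time generalization to changed objects of interest and distractors.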
Related papers
- A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning [67.72413262980272]
Pre-trained vision models (PVMs) are fundamental to modern robotics, yet their optimal configuration remains unclear. We develop SlotMIM, a method that induces object-centric representations by introducing a semantic bottleneck. Our approach achieves significant improvements over prior work in image recognition, scene understanding, and robot learning evaluations.
arXiv Detail & Related papers (2025-03-10T06:18:31Z)
- ObjectVLA: End-to-End Open-World Object Manipulation Without Demonstration [10.558622685760346]
We present a simple yet effective approach for achieving object generalization through Vision-Language-Action models. Our method provides a lightweight and scalable way to inject knowledge about the target object. We evaluate ObjectVLA on a real robotic platform, demonstrating its ability to generalize across 100 novel objects with a 64% success rate.
arXiv Detail & Related papers (2025-02-26T15:56:36Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Self-Explainable Affordance Learning with Embodied Caption [63.88435741872204]
We introduce Self-Explainable Affordance learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning.
We propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z)
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to efficiently explore and interact with the environments in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z)
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
- Robust and Controllable Object-Centric Learning through Energy-based Models [95.68748828339059]
Ours is a conceptually simple and general approach to learning object-centric representations through an energy-based model.
We show that ours can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations.
arXiv Detail & Related papers (2022-10-11T15:11:15Z)
- Self-Supervised Learning of Multi-Object Keypoints for Robotic Manipulation [8.939008609565368]
In this paper, we demonstrate the efficacy of learning image keypoints via the Dense Correspondence pretext task for downstream policy learning.
We evaluate our approach on diverse robot manipulation tasks, compare it to other visual representation learning approaches, and demonstrate its flexibility and effectiveness for sample-efficient policy learning.
arXiv Detail & Related papers (2022-05-17T13:15:07Z)
- ManiSkill: Learning-from-Demonstrations Benchmark for Generalizable Manipulation Skills [27.214053107733186]
We propose SAPIEN Manipulation Skill Benchmark (abbreviated as ManiSkill) for learning generalizable object manipulation skills.
ManiSkill supports object-level variations by utilizing a rich and diverse set of articulated objects.
ManiSkill can encourage the robot learning community to further explore learning generalizable object manipulation skills.
arXiv Detail & Related papers (2021-07-30T08:20:22Z)
- Unadversarial Examples: Designing Objects for Robust Vision [100.4627585672469]
We develop a framework that exploits the sensitivity of modern machine learning algorithms to input perturbations in order to design "robust objects".
We demonstrate the efficacy of the framework on a wide variety of vision-based tasks ranging from standard benchmarks to (in-simulation) robotics.
arXiv Detail & Related papers (2020-12-22T18:26:07Z)
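As context for the "Unadversarial Examples" entry above, the core idea it summarizes is the reverse of an adversarial attack: optimize an object's texture by gradient ascent on the correct-class score so that standard models recognize it more reliably. The sketch below illustrates that generic optimization loop under assumed details (classifier choice, patch placement, random images as stand-ins for scene renders); it is not the authors' implementation.

```python
# Illustrative sketch of "unadversarial" texture design: optimize a patch so a
# pretrained classifier becomes MORE confident in the intended class.
# Classifier, patch placement, and data here are assumptions, not the paper's setup.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)

patch = torch.zeros(1, 3, 64, 64, requires_grad=True)   # learnable texture
opt = torch.optim.Adam([patch], lr=0.05)
target_class = 511                                       # hypothetical label


def paste(images, patch):
    # Place the (sigmoid-squashed) patch in the image corner; a real setup
    # would composite it onto the object under varying poses and lighting.
    images = images.clone()
    images[:, :, :64, :64] = torch.sigmoid(patch)
    return images


for step in range(100):
    images = torch.rand(8, 3, 224, 224)                  # stand-in for renders
    logits = model(paste(images, patch))
    loss = F.cross_entropy(logits, torch.full((8,), target_class))
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Minimizing the cross-entropy toward the intended class is gradient ascent on that class's score, so the optimized patch becomes a texture the model finds easy to classify; as the summary above notes, the paper evaluates this idea on standard benchmarks and in-simulation robotics.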
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.