Is an object-centric representation beneficial for robotic manipulation?
- URL: http://arxiv.org/abs/2506.19408v1
- Date: Tue, 24 Jun 2025 08:23:55 GMT
- Title: Is an object-centric representation beneficial for robotic manipulation?
- Authors: Alexandre Chapin, Emmanuel Dellandrea, Liming Chen
- Abstract summary: Object-centric representation (OCR) has recently become a subject of interest in the computer vision community for learning a structured representation of images and videos. We evaluate one classical object-centric method across several generalization scenarios and compare its results against several state-of-the-art holistic representations. Our results show that existing methods are prone to failure in difficult scenarios involving complex scene structures, whereas object-centric methods help overcome these challenges.
- Score: 45.75998994869714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object-centric representation (OCR) has recently become a subject of interest in the computer vision community for learning a structured representation of images and videos. It has repeatedly been presented as a potential way to improve the data efficiency and generalization capabilities of agents learned on downstream tasks. However, most existing work evaluates such models only on scene decomposition, without any notion of reasoning over the learned representation. Robotic manipulation tasks generally involve multi-object environments with potential inter-object interactions. We thus argue that they are a particularly interesting playground in which to truly evaluate the potential of existing object-centric work. To do so, we create several robotic manipulation tasks in simulated environments involving multiple objects (several distractors, the robot, etc.) and a high level of randomization (object positions, colors, shapes, background, initial positions, etc.). We then evaluate one classical object-centric method across several generalization scenarios and compare its results against several state-of-the-art holistic representations. Our results show that existing methods are prone to failure in difficult scenarios involving complex scene structures, whereas object-centric methods help overcome these challenges.
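The abstract does not spell out the evaluated architecture, but "classical" object-centric encoders of this kind are typically built on Slot Attention. As a concrete reference point, here is a minimal PyTorch sketch of that iterative slot update; the slot count, dimensions, and the omission of the original method's residual MLP are illustrative choices, not the paper's actual model.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal Slot Attention (Locatello et al., 2020): slots iteratively
    compete for input features, yielding one latent per putative object."""
    def __init__(self, num_slots=7, dim=64, iters=3):
        super().__init__()
        self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, x):                       # x: (B, N, dim) image features
        b = x.shape[0]
        x = self.norm_in(x)
        k, v = self.to_k(x), self.to_v(x)
        slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
            b, self.num_slots, x.shape[-1], device=x.device)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # softmax over the slot axis: slots compete for each location
            attn = torch.softmax(q @ k.transpose(1, 2) * self.scale, dim=1)
            attn = attn / attn.sum(dim=-1, keepdim=True)  # per-slot weights
            updates = attn @ v                            # (B, S, dim)
            slots = self.gru(updates.reshape(-1, updates.shape[-1]),
                             slots.reshape(-1, slots.shape[-1])).view_as(slots)
        return slots        # one vector per slot, fed to a downstream policy

feats = torch.randn(2, 196, 64)   # e.g. a flattened 14x14 CNN feature map
slots = SlotAttention()(feats)    # (2, 7, 64)
```

The softmax over the slot axis is what makes slots compete for pixels; that decomposition of the scene into per-object latents is the mechanism the abstract credits for robustness in cluttered, randomized scenes.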
Related papers
- Disentangled Object-Centric Image Representation for Robotic Manipulation [6.775909411692767]
We propose DOCIR, an object-centric framework that introduces a disentangled representation for objects of interest, obstacles, and robot embodiment. We show that this approach leads to state-of-the-art performance for learning pick and place skills from visual inputs in multi-object environments.
arXiv Detail & Related papers (2025-03-14T16:33:48Z)
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Kinematic-aware Prompting for Generalizable Articulated Object Manipulation with LLMs [53.66070434419739]
Generalizable articulated object manipulation is essential for home-assistant robots.
We propose a kinematic-aware prompting framework that prompts Large Language Models with kinematic knowledge of objects to generate low-level motion waypoints.
Our framework outperforms traditional methods on 8 seen categories and shows powerful zero-shot capability on 8 unseen articulated object categories.
arXiv Detail & Related papers (2023-11-06T03:26:41Z)
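The summary above does not include the prompt format, so the following is a hypothetical sketch of the idea: serialize an object's kinematic structure into the prompt and parse low-level waypoints from the reply. `build_kinematic_prompt`, `parse_waypoints`, and the commented-out `query_llm` call are illustrative names, not the paper's API.

```python
import json

def build_kinematic_prompt(obj_name, joints, task):
    """Serialize an object's kinematic description into an LLM prompt
    (hypothetical format; the paper's actual template may differ)."""
    joint_desc = "\n".join(
        f"- {j['name']}: {j['type']} joint, axis={j['axis']}, range={j['range']}"
        for j in joints)
    return (f"Object: {obj_name}\nKinematic structure:\n{joint_desc}\n"
            f"Task: {task}\n"
            "Output a JSON list of 3D end-effector waypoints [x, y, z] "
            "that respects the joint constraints above.")

def parse_waypoints(reply: str):
    """Parse a JSON reply such as '[[0.1, 0.0, 0.3], ...]' into waypoints."""
    return [tuple(p) for p in json.loads(reply)]

prompt = build_kinematic_prompt(
    "cabinet",
    [{"name": "door_hinge", "type": "revolute",
      "axis": [0, 0, 1], "range": [0.0, 1.57]}],
    "open the cabinet door")
# reply = query_llm(prompt)   # placeholder for any chat-completion API
# waypoints = parse_waypoints(reply)
```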
- Transferring Foundation Models for Generalizable Robotic Manipulation [82.12754319808197]
We propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models. Our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning. Demos can be found in our submitted video, and more comprehensive ones can be found in link1 or link2.
arXiv Detail & Related papers (2023-06-09T07:22:12Z)
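As a rough, assumption-laden illustration of how a segmentation mask can bootstrap object-pose perception, the sketch below recovers a 2D position and orientation from a binary mask via image moments; the paper's actual foundation-model pipeline is not published in this summary.

```python
import numpy as np

def mask_to_pose2d(mask: np.ndarray):
    """Estimate (cx, cy, theta) of an object from its binary mask using
    image moments: centroid plus orientation of the principal axis."""
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()
    # second-order central moments
    mu20 = ((xs - cx) ** 2).mean()
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    theta = 0.5 * np.arctan2(2 * mu11, mu20 - mu02)
    return cx, cy, theta

mask = np.zeros((64, 64), dtype=bool)
mask[20:30, 10:50] = True             # a wide rectangle
print(mask_to_pose2d(mask))           # centroid near (29.5, 24.5), theta ~ 0
```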
- Object Scene Representation Transformer [56.40544849442227]
We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis.
OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods.
It is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder.
arXiv Detail & Related papers (2022-06-14T15:40:47Z)
- Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations [25.33452947179541]
We show the effectiveness of object-aware representation learning techniques for robotic tasks.
Our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object-agnostic techniques.
arXiv Detail & Related papers (2022-05-12T19:48:11Z)
- Lifelong Ensemble Learning based on Multiple Representations for Few-Shot Object Recognition [6.282068591820947]
We present a lifelong ensemble learning approach based on multiple representations to address the few-shot object recognition problem.
To facilitate lifelong learning, each approach is equipped with a memory unit for storing and retrieving object information instantly.
We have performed extensive sets of experiments to assess the performance of the proposed approach in offline and open-ended scenarios.
arXiv Detail & Related papers (2022-05-04T10:29:10Z)
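The memory-unit idea above lends itself to a toy sketch: each representation keeps labeled feature vectors and answers queries by nearest neighbor, with a majority vote across representations. Everything below, from the class names to the two toy feature extractors, is illustrative rather than the paper's implementation.

```python
import numpy as np
from collections import Counter

class MemoryUnit:
    """Stores (feature, label) pairs for one representation and answers
    queries by nearest-neighbor lookup: instant storage and retrieval."""
    def __init__(self):
        self.feats, self.labels = [], []

    def store(self, feat, label):
        self.feats.append(np.asarray(feat, dtype=float))
        self.labels.append(label)

    def predict(self, feat):
        dists = [np.linalg.norm(f - feat) for f in self.feats]
        return self.labels[int(np.argmin(dists))]

class LifelongEnsemble:
    """Majority vote over per-representation memory units (toy version)."""
    def __init__(self, extractors):
        self.extractors = extractors                  # name -> feature fn
        self.units = {n: MemoryUnit() for n in extractors}

    def teach(self, image, label):                    # one call per example
        for n, fn in self.extractors.items():
            self.units[n].store(fn(image), label)

    def predict(self, image):
        votes = [self.units[n].predict(fn(image))
                 for n, fn in self.extractors.items()]
        return Counter(votes).most_common(1)[0][0]

# toy "representations": intensity histogram and mean intensity
ens = LifelongEnsemble({
    "hist": lambda im: np.histogram(im, bins=8, range=(0, 1))[0],
    "mean": lambda im: np.array([im.mean()]),
})
ens.teach(np.full((8, 8), 0.9), "mug")
ens.teach(np.full((8, 8), 0.1), "bowl")
print(ens.predict(np.full((8, 8), 0.8)))   # -> "mug"
```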
- DemoGrasp: Few-Shot Learning for Robotic Grasping with Human Demonstration [42.19014385637538]
We propose to teach a robot how to grasp an object with a simple and short human demonstration.
We first present a small sequence of RGB-D images displaying a human-object interaction.
This sequence is then leveraged to build associated hand and object meshes that represent the interaction.
arXiv Detail & Related papers (2021-12-06T08:17:12Z)
- Generalization in Dexterous Manipulation via Geometry-Aware Multi-Task Learning [108.08083976908195]
We show that policies learned by existing reinforcement learning algorithms can in fact be generalist.
We show that a single generalist policy can perform in-hand manipulation of over 100 geometrically-diverse real-world objects.
Interestingly, we find that multi-task learning with object point cloud representations not only generalizes better but even outperforms single-object specialist policies.
arXiv Detail & Related papers (2021-11-04T17:59:56Z)
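The object point-cloud representation credited above is commonly implemented with a PointNet-style encoder: a shared per-point MLP followed by an order-invariant max-pool. The sketch below shows that pattern under assumed dimensions; it is not the paper's actual network.

```python
import torch
import torch.nn as nn

class PointCloudEncoder(nn.Module):
    """PointNet-style encoder: a shared per-point MLP followed by a
    symmetric max-pool, so the output is invariant to point order."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, out_dim))

    def forward(self, pts):                        # pts: (B, N, 3)
        return self.mlp(pts).max(dim=1).values     # (B, out_dim)

# the pooled embedding would condition a multi-task manipulation policy
emb = PointCloudEncoder()(torch.randn(4, 512, 3))
print(emb.shape)   # torch.Size([4, 128])
```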
- O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance Learning [24.9242853417825]
We propose a unified affordance learning framework to learn object-object interaction for various tasks.
We are able to conduct large-scale object-object affordance learning without the need for human annotations or demonstrations.
Experiments on large-scale synthetic data and real-world data prove the effectiveness of the proposed approach.
arXiv Detail & Related papers (2021-06-29T04:38:12Z)
- Simultaneous Multi-View Object Recognition and Grasping in Open-Ended Domains [0.0]
We propose a deep learning architecture with augmented memory capacities to handle open-ended object recognition and grasping simultaneously.
We demonstrate the ability of our approach to grasp never-seen-before objects and to rapidly learn new object categories using very few examples on-site in both simulation and real-world settings.
arXiv Detail & Related papers (2021-06-03T14:12:11Z)