Object-Centric Representations Improve Policy Generalization in Robot Manipulation
- URL: http://arxiv.org/abs/2505.11563v1
- Date: Fri, 16 May 2025 07:06:37 GMT
- Title: Object-Centric Representations Improve Policy Generalization in Robot Manipulation
- Authors: Alexandre Chapin, Bruno Machado, Emmanuel Dellandrea, Liming Chen
- Abstract summary: We investigate object-centric representations (OCR) as a structured alternative that segments visual input into a finite set of entities. We benchmark a range of visual encoders (object-centric, global, and dense methods) across a suite of simulated and real-world manipulation tasks. Our findings reveal that OCR-based policies outperform dense and global representations in generalization settings, even without task-specific pretraining.
- Score: 43.18545365968973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual representations are central to the learning and generalization capabilities of robotic manipulation policies. While existing methods rely on global or dense features, such representations often entangle task-relevant and irrelevant scene information, limiting robustness under distribution shifts. In this work, we investigate object-centric representations (OCR) as a structured alternative that segments visual input into a finite set of entities, introducing inductive biases that align more naturally with manipulation tasks. We benchmark a range of visual encoders (object-centric, global, and dense methods) across a suite of simulated and real-world manipulation tasks ranging from simple to complex, and evaluate their generalization under diverse visual conditions including changes in lighting, texture, and the presence of distractors. Our findings reveal that OCR-based policies outperform dense and global representations in generalization settings, even without task-specific pretraining. These insights suggest that OCR is a promising direction for designing visual systems that generalize effectively in dynamic, real-world robotic environments.
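To make the OCR policy interface concrete, here is a minimal sketch of how an object-centric encoder can feed a manipulation policy: a slot-attention-style module groups image features into a finite set of slot vectors, and a small MLP head maps the slots to an action. This is an illustrative assumption, not the paper's implementation; all names and hyperparameters (SlotEncoder, num_slots, slot_dim, the 7-dimensional action, the use of PyTorch) are chosen only for the example.

```python
# Illustrative sketch of an object-centric (slot-based) policy.
# Not the paper's code: module names and sizes are assumptions.
import torch
import torch.nn as nn


class SlotEncoder(nn.Module):
    """Groups a grid of image features into `num_slots` object slots."""

    def __init__(self, feat_dim=64, num_slots=5, slot_dim=64, iters=3):
        super().__init__()
        self.num_slots, self.slot_dim, self.iters = num_slots, slot_dim, iters
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, slot_dim) * 0.02)
        self.to_k = nn.Linear(feat_dim, slot_dim, bias=False)
        self.to_v = nn.Linear(feat_dim, slot_dim, bias=False)
        self.to_q = nn.Linear(slot_dim, slot_dim, bias=False)
        self.update = nn.GRUCell(slot_dim, slot_dim)

    def forward(self, feats):                      # feats: (B, N, feat_dim)
        B = feats.shape[0]
        slots = self.slots_init.expand(B, -1, -1).contiguous()
        k, v = self.to_k(feats), self.to_v(feats)
        for _ in range(self.iters):
            q = self.to_q(slots)                   # (B, K, slot_dim)
            attn = torch.einsum("bkd,bnd->bkn", q, k) / self.slot_dim ** 0.5
            attn = attn.softmax(dim=1)             # slots compete for each location
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp(min=1e-6)
            updates = torch.einsum("bkn,bnd->bkd", attn, v)
            slots = self.update(
                updates.reshape(-1, self.slot_dim), slots.reshape(-1, self.slot_dim)
            ).view(B, self.num_slots, self.slot_dim)
        return slots                               # (B, K, slot_dim)


class SlotPolicy(nn.Module):
    """Concatenates the slots and regresses a continuous action."""

    def __init__(self, num_slots=5, slot_dim=64, action_dim=7):
        super().__init__()
        self.encoder = SlotEncoder(num_slots=num_slots, slot_dim=slot_dim)
        self.head = nn.Sequential(
            nn.Linear(num_slots * slot_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, feats):                      # feats from any visual backbone
        slots = self.encoder(feats)
        return self.head(slots.flatten(1))


if __name__ == "__main__":
    feats = torch.randn(2, 196, 64)                # e.g. a 14x14 feature grid
    action = SlotPolicy()(feats)
    print(action.shape)                            # torch.Size([2, 7])
```

The softmax over the slot dimension makes the slots compete for image locations, so the scene is carved into a small set of entities before the policy sees it; this is the structural inductive bias that distinguishes object-centric encoders from the global or dense features discussed in the abstract.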
Related papers
- CRIA: A Cross-View Interaction and Instance-Adapted Pre-training Framework for Generalizable EEG Representations [52.251569042852815]
CRIA is an adaptive framework that utilizes variable-length and variable-channel coding to achieve a unified representation of EEG data across different datasets. The model employs a cross-attention mechanism to fuse temporal, spectral, and spatial features effectively. Experimental results on the Temple University EEG corpus and the CHB-MIT dataset show that CRIA outperforms existing methods with the same pre-training conditions.
arXiv Detail & Related papers (2025-06-19T06:31:08Z) - Zero-Shot Visual Generalization in Robot Manipulation [0.13280779791485384]
Current approaches often sidestep the problem by relying on invariant representations such as point clouds and depth. Disentangled representation learning has recently shown promise in enabling vision-based reinforcement learning policies to be robust to visual distribution shifts. We demonstrate zero-shot adaptability to visual perturbations in both simulation and on real hardware.
arXiv Detail & Related papers (2025-05-16T22:01:46Z) - Deep Reinforcement Learning via Object-Centric Attention [17.623937562865617]
We introduce Object-Centric Attention via Masking (OCCAM), which selectively preserves task-relevant entities while filtering out irrelevant visual information. OCCAM significantly improves robustness to novel perturbations and reduces sample complexity while showing similar or improved performance compared to conventional pixel-based RL.
arXiv Detail & Related papers (2025-04-03T20:48:27Z) - Disentangled Object-Centric Image Representation for Robotic Manipulation [6.775909411692767]
We propose DOCIR, an object-centric framework that introduces a disentangled representation for objects of interest, obstacles, and robot embodiment. We show that this approach leads to state-of-the-art performance for learning pick and place skills from visual inputs in multi-object environments.
arXiv Detail & Related papers (2025-03-14T16:33:48Z) - Salience-Invariant Consistent Policy Learning for Generalization in Visual Reinforcement Learning [12.9372563969007]
Generalizing policies to unseen scenarios remains a critical challenge in visual reinforcement learning. In unseen environments, distracting pixels may lead agents to extract representations containing task-irrelevant information. We propose the Salience-Invariant Consistent Policy Learning algorithm, an efficient framework for zero-shot generalization.
arXiv Detail & Related papers (2025-02-12T12:00:16Z) - Flex: End-to-End Text-Instructed Visual Navigation from Foundation Model Features [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies. Our findings are synthesized in Flex (Fly lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors. We demonstrate the effectiveness of this approach on a quadrotor fly-to-target task, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z) - Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z) - MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z) - Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization [61.64304227831361]
Single-domain generalization aims to learn a model from single source domain data to achieve generalized performance on other unseen target domains.
We propose a dynamic object-centric perception network based on prompt learning, aiming to adapt to the variations in image complexity.
arXiv Detail & Related papers (2024-02-28T16:16:51Z) - Learning Generalizable Manipulation Policies with Object-Centric 3D Representations [65.55352131167213]
GROOT is an imitation learning method for learning robust policies with object-centric and 3D priors.
It builds policies that generalize beyond their initial training conditions for vision-based manipulation.
GROOT's performance excels in generalization over background changes, camera viewpoint shifts, and the presence of new object instances.
arXiv Detail & Related papers (2023-10-22T18:51:45Z) - Visuomotor Control in Multi-Object Scenes Using Object-Aware Representations [25.33452947179541]
We show the effectiveness of object-aware representation learning techniques for robotic tasks.
Our model learns control policies in a sample-efficient manner and outperforms state-of-the-art object-agnostic techniques.
arXiv Detail & Related papers (2022-05-12T19:48:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.