SORNet: Spatial Object-Centric Representations for Sequential
Manipulation
- URL: http://arxiv.org/abs/2109.03891v1
- Date: Wed, 8 Sep 2021 19:36:29 GMT
- Title: SORNet: Spatial Object-Centric Representations for Sequential
Manipulation
- Authors: Wentao Yuan, Chris Paxton, Karthik Desingh, Dieter Fox
- Abstract summary: Sequential manipulation tasks require a robot to perceive the state of an environment and plan a sequence of actions leading to a desired goal state.
We propose SORNet, which extracts object-centric representations from RGB images conditioned on canonical views of the objects of interest.
- Score: 39.88239245446054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Sequential manipulation tasks require a robot to perceive the state of an
environment and plan a sequence of actions leading to a desired goal state,
where the ability to reason about spatial relationships among object entities
from raw sensor inputs is crucial. Prior works relying on explicit state
estimation or end-to-end learning struggle with novel objects. In this work, we
propose SORNet (Spatial Object-Centric Representation Network), which extracts
object-centric representations from RGB images conditioned on canonical views
of the objects of interest. We show that the object embeddings learned by
SORNet generalize zero-shot to unseen object entities on three spatial
reasoning tasks: spatial relationship classification, skill precondition
classification and relative direction regression, significantly outperforming
baselines. Further, we present real-world robotic experiments demonstrating the
usage of the learned object embeddings in task planning for sequential
manipulation.
Related papers
- Structured Spatial Reasoning with Open Vocabulary Object Detectors [2.089191490381739]
Reasoning about spatial relationships between objects is essential for many real-world robotic tasks.
We introduce a structured probabilistic approach that integrates rich 3D geometric features with state-of-the-art open-vocabulary object detectors.
The approach is evaluated and compared against zero-shot performance of the state-of-the-art Vision and Language Models (VLMs) on spatial reasoning tasks.
arXiv Detail & Related papers (2024-10-09T19:37:01Z) - Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z) - Neural Slot Interpreters: Grounding Object Semantics in Emergent Slot Representations [4.807052027638089]
We present the Neural Slot Interpreter (NSI) that learns to ground and generate object semantics via slot representations.
NSI is an XML-like programming language that uses simple syntax rules to organize the object semantics of a scene into object-centric program primitives.
arXiv Detail & Related papers (2024-02-02T12:37:23Z) - ICGNet: A Unified Approach for Instance-Centric Grasping [42.92991092305974]
We introduce an end-to-end architecture for object-centric grasping.
We show the effectiveness of the proposed method by extensively evaluating it against state-of-the-art methods on synthetic datasets.
arXiv Detail & Related papers (2024-01-18T12:41:41Z) - Localizing Active Objects from Egocentric Vision with Symbolic World
Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability on localizing the active objects by: learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Object-Centric Scene Representations using Active Inference [4.298360054690217]
Representing a scene and its constituent objects from raw sensory data is a core ability for enabling robots to interact with their environment.
We propose a novel approach for scene understanding, leveraging a hierarchical object-centric generative model that enables an agent to infer object category.
For evaluating the behavior of an active vision agent, we also propose a new benchmark where, given a target viewpoint of a particular object, the agent needs to find the best matching viewpoint.
arXiv Detail & Related papers (2023-02-07T06:45:19Z) - Object Manipulation via Visual Target Localization [64.05939029132394]
Training agents to manipulate objects, poses many challenges.
We propose an approach that explores the environment in search for target objects, computes their 3D coordinates once they are located, and then continues to estimate their 3D locations even when the objects are not visible.
Our evaluations show a massive 3x improvement in success rate over a model that has access to the same sensory suite.
arXiv Detail & Related papers (2022-03-15T17:59:01Z) - KINet: Unsupervised Forward Models for Robotic Pushing Manipulation [8.572983995175909]
We introduce KINet -- an unsupervised framework to reason about object interactions based on a keypoint representation.
Our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system.
By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects.
arXiv Detail & Related papers (2022-02-18T03:32:08Z) - INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with human through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.