Object-centric Inference for Language Conditioned Placement: A Foundation Model based Approach
- URL: http://arxiv.org/abs/2304.02893v1
- Date: Thu, 6 Apr 2023 06:51:15 GMT
- Title: Object-centric Inference for Language Conditioned Placement: A Foundation Model based Approach
- Authors: Zhixuan Xu, Kechun Xu, Yue Wang, Rong Xiong
- Abstract summary: We focus on the task of language-conditioned object placement, in which a robot should generate placements that satisfy all the spatial constraints in language instructions.
We propose an object-centric framework that leverages foundation models to ground the reference objects and spatial relations for placement, which is more sample efficient and generalizable.
- Score: 12.016988248578027
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We focus on the task of language-conditioned object placement, in which a
robot should generate placements that satisfy all the spatial relational
constraints in language instructions. Previous works based on rule-based
language parsing or scene-centric visual representation have restrictions on
the form of instructions and reference objects or require large amounts of
training data. We propose an object-centric framework that leverages foundation
models to ground the reference objects and spatial relations for placement,
which is more sample efficient and generalizable. Experiments indicate that
our model achieves a 97.75% placement success rate with only ~0.26M trainable
parameters. Our method also generalizes better to both unseen objects and
unseen instructions, and with only 25% of the training data it still
outperforms the top competing approach.
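
To make the pipeline concrete, below is a minimal Python sketch of an object-centric placement filter in this spirit. It is not the authors' implementation: `detect` and `parse_relations` are hypothetical stand-ins for foundation-model calls (open-vocabulary grounding and instruction parsing), and the geometric predicates are deliberately simplistic.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: float  # box center, image x-coordinate
    y: float  # box center, image y-coordinate

def detect(image, phrase: str) -> Box:
    """Hypothetical open-vocabulary detector: grounds a referring phrase
    (e.g., "the red mug") to a bounding box in the image."""
    raise NotImplementedError

def parse_relations(instruction: str) -> list[tuple[str, str]]:
    """Hypothetical LLM call: extracts (relation, reference phrase) pairs,
    e.g., [("left of", "the red mug"), ("behind", "the plate")]."""
    raise NotImplementedError

def satisfies(cand: Box, relation: str, ref: Box) -> bool:
    # Toy geometric predicates; a learned relation module would replace these.
    checks = {
        "left of": cand.x < ref.x,
        "right of": cand.x > ref.x,
        "behind": cand.y < ref.y,
        "in front of": cand.y > ref.y,
    }
    return checks.get(relation, False)

def valid_placements(image, instruction: str, candidates: list[Box]) -> list[Box]:
    # Object-centric inference: ground each reference object once, then keep
    # only candidates that satisfy *all* spatial constraints.
    constraints = [(rel, detect(image, phrase))
                   for rel, phrase in parse_relations(instruction)]
    return [c for c in candidates
            if all(satisfies(c, rel, ref) for rel, ref in constraints)]
```

Keeping the foundation models frozen and training only a small placement-specific component is what would account for the small trainable footprint reported above.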
Related papers
- Zero-Shot Object-Centric Representation Learning [72.43369950684057]
We study current object-centric methods through the lens of zero-shot generalization.
We introduce a benchmark comprising eight different synthetic and real-world datasets.
We find that training on diverse real-world images improves transferability to unseen scenarios.
arXiv Detail & Related papers (2024-08-17T10:37:07Z)
- Adapting a Foundation Model for Space-based Tasks [16.81793096235458]
In the future of space robotics, we see three core challenges which motivate the use of a foundation model adapted to space-based applications.
In this work, we demonstrate that 1) existing vision-language models are deficient visual reasoners in space-based applications, and 2) fine-tuning a vision-language model on extraterrestrial data significantly improves the quality of responses.
arXiv Detail & Related papers (2024-08-12T05:07:24Z)
- Reuse Your Rewards: Reward Model Transfer for Zero-Shot Cross-Lingual Alignment [39.94156255629528]
We evaluate a simple approach for zero-shot cross-lingual alignment.
Cross-lingually aligned models are preferred by humans over unaligned models.
A different-language reward model sometimes yields better aligned models than a same-language reward model.
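As a rough sketch of how such transfer can be exercised (my illustration, not the authors' code), a reward model trained on source-language preferences can rerank target-language generations; `generate` and `reward` are hypothetical stand-ins:

```python
# Best-of-n reranking with a cross-lingual reward model. The reward model was
# trained in a different language than the prompt, but its scores can still
# provide a useful ranking signal.
def best_of_n(prompt: str, generate, reward, n: int = 8) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```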
arXiv Detail & Related papers (2024-04-18T16:52:36Z)
- ShapeGrasp: Zero-Shot Task-Oriented Grasping with Large Language Models through Geometric Decomposition [8.654140442734354]
Task-oriented grasping of unfamiliar objects is a necessary skill for robots in dynamic in-home environments.
We present a novel zero-shot task-oriented grasping method leveraging a geometric decomposition of the target object into simple convex shapes.
Our approach employs minimal essential information - the object's name and the intended task - to facilitate zero-shot task-oriented grasping.
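An illustrative sketch of the decomposition-plus-LLM recipe (assumptions of mine, not the authors' code; `ask_llm` is a hypothetical chat-completion wrapper):

```python
from dataclasses import dataclass

@dataclass
class ConvexPart:
    shape: str       # e.g., "cylinder", "box"
    extent: tuple    # rough dimensions in meters
    centroid: tuple  # position in the object frame

def choose_grasp_part(obj_name: str, task: str,
                      parts: list[ConvexPart], ask_llm) -> ConvexPart:
    # Serialize the geometric decomposition into text the LLM can reason over.
    desc = "; ".join(f"part {i}: {p.shape}, extent {p.extent}, at {p.centroid}"
                     for i, p in enumerate(parts))
    prompt = (f"Object: {obj_name}. Task: {task}. Parts: {desc}. "
              f"Which part index should the robot grasp? Answer with a number.")
    return parts[int(ask_llm(prompt))]
```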
arXiv Detail & Related papers (2024-03-26T19:26:53Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Energy-based Models are Zero-Shot Planners for Compositional Scene Rearrangement [19.494104738436892]
We show that our framework can execute compositional instructions zero-shot in simulation and in the real world.
It outperforms language-to-action reactive policies and large language model planners by a large margin, especially for long instructions that involve compositions of multiple concepts.
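A toy version of the compositional idea (an assumed form, not the paper's learned models): each relation contributes an energy over a candidate 2-D placement, instructions compose by summing energies, and planning is minimization:

```python
import numpy as np

def left_of(p, ref):
    # Penalize placements that are not clearly left of the reference.
    return max(0.0, p[0] - ref[0] + 0.1) ** 2

def near(p, ref):
    # Prefer placements roughly 0.2 m from the reference.
    return (np.linalg.norm(p - ref) - 0.2) ** 2

def minimize(energies, p0, steps=200, lr=0.05, eps=1e-4):
    p = p0.astype(float)
    for _ in range(steps):
        grad = np.zeros_like(p)
        for i in range(p.size):  # finite-difference gradient of the summed energy
            d = np.zeros_like(p)
            d[i] = eps
            grad[i] = sum(e(p + d) - e(p - d) for e in energies) / (2 * eps)
        p -= lr * grad
    return p

mug, plate = np.array([0.5, 0.0]), np.array([0.2, 0.3])
# "Place it to the left of the mug and near the plate."
goal = minimize([lambda p: left_of(p, mug), lambda p: near(p, plate)],
                p0=np.array([0.8, 0.8]))
```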
arXiv Detail & Related papers (2023-04-27T17:55:13Z)
- Learning 6-DoF Object Poses to Grasp Category-level Objects by Language Instructions [74.63313641583602]
This paper studies the task of grasping arbitrary objects from known categories given free-form language instructions.
We bring computer vision, natural language processing, and robotics together on this open challenge, which is essential to human-robot interaction.
We propose a language-guided 6-DoF category-level object localization model to achieve robotic grasping by comprehending human intention.
arXiv Detail & Related papers (2022-05-09T04:25:14Z)
- Learning Models as Functionals of Signed-Distance Fields for Manipulation Planning [51.74463056899926]
This work proposes an optimization-based manipulation planning framework where the objectives are learned functionals of signed-distance fields that represent objects in the scene.
We show that representing objects as signed-distance fields enables learning and representing a variety of models with higher accuracy than point-cloud and occupancy-measure representations.
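A toy illustration of the representation (my example with an analytic SDF, not the paper's learned functionals): an object is a signed-distance field, and a planning objective is written as a functional of it:

```python
import numpy as np

def sphere_sdf(center, radius):
    """SDF of a sphere: negative inside, zero on the surface, positive outside."""
    return lambda x: np.linalg.norm(x - center, axis=-1) - radius

def collision_cost(sdf, points, margin=0.02):
    # A functional of the SDF: hinge penalty on points within `margin` of the surface.
    return float(np.sum(np.maximum(0.0, margin - sdf(points)) ** 2))

obstacle = sphere_sdf(np.array([0.0, 0.0, 0.5]), radius=0.1)
gripper_pts = np.array([[0.0, 0.0, 0.65], [0.0, 0.0, 0.58]])
print(collision_cost(obstacle, gripper_pts))  # only the second point is penalized
```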
arXiv Detail & Related papers (2021-10-02T12:36:58Z)
- Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots [0.0]
We propose Target-dependent UNITER, which learns the relationship between the target object and other objects directly by focusing on the relevant regions within an image.
Our method is an extension of the UNITER-based Transformer that can be pretrained on general-purpose datasets.
Our model is validated on two standard datasets, and the results show that Target-dependent UNITER outperforms the baseline method in terms of classification accuracy.
arXiv Detail & Related papers (2021-07-02T03:11:02Z)
- SIRI: Spatial Relation Induced Network For Spatial Description Resolution [64.38872296406211]
We propose a novel Spatial Relation Induced (SIRI) network for language-guided localization.
We show that our method is around 24% better than the state-of-the-art method in accuracy, measured within an 80-pixel radius.
Our method also generalizes well on our proposed extended dataset collected using the same settings as Touchdown.
arXiv Detail & Related papers (2020-10-27T14:04:05Z)
- Pairwise Similarity Knowledge Transfer for Weakly Supervised Object Localization [53.99850033746663]
We study the problem of learning a localization model on target classes with weakly supervised image labels.
In this work, we argue that learning only an objectness function is a weak form of knowledge transfer.
Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of the pairwise similarity function.
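A schematic of the selection objective as I read the summary (not the authors' code): in two images sharing a weak label, pick the box pair that jointly maximizes transferred objectness and pairwise similarity:

```python
import numpy as np

def select_pair(obj_a, obj_b, sim):
    """obj_a, obj_b: objectness scores per candidate box in images A and B;
    sim[i, j]: transferred similarity between box i of A and box j of B."""
    score = obj_a[:, None] + obj_b[None, :] + sim
    return np.unravel_index(np.argmax(score), score.shape)

obj_a = np.array([0.2, 0.9, 0.4])
obj_b = np.array([0.7, 0.3])
sim = np.random.rand(3, 2)
print(select_pair(obj_a, obj_b, sim))  # indices of the jointly best box pair
```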
arXiv Detail & Related papers (2020-03-18T17:53:33Z)