Which objects help me to act effectively? Reasoning about physically-grounded affordances
- URL: http://arxiv.org/abs/2407.13811v1
- Date: Thu, 18 Jul 2024 11:08:57 GMT
- Title: Which objects help me to act effectively? Reasoning about physically-grounded affordances
- Authors: Anne Kemmeren, Gertjan Burghouts, Michael van Bekkum, Wouter Meijer, Jelle van Mil,
- Abstract summary: A key aspect of this understanding lies in detecting an object's affordances.
Our approach leverages a dialogue of large language models (LLMs) and vision-language models (VLMs) to achieve open-world affordance detection.
By grounding our system in the physical world, we account for the robot's embodiment and the intrinsic properties of the objects it encounters.
- Score: 0.6291443816903801
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: For effective interactions with the open world, robots should understand how interactions with known and novel objects help them towards their goal. A key aspect of this understanding lies in detecting an object's affordances, which represent the potential effects that can be achieved by manipulating the object in various ways. Our approach leverages a dialogue of large language models (LLMs) and vision-language models (VLMs) to achieve open-world affordance detection. Given open-vocabulary descriptions of intended actions and effects, the useful objects in the environment are found. By grounding our system in the physical world, we account for the robot's embodiment and the intrinsic properties of the objects it encounters. In our experiments, we have shown that our method produces tailored outputs based on different embodiments or intended effects. The method was able to select a useful object from a set of distractors. Finetuning the VLM for physical properties improved overall performance. These results underline the importance of grounding the affordance search in the physical world, by taking into account robot embodiment and the physical properties of objects.
Related papers
- PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large
Multimodal Models [58.33913881592706]
Humans can easily apply their intuitive physics to grasp skillfully and change grasps efficiently, even for objects they have never seen before.
This work delves into infusing such physical commonsense reasoning into robotic manipulation.
We introduce PhyGrasp, a multimodal large model that leverages inputs from two modalities: natural language and 3D point clouds.
arXiv Detail & Related papers (2024-02-26T18:57:52Z) - AffordanceLLM: Grounding Affordance from Vision Language Models [36.97072698640563]
Affordance grounding refers to the task of finding the area of an object with which one can interact.
Much of the knowledge is hidden and beyond the image content with the supervised labels from a limited training set.
We make an attempt to improve the generalization capability of the current affordance grounding by taking the advantage of the rich world, abstract, and human-object-interaction knowledge.
arXiv Detail & Related papers (2024-01-12T03:21:02Z) - Localizing Active Objects from Egocentric Vision with Symbolic World
Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability on localizing the active objects by: learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z) - Physically Grounded Vision-Language Models for Robotic Manipulation [59.143640049407104]
We propose PhysObjects, an object-centric dataset of 39.6K crowd-sourced and 417K automated physical concept annotations.
We show that fine-tuning a vision-language model on PhysObjects improves its understanding of physical object concepts.
We incorporate this physically grounded VLM in an interactive framework with a large language model-based robotic planner.
arXiv Detail & Related papers (2023-09-05T20:21:03Z) - Ditto in the House: Building Articulation Models of Indoor Scenes
through Interactive Perception [31.009703947432026]
This work explores building articulation models of indoor scenes through a robot's purposeful interactions.
We introduce an interactive perception approach to this task.
We demonstrate the effectiveness of our approach in both simulation and real-world scenes.
arXiv Detail & Related papers (2023-02-02T18:22:00Z) - INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with human through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z) - O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance
Learning [24.9242853417825]
We propose a unified affordance learning framework to learn object-object interaction for various tasks.
We are able to conduct large-scale object-object affordance learning without the need for human annotations or demonstrations.
Experiments on large-scale synthetic data and real-world data prove the effectiveness of the proposed approach.
arXiv Detail & Related papers (2021-06-29T04:38:12Z) - Property-Aware Robot Object Manipulation: a Generative Approach [57.70237375696411]
In this work, we focus on how to generate robot motion adapted to the hidden properties of the manipulated objects.
We explore the possibility of leveraging Generative Adversarial Networks to synthesize new actions coherent with the properties of the object.
Our results show that Generative Adversarial Nets can be a powerful tool for the generation of novel and meaningful transportation actions.
arXiv Detail & Related papers (2021-06-08T14:15:36Z) - Text-driven object affordance for guiding grasp-type recognition in
multimodal robot teaching [18.529563816600607]
This study investigates how text-driven object affordance affects image-based grasp-type recognition in robot teaching.
They created labeled datasets of first-person hand images to examine the impact of object affordance on recognition performance.
arXiv Detail & Related papers (2021-02-27T17:03:32Z) - Object Properties Inferring from and Transfer for Human Interaction
Motions [51.896592493436984]
In this paper, we present a fine-grained action recognition method that learns to infer object properties from human interaction motion alone.
We collect a large number of videos and 3D skeletal motions of the performing actors using an inertial motion capture device.
In particular, we learn to identify the interacting object, by estimating its weight, or its fragility or delicacy.
arXiv Detail & Related papers (2020-08-20T14:36:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.