Physical Reasoning and Object Planning for Household Embodied Agents
- URL: http://arxiv.org/abs/2311.13577v1
- Date: Wed, 22 Nov 2023 18:32:03 GMT
- Title: Physical Reasoning and Object Planning for Household Embodied Agents
- Authors: Ayush Agrawal, Raghav Prabhakar, Anirudh Goyal, Dianbo Liu
- Abstract summary: We introduce the CommonSense Object Affordance Task (COAT), a novel framework designed to analyze reasoning capabilities in commonsense scenarios.
COAT offers insights into the complexities of practical decision-making in real-world environments.
Our contributions include insightful Object-Utility mappings that align an object's utility with the task at hand, and two extensive QA datasets.
- Score: 21.719773664308683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this study, we explore the sophisticated domain of task planning for
robust household embodied agents, with a particular emphasis on the intricate
task of selecting substitute objects. We introduce the CommonSense Object
Affordance Task (COAT), a novel framework designed to analyze reasoning
capabilities in commonsense scenarios. This approach is centered on
understanding how these agents can effectively identify and utilize alternative
objects when executing household tasks, thereby offering insights into the
complexities of practical decision-making in real-world environments. Drawing
inspiration from human decision-making, we explore how large language models
tackle this challenge through three meticulously crafted commonsense
question-and-answer datasets, featuring refined rules and human annotations.
Our evaluation of state-of-the-art language models on these datasets sheds
light on three pivotal considerations: 1) aligning an object's inherent utility
with the task at hand, 2) navigating contextual dependencies (societal norms,
safety, appropriateness, and efficiency), and 3) accounting for the current
physical state of the object. To maintain accessibility, we introduce five
abstract variables reflecting an object's physical condition, modulated by
human insights to simulate diverse household scenarios. Our contributions
include insightful Object-Utility mappings addressing the first consideration
and two extensive QA datasets (15k and 130k questions) probing the intricacies
of contextual dependencies and object states. The datasets, along with our
findings, are accessible at: https://github.com/com-phy-affordance/COAT.
This research not only advances our understanding of physical commonsense
reasoning in language models but also paves the way for future improvements in
household agent intelligence.
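To make the setup concrete, here is a small, purely illustrative Python sketch of what a COAT-style substitute-object QA item and an exact-match evaluation could look like. The field names, the stand-in physical-state variables, and the example item below are hypothetical; the actual schema and the five abstract state variables are defined in the released datasets.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for the paper's five abstract physical-state
# variables; the real variable names are defined in the released datasets.
PHYSICAL_STATE_VARS = ("cleanliness", "temperature", "wetness",
                       "already-in-use", "intactness")

@dataclass
class COATQuestion:
    """Illustrative schema for one substitute-object QA item."""
    task: str                # household task, e.g. "water the plants"
    unavailable_object: str  # the ideal tool that is missing
    candidates: list         # substitute objects to choose among
    candidate_states: dict   # per-candidate physical-state assignments
    answer: str              # human-annotated best substitute

def exact_match_accuracy(predictions, questions):
    """Fraction of items on which the model picked the annotated substitute."""
    hits = sum(p == q.answer for p, q in zip(predictions, questions))
    return hits / len(questions)

# One made-up item: watering plants when no watering can is available.
item = COATQuestion(
    task="water the plants",
    unavailable_object="watering can",
    candidates=["measuring cup", "frying pan", "paper envelope"],
    candidate_states={"measuring cup": {"cleanliness": "clean",
                                        "already-in-use": False}},
    answer="measuring cup",
)
print(exact_match_accuracy(["measuring cup"], [item]))  # -> 1.0
```

A model is then judged on whether its choice respects utility, contextual dependencies, and the sampled physical states, mirroring the three considerations described above.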
Related papers
- Structured Spatial Reasoning with Open Vocabulary Object Detectors [2.089191490381739]
Reasoning about spatial relationships between objects is essential for many real-world robotic tasks.
We introduce a structured probabilistic approach that integrates rich 3D geometric features with state-of-the-art open-vocabulary object detectors.
The approach is evaluated and compared against the zero-shot performance of state-of-the-art Vision and Language Models (VLMs) on spatial reasoning tasks.
arXiv Detail & Related papers (2024-10-09T19:37:01Z)
- Deep Learning-Based Object Pose Estimation: A Comprehensive Survey [73.74933379151419]
We discuss the recent advances in deep learning-based object pose estimation.
Our survey also covers multiple input data modalities, degrees-of-freedom of output poses, object properties, and downstream tasks.
arXiv Detail & Related papers (2024-05-13T14:44:22Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase-grounding models' ability to localize active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- SPOTS: Stable Placement of Objects with Reasoning in Semi-Autonomous Teleoperation Systems [12.180724520887853]
We focus on two aspects of the place task: stability robustness and contextual reasonableness of object placements.
Our proposed method combines simulation-driven physical stability verification via real-to-sim and the semantic reasoning capability of large language models (a rough sketch of this kind of score fusion appears after this list).
arXiv Detail & Related papers (2023-09-25T08:13:49Z)
- Interactive Natural Language Processing [67.87925315773924]
Interactive Natural Language Processing (iNLP) has emerged as a novel paradigm within the field of NLP.
This paper offers a comprehensive survey of iNLP, starting by proposing a unified definition and framework of the concept.
arXiv Detail & Related papers (2023-05-22T17:18:29Z)
- Core Challenges in Embodied Vision-Language Planning [11.896110519868545]
Embodied Vision-Language Planning tasks leverage computer vision and natural language for interaction in physical environments.
We propose a taxonomy to unify these tasks and provide an analysis and comparison of the current and new algorithmic approaches.
We advocate for task construction that enables model generalisability and furthers real-world deployment.
arXiv Detail & Related papers (2023-04-05T20:37:13Z)
- ObjectFolder: A Dataset of Objects with Implicit Visual, Auditory, and Tactile Representations [52.226947570070784]
We present ObjectFolder, a dataset of 100 objects that addresses both challenges with two key innovations.
First, ObjectFolder encodes the visual, auditory, and tactile sensory data for all objects, enabling a number of multisensory object recognition tasks.
Second, ObjectFolder employs a uniform, object-centric, and implicit representation for each object's visual textures, acoustic simulations, and tactile readings, making the dataset flexible to use and easy to share.
arXiv Detail & Related papers (2021-09-16T14:00:59Z)
- Knowledge-driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering [80.60605604261416]
We propose a novel neuro-symbolic framework for zero-shot question answering across commonsense tasks.
We vary the set of language models, training regimes, knowledge sources, and data generation strategies, and measure their impact across tasks.
We show that, while an individual knowledge graph is better suited for specific tasks, a global knowledge graph brings consistent gains across different tasks.
arXiv Detail & Related papers (2020-11-07T22:52:21Z)
- Reinforcement Learning for Sparse-Reward Object-Interaction Tasks in a First-person Simulated 3D Environment [73.9469267445146]
First-person object-interaction tasks in high-fidelity, 3D, simulated environments such as AI2Thor pose significant sample-efficiency challenges for reinforcement learning agents.
We show that one can learn object-interaction tasks from scratch without supervision by learning an attentive object-model as an auxiliary task (a generic sketch of such an auxiliary objective appears after this list).
arXiv Detail & Related papers (2020-10-28T19:27:26Z)
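The SPOTS entry above fuses simulation-based stability checking with LLM semantic reasoning. Below is a minimal, hypothetical sketch of that kind of two-signal placement ranking; `simulate_stability` and `llm_semantic_score` are stub stand-ins, not the paper's API, and the linear blend of the two scores is an assumption.

```python
import random

def simulate_stability(placement, trials=20):
    # Stub for real-to-sim physics verification: a real system would run the
    # placement in simulation and report how often the object stays put.
    random.seed(hash(placement) % (2**32))
    return sum(random.random() > 0.3 for _ in range(trials)) / trials

def llm_semantic_score(placement, scene):
    # Stub for LLM-based contextual reasoning: a real system would ask an LLM
    # whether the placement is semantically and socially reasonable.
    return 0.9 if "table" in placement else 0.2

def rank_placements(placements, scene, w_phys=0.5):
    """Rank candidate placements by a weighted blend of physical stability
    and semantic reasonableness (the blend itself is an assumption here)."""
    def score(p):
        return (w_phys * simulate_stability(p)
                + (1 - w_phys) * llm_semantic_score(p, scene))
    return sorted(placements, key=score, reverse=True)

print(rank_placements(["mug on table", "mug on sofa arm"], scene="living room"))
```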
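Similarly, the sparse-reward RL entry learns an attentive object-model as an auxiliary task. A generic PyTorch sketch of such an auxiliary objective follows; the architecture and dimensions are invented for illustration and are not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveObjectModel(nn.Module):
    """Auxiliary head: attends over per-object features and, given the
    agent's action, predicts each object's features at the next timestep."""
    def __init__(self, obj_dim=64, act_dim=8, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(obj_dim, heads, batch_first=True)
        self.head = nn.Linear(obj_dim + act_dim, obj_dim)

    def forward(self, obj_feats, action):
        # obj_feats: (batch, num_objects, obj_dim); action: (batch, act_dim)
        attended, _ = self.attn(obj_feats, obj_feats, obj_feats)
        act = action.unsqueeze(1).expand(-1, obj_feats.size(1), -1)
        return self.head(torch.cat([attended, act], dim=-1))

model = AttentiveObjectModel()
obj_t = torch.randn(2, 5, 64)    # current per-object features
act_t = torch.randn(2, 8)        # agent action embedding
obj_t1 = torch.randn(2, 5, 64)   # next-step per-object features (targets)

# Auxiliary prediction loss, added to the sparse-reward RL objective to
# provide a dense, unsupervised learning signal.
aux_loss = F.mse_loss(model(obj_t, act_t), obj_t1)
print(aux_loss.item())
```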
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.