ELBA: Learning by Asking for Embodied Visual Navigation and Task Completion
- URL: http://arxiv.org/abs/2302.04865v3
- Date: Wed, 11 Dec 2024 22:55:09 GMT
- Title: ELBA: Learning by Asking for Embodied Visual Navigation and Task Completion
- Authors: Ying Shen, Daniel Bis, Cynthia Lu, Ismini Lourentzou,
- Abstract summary: We propose an Embodied Learning-By-Asking (ELBA) model that learns when and what questions to ask to dynamically acquire additional information for completing the task.
Experimental results show that the proposed method achieves improved task performance compared to baseline models without question-answering capabilities.
- Score: 11.068806371839424
- License:
- Abstract: The research community has shown increasing interest in designing intelligent embodied agents that can assist humans in accomplishing tasks. Although there have been significant advancements in related vision-language benchmarks, most prior work has focused on building agents that follow instructions rather than endowing agents the ability to ask questions to actively resolve ambiguities arising naturally in embodied environments. To address this gap, we propose an Embodied Learning-By-Asking (ELBA) model that learns when and what questions to ask to dynamically acquire additional information for completing the task. We evaluate ELBA on the TEACh vision-dialog navigation and task completion dataset. Experimental results show that the proposed method achieves improved task performance compared to baseline models without question-answering capabilities.
Related papers
- Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions.
Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes.
We study the ability of LLM agents to handle ambiguous instructions in interactive code generation settings by evaluating proprietary and open-weight models on their performance.
arXiv Detail & Related papers (2025-02-18T17:12:26Z) - EfficientEQA: An Efficient Approach for Open Vocabulary Embodied Question Answering [21.114403949257934]
Embodied Question Answering (EQA) is an essential yet challenging task for robotic home assistants.
Recent studies have shown that large vision-language models (VLMs) can be effectively utilized for EQA, but existing works either focus on video-based question answering or rely on closed-form choice sets.
We propose a novel framework called EfficientEQA for open-vocabulary EQA, which enables efficient exploration and accurate answering.
arXiv Detail & Related papers (2024-10-26T19:48:47Z) - Instruction Embedding: Latent Representations of Instructions Towards Task Identification [20.327984896070053]
For instructional data, the most important aspect is the task it represents, rather than the specific semantics and knowledge information.
In this work, we introduce a new concept, instruction embedding, and construct Instruction Embedding Benchmark (IEB) for its training and evaluation.
arXiv Detail & Related papers (2024-09-29T12:12:24Z) - WESE: Weak Exploration to Strong Exploitation for LLM Agents [95.6720931773781]
This paper proposes a novel approach, Weak Exploration to Strong Exploitation (WESE) to enhance LLM agents in solving open-world interactive tasks.
WESE involves decoupling the exploration and exploitation process, employing a cost-effective weak agent to perform exploration tasks for global knowledge.
A knowledge graph-based strategy is then introduced to store the acquired knowledge and extract task-relevant knowledge, enhancing the stronger agent in success rate and efficiency for the exploitation task.
arXiv Detail & Related papers (2024-04-11T03:31:54Z) - Can Foundation Models Watch, Talk and Guide You Step by Step to Make a
Cake? [62.59699229202307]
Despite advances in AI, it remains a significant challenge to develop interactive task guidance systems.
We created a new multimodal benchmark dataset, Watch, Talk and Guide (WTaG) based on natural interaction between a human user and a human instructor.
We leveraged several foundation models to study to what extent these models can be quickly adapted to perceptually enabled task guidance.
arXiv Detail & Related papers (2023-11-01T15:13:49Z) - Active Instruction Tuning: Improving Cross-Task Generalization by
Training on Prompt Sensitive Tasks [101.40633115037983]
Instruction tuning (IT) achieves impressive zero-shot generalization results by training large language models (LLMs) on a massive amount of diverse tasks with instructions.
How to select new tasks to improve the performance and generalizability of IT models remains an open question.
We propose active instruction tuning based on prompt uncertainty, a novel framework to identify informative tasks, and then actively tune the models on the selected tasks.
arXiv Detail & Related papers (2023-11-01T04:40:05Z) - Gotta: Generative Few-shot Question Answering by Prompt-based Cloze Data
Augmentation [18.531941086922256]
Few-shot question answering (QA) aims at precisely discovering answers to a set of questions from context passages.
We develop Gotta, a Generative prOmpT-based daTa Augmentation framework.
Inspired by the human reasoning process, we propose to integrate the cloze task to enhance few-shot QA learning.
arXiv Detail & Related papers (2023-06-07T01:44:43Z) - Asking Before Acting: Gather Information in Embodied Decision Making with Language Models [20.282749796376063]
We show that Large Language Models (LLMs) encounter challenges in efficiently gathering essential information in unfamiliar environments.
We propose textitAsking Before Acting (ABA), a method that empowers the agent to proactively inquire with external sources for pertinent information using natural language.
We conduct extensive experiments involving a spectrum of environments including text-based household everyday tasks, robot arm manipulation tasks, and real world open domain image based embodied tasks.
arXiv Detail & Related papers (2023-05-25T04:05:08Z) - Task Compass: Scaling Multi-task Pre-training with Task Prefix [122.49242976184617]
Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks.
We propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks.
Our model can not only serve as the strong foundation backbone for a wide range of tasks but also be feasible as a probing tool for analyzing task relationships.
arXiv Detail & Related papers (2022-10-12T15:02:04Z) - Hierarchical Few-Shot Imitation with Skill Transition Models [66.81252581083199]
Few-shot Imitation with Skill Transition Models (FIST) is an algorithm that extracts skills from offline data and utilizes them to generalize to unseen tasks.
We show that FIST is capable of generalizing to new tasks and substantially outperforms prior baselines in navigation experiments.
arXiv Detail & Related papers (2021-07-19T15:56:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.