Related papers: EfficientEQA: An Efficient Approach for Open Vocabulary Embodied Question Answering

EfficientEQA: An Efficient Approach for Open Vocabulary Embodied Question Answering

URL: http://arxiv.org/abs/2410.20263v1
Date: Sat, 26 Oct 2024 19:48:47 GMT
Title: EfficientEQA: An Efficient Approach for Open Vocabulary Embodied Question Answering
Authors: Kai Cheng, Zhengyuan Li, Xingpeng Sun, Byung-Cheol Min, Amrit Singh Bedi, Aniket Bera,
Abstract summary: Embodied Question Answering (EQA) is an essential yet challenging task for robotic home assistants. Recent studies have shown that large vision-language models (VLMs) can be effectively utilized for EQA, but existing works either focus on video-based question answering or rely on closed-form choice sets. We propose a novel framework called EfficientEQA for open-vocabulary EQA, which enables efficient exploration and accurate answering.
Score: 21.114403949257934
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Embodied Question Answering (EQA) is an essential yet challenging task for robotic home assistants. Recent studies have shown that large vision-language models (VLMs) can be effectively utilized for EQA, but existing works either focus on video-based question answering without embodied exploration or rely on closed-form choice sets. In real-world scenarios, a robotic agent must efficiently explore and accurately answer questions in open-vocabulary settings. To address these challenges, we propose a novel framework called EfficientEQA for open-vocabulary EQA, which enables efficient exploration and accurate answering. In EfficientEQA, the robot actively explores unknown environments using Semantic-Value-Weighted Frontier Exploration, a strategy that prioritizes exploration based on semantic importance provided by calibrated confidence from black-box VLMs to quickly gather relevant information. To generate accurate answers, we employ Retrieval-Augmented Generation (RAG), which utilizes BLIP to retrieve useful images from accumulated observations and VLM reasoning to produce responses without relying on predefined answer choices. Additionally, we detect observations that are highly relevant to the question as outliers, allowing the robot to determine when it has sufficient information to stop exploring and provide an answer. Experimental results demonstrate the effectiveness of our approach, showing an improvement in answering accuracy by over 15% and efficiency, measured in running steps, by over 20% compared to state-of-the-art methods.

Related papers

Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering [87.76784654371312]
Embodied Question Answering requires agents to dynamically explore 3D environments, actively gather visual information, and perform multi-step reasoning to answer questions. Existing datasets often introduce biases or prior knowledge, leading to disembodied reasoning. We construct the largest dataset designed specifically to evaluate both exploration and reasoning capabilities.
arXiv Detail & Related papers (2025-03-14T06:29:47Z)
Open-Ended and Knowledge-Intensive Video Question Answering [20.256081440725353]
We investigate knowledge-intensive video question answering (KI-VideoQA) through the lens of multi-modal retrieval-augmented generation. Our analysis examines various retrieval augmentation approaches using cutting-edge retrieval and vision language models. We achieve a substantial 17.5% improvement in accuracy on multiple choice questions in the KnowIT VQA dataset.
arXiv Detail & Related papers (2025-02-17T12:40:35Z)
Divide-and-Conquer: Tree-structured Strategy with Answer Distribution Estimator for Goal-Oriented Visual Dialogue [30.126882554391837]
Tree-Structured Strategy with Answer Distribution Estimator (TSADE) We propose a Tree-Structured Strategy with Answer Distribution Estimator (TSADE) which guides the question generation by excluding half of the current candidate objects in each round. We experimentally demonstrate that our method can enable the agents to achieve high task-oriented accuracy with fewer repeating questions and rounds compared to traditional ergodic question generation approaches.
arXiv Detail & Related papers (2025-02-09T08:16:09Z)
DeepRAG: Thinking to Retrieval Step by Step for Large Language Models [92.87532210660456]
We propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP) By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step. Experiments show that DeepRAG improves retrieval efficiency while improving answer accuracy by 21.99%, demonstrating its effectiveness in optimizing retrieval-augmented reasoning.
arXiv Detail & Related papers (2025-02-03T08:22:45Z)
Retriever-and-Memory: Towards Adaptive Note-Enhanced Retrieval-Augmented Generation [72.70046559930555]
We propose a generic RAG approach called Adaptive Note-Enhanced RAG (Adaptive-Note) for complex QA tasks. Specifically, Adaptive-Note introduces an overarching view of knowledge growth, iteratively gathering new information in the form of notes. In addition, we employ an adaptive, note-based stop-exploration strategy to decide "what to retrieve and when to stop" to encourage sufficient knowledge exploration.
arXiv Detail & Related papers (2024-10-11T14:03:29Z)
RAVEN: Multitask Retrieval Augmented Vision-Language Learning [5.1583788731239455]
The scaling of large language models to encode all the world's knowledge is unsustainable and has exacerbated resource barriers. Retrieval-Augmented Generation (RAG) presents a potential solution, yet its application to vision-language models (VLMs) is under explored. This paper introduces RAVEN, a retrieval augmented VLM framework that enhances base VLMs through efficient, task specific fine-tuning.
arXiv Detail & Related papers (2024-06-27T13:08:35Z)
Automated Evaluation of Retrieval-Augmented Language Models with Task-Specific Exam Generation [9.390902237835457]
We propose a new method to measure the task-specific accuracy of Retrieval-Augmented Large Language Models (RAG) Evaluation is performed by scoring the RAG on an automatically-generated synthetic exam composed of multiple choice questions.
arXiv Detail & Related papers (2024-05-22T13:14:11Z)
WESE: Weak Exploration to Strong Exploitation for LLM Agents [95.6720931773781]
This paper proposes a novel approach, Weak Exploration to Strong Exploitation (WESE) to enhance LLM agents in solving open-world interactive tasks. WESE involves decoupling the exploration and exploitation process, employing a cost-effective weak agent to perform exploration tasks for global knowledge. A knowledge graph-based strategy is then introduced to store the acquired knowledge and extract task-relevant knowledge, enhancing the stronger agent in success rate and efficiency for the exploitation task.
arXiv Detail & Related papers (2024-04-11T03:31:54Z)
Explore until Confident: Efficient Exploration for Embodied Question Answering [32.27111287314288]
We leverage the strong semantic reasoning capabilities of large vision-language models to efficiently explore and answer questions. We propose a method that first builds a semantic map of the scene based on depth information and via visual prompting of a VLM. Next, we use conformal prediction to calibrate the VLM's question answering confidence, allowing the robot to know when to stop exploration.
arXiv Detail & Related papers (2024-03-23T22:04:03Z)
Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations [9.444540281544715]
We introduce a novel agent for active open-vocabulary recognition. The proposed method leverages inter-frame and inter-concept similarities to navigate agent movements and to fuse features, without relying on class-specific knowledge.
arXiv Detail & Related papers (2023-11-28T19:24:07Z)
Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs [58.620269228776294]
We propose a task-agnostic framework for resolving ambiguity by asking users clarifying questions. We evaluate systems across three NLP applications: question answering, machine translation and natural language inference. We find that intent-sim is robust, demonstrating improvements across a wide range of NLP tasks and LMs.
arXiv Detail & Related papers (2023-11-16T00:18:50Z)
Self-Knowledge Guided Retrieval Augmentation for Large Language Models [59.771098292611846]
Large language models (LLMs) have shown superior performance without task-specific fine-tuning. Retrieval-based methods can offer non-parametric world knowledge and improve the performance on tasks such as question answering. Self-Knowledge guided Retrieval augmentation (SKR) is a simple yet effective method which can let LLMs refer to the questions they have previously encountered.
arXiv Detail & Related papers (2023-10-08T04:22:33Z)
A Survey for Efficient Open Domain Question Answering [51.67110249787223]
Open domain question answering (ODQA) is a longstanding task aimed at answering factual questions from a large knowledge corpus without any explicit evidence in natural language processing (NLP)
arXiv Detail & Related papers (2022-11-15T04:18:53Z)
Visual Question Answering with Prior Class Semantics [50.845003775809836]
We show how to exploit additional information pertaining to the semantics of candidate answers. We extend the answer prediction process with a regression objective in a semantic space. Our method brings improvements in consistency and accuracy over a range of question types.
arXiv Detail & Related papers (2020-05-04T02:46:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.