GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
- URL: http://arxiv.org/abs/2412.14480v1
- Date: Thu, 19 Dec 2024 03:04:34 GMT
- Title: GraphEQA: Using 3D Semantic Scene Graphs for Real-time Embodied Question Answering
- Authors: Saumya Saxena, Blake Buchanan, Chris Paxton, Bingqing Chen, Narunas Vaskevicius, Luigi Palmieri, Jonathan Francis, Oliver Kroemer
- Abstract summary: In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. We propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs). We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration.
- Score: 23.459190671283487
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: In Embodied Question Answering (EQA), agents must explore and develop a semantic understanding of an unseen environment in order to answer a situated question with confidence. This remains a challenging problem in robotics, due to the difficulties in obtaining useful semantic representations, updating these representations online, and leveraging prior world knowledge for efficient exploration and planning. Aiming to address these limitations, we propose GraphEQA, a novel approach that utilizes real-time 3D metric-semantic scene graphs (3DSGs) and task-relevant images as multi-modal memory for grounding Vision-Language Models (VLMs) to perform EQA tasks in unseen environments. We employ a hierarchical planning approach that exploits the hierarchical nature of 3DSGs for structured planning and semantic-guided exploration. Through experiments in simulation on the HM-EQA dataset and in the real world in home and office environments, we demonstrate that our method outperforms key baselines by completing EQA tasks with higher success rates and fewer planning steps.
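The hierarchical planning idea can be made concrete with a small sketch: a coarse-to-fine traversal of a metric-semantic scene graph that yields exploration candidates for a planner to rank. All names here (`SceneGraphNode`, `hierarchical_candidates`, the level labels) are illustrative assumptions for this listing, not GraphEQA's published interface.

```python
from dataclasses import dataclass, field

# A minimal metric-semantic scene graph: a building contains rooms,
# rooms contain objects. Names are illustrative, not GraphEQA's API.
@dataclass
class SceneGraphNode:
    label: str                      # e.g. "kitchen", "mug"
    level: str                      # "building" | "room" | "object"
    explored: bool = False
    children: list["SceneGraphNode"] = field(default_factory=list)

def hierarchical_candidates(root: SceneGraphNode) -> list[SceneGraphNode]:
    """Coarse-to-fine traversal: unexplored rooms come back as whole-room
    candidates; explored rooms expose their unexplored objects instead.
    A planner (e.g. a VLM prompted with the graph) would rank these."""
    candidates = []
    for room in root.children:
        if not room.explored:
            candidates.append(room)        # explore the room itself first
        else:
            candidates.extend(o for o in room.children if not o.explored)
    return candidates

house = SceneGraphNode("house", "building", children=[
    SceneGraphNode("kitchen", "room", explored=True,
                   children=[SceneGraphNode("mug", "object")]),
    SceneGraphNode("bedroom", "room"),
])
print([n.label for n in hierarchical_candidates(house)])  # ['mug', 'bedroom']
```

The point of the hierarchy is that the planner never has to reason over every object at once: unvisited rooms are summarized by a single node until they are explored.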
Related papers
- Open-World 3D Scene Graph Generation for Retrieval-Augmented Reasoning [24.17324180628543]
We propose a unified framework for Open-World 3D Scene Graph Generation with Retrieval-Augmented Reasoning. Our method integrates Vision-Language Models (VLMs) with retrieval-based reasoning to support multimodal exploration and language-guided interaction. We evaluate our method on the 3DSSG and Replica benchmarks across four tasks (scene question answering, visual grounding, instance retrieval, and task planning), demonstrating robust generalization and superior performance in diverse environments.
arXiv Detail & Related papers (2025-11-08T07:37:29Z)
- KeySG: Hierarchical Keyframe-Based 3D Scene Graphs [1.5134439544218246]
KeySG represents 3D scenes as a hierarchical graph consisting of floors, rooms, objects, and functional elements. We leverage a VLM to extract scene information, alleviating the need to explicitly model relationship edges between objects. Our approach can process complex and ambiguous queries while mitigating the scalability issues associated with large scene graphs.
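A toy version of a keyframe-indexed hierarchy makes the idea tangible: queries are resolved top-down and answered by returning the views a VLM would inspect, rather than precomputed relation edges. The structure and names below (floors, rooms, `keyframes_for`) are assumptions for illustration, not taken from the KeySG implementation.

```python
# Toy keyframe-indexed scene hierarchy in the spirit of KeySG.
# Layout and names are illustrative assumptions only.
scene = {
    "floor_0": {
        "kitchen": {
            "objects": ["fridge", "sink"],
            "keyframes": ["kf_012.jpg", "kf_047.jpg"],  # views covering the room
        },
        "hallway": {
            "objects": ["door"],
            "keyframes": ["kf_003.jpg"],
        },
    }
}

def keyframes_for(query_object: str) -> list[str]:
    """Resolve a query top-down (floor -> room -> object) and return the
    keyframes a VLM would inspect, instead of stored relationship edges."""
    hits = []
    for floor in scene.values():
        for room in floor.values():
            if query_object in room["objects"]:
                hits.extend(room["keyframes"])
    return hits

print(keyframes_for("sink"))  # ['kf_012.jpg', 'kf_047.jpg']
```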
arXiv Detail & Related papers (2025-10-01T15:53:27Z)
- Queryable 3D Scene Representation: A Multi-Modal Framework for Semantic Reasoning and Robotic Task Planning [28.803789915686398]
3D Queryable Scene Representation (3D QSR) is a framework built on multimedia data that unifies three complementary 3D representations. Built on an object-centric design, the framework integrates with large vision-language models to enable semantic queryability. Results demonstrate the framework's ability to facilitate scene understanding and integrate spatial and semantic reasoning.
arXiv Detail & Related papers (2025-09-24T12:53:32Z)
- Spatial Understanding from Videos: Structured Prompts Meet Simulation Data [79.52833996220059]
We present a unified framework for enhancing 3D spatial reasoning in pre-trained vision-language models without modifying their architecture. This framework combines SpatialMind, a structured prompting strategy that decomposes complex scenes and questions into interpretable reasoning steps, with ScanForgeQA, a scalable question-answering dataset built from diverse 3D simulation scenes.
arXiv Detail & Related papers (2025-06-04T07:36:33Z)
- Agentic 3D Scene Generation with Spatially Contextualized VLMs [67.31920821192323]
We introduce a new paradigm that enables vision-language models to generate, understand, and edit complex 3D environments. We develop an agentic 3D scene generation pipeline in which the VLM iteratively reads from and updates the spatial context. Results show that our framework can handle diverse and challenging inputs, achieving a level of generalization not observed in prior work.
arXiv Detail & Related papers (2025-05-26T15:28:17Z)
- DSM: Building A Diverse Semantic Map for 3D Visual Grounding [4.89669292144966]
We propose a diverse semantic map construction method specifically designed for robotic agents performing 3D Visual Grounding tasks.
This method leverages vision-language models (VLMs) to capture the latent semantic attributes and relations of objects within the scene and creates a Diverse Semantic Map (DSM) through a geometry sliding-window map construction strategy.
Experimental results show that our method outperforms current approaches in tasks like semantic segmentation and 3D Visual Grounding, particularly excelling in overall metrics compared to the state-of-the-art.
arXiv Detail & Related papers (2025-04-11T07:18:42Z)
- ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis [15.68979922374718]
ASHiTA is a framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks.
Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks.
arXiv Detail & Related papers (2025-04-09T03:22:52Z)
- EmbodiedVSR: Dynamic Scene Graph-Guided Chain-of-Thought Reasoning for Visual Spatial Tasks [24.41705039390567]
EmbodiedVSR (Embodied Visual Spatial Reasoning) is a novel framework that integrates dynamic scene graph-guided Chain-of-Thought (CoT) reasoning.
Our method enables zero-shot spatial reasoning without task-specific fine-tuning.
Experiments demonstrate that our framework significantly outperforms existing MLLM-based methods in accuracy and reasoning coherence.
arXiv Detail & Related papers (2025-03-14T05:06:07Z)
- 3D-AffordanceLLM: Harnessing Large Language Models for Open-Vocabulary Affordance Detection in 3D Worlds [81.14476072159049]
3D affordance detection is a challenging problem with broad applications across various robotic tasks.
We reformulate the traditional affordance detection paradigm as an Instruction Reasoning Affordance Segmentation (IRAS) task.
We propose 3D-ADLLM, a framework designed for reasoning affordance detection in open 3D scenes.
arXiv Detail & Related papers (2025-02-27T12:29:44Z)
- Instance-Aware Graph Prompt Learning [71.26108600288308]
We introduce Instance-Aware Graph Prompt Learning (IA-GPL) in this paper.
The process involves generating intermediate prompts for each instance using a lightweight architecture.
Experiments conducted on multiple datasets and settings showcase the superior performance of IA-GPL compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-26T18:38:38Z)
- VeriGraph: Scene Graphs for Execution Verifiable Robot Planning [33.8868315479384]
We propose VeriGraph, a framework that integrates vision-language models (VLMs) for robotic planning while verifying action feasibility.
VeriGraph employs scene graphs as an intermediate representation, capturing key objects and spatial relationships to improve plan verification and refinement.
Our approach significantly enhances task completion rates across diverse manipulation scenarios, outperforming baseline methods by 58% for language-based tasks and 30% for image-based tasks.
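Scene-graph-based plan verification can be sketched in a few lines: each proposed action is checked against relations in the graph before execution, and infeasible steps are flagged for refinement. The predicate names and action format below are illustrative assumptions, not VeriGraph's actual representation.

```python
# Illustrative-only sketch of scene-graph-based action verification in the
# spirit of VeriGraph; relation triples and the action format are assumptions.
scene_graph = {("mug", "on", "table"), ("book", "on", "mug")}

def feasible(action: tuple[str, str]) -> bool:
    """A pick is feasible only if nothing rests on the target object."""
    verb, obj = action
    if verb == "pick":
        return not any(rel == "on" and below == obj
                       for _, rel, below in scene_graph)
    return True

plan = [("pick", "mug"), ("pick", "book")]
# Verify each step; an infeasible step would be sent back for refinement.
print([(a, feasible(a)) for a in plan])
# [(('pick', 'mug'), False), (('pick', 'book'), True)]
```

Grounding the check in explicit relations is what lets the verifier catch errors (here, picking the mug while a book sits on it) that a purely text-based plan would miss.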
arXiv Detail & Related papers (2024-11-15T18:59:51Z)
- Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- 3D Question Answering for City Scene Understanding [12.433903847890322]
3D multimodal question answering (MQA) plays a crucial role in scene understanding by enabling intelligent agents to comprehend their surroundings in 3D environments.
We introduce a novel 3D MQA dataset named City-3DQA for city-level scene understanding.
A new benchmark is reported, and our proposed Sg-CityU achieves accuracies of 63.94% and 63.76% in different settings of City-3DQA.
arXiv Detail & Related papers (2024-07-24T16:22:27Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments robustly display our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- Map-based Modular Approach for Zero-shot Embodied Question Answering [9.234108543963568]
Embodied Question Answering (EQA) serves as a benchmark task to evaluate the capability of robots to navigate within novel environments.
This paper presents a map-based modular approach to EQA, enabling real-world robots to explore and map unknown environments.
arXiv Detail & Related papers (2024-05-26T13:10:59Z)
- OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning [68.45848423501927]
We propose a holistic framework for strong alignment between agent models and 3D driving tasks.
Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D.
We propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model.
arXiv Detail & Related papers (2024-05-02T17:59:24Z)
- MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting [97.52388851329667]
We introduce Marking Open-world Keypoint Affordances (MOKA) to solve robotic manipulation tasks specified by free-form language instructions.
Central to our approach is a compact point-based representation of affordance, which bridges the VLM's predictions on observed images and the robot's actions in the physical world.
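A compact point-based affordance can be sketched as a handful of image keypoints lifted into robot waypoints through a calibration supplied by the caller. The field names (`grasp_px`, `function_px`, `target_px`) and the planar mapping are illustrative assumptions; MOKA's actual representation may differ.

```python
from dataclasses import dataclass

# A compact point-based affordance record in the spirit of MOKA. Field names
# are illustrative assumptions, not MOKA's published format.
@dataclass
class PointAffordance:
    grasp_px: tuple[int, int]      # where to grasp, in image pixels
    function_px: tuple[int, int]   # tool point that contacts the target
    target_px: tuple[int, int]     # where the motion should end

def to_waypoints(aff: PointAffordance, pixel_to_xy) -> list[tuple[float, float]]:
    """Lift the VLM's 2D keypoint picks into planar robot waypoints via a
    calibrated pixel->world mapping supplied by the caller."""
    return [pixel_to_xy(aff.grasp_px), pixel_to_xy(aff.target_px)]

# Stand-in calibration: 100 px per meter, origin at the image corner.
waypoints = to_waypoints(
    PointAffordance(grasp_px=(120, 80), function_px=(150, 80), target_px=(300, 200)),
    pixel_to_xy=lambda p: (p[0] / 100.0, p[1] / 100.0),
)
print(waypoints)  # [(1.2, 0.8), (3.0, 2.0)]
```

The design point is that a few 2D keypoints are enough to bridge image-space predictions and physical actions, so the VLM never has to emit 3D coordinates directly.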
We evaluate and analyze MOKA's performance on various table-top manipulation tasks including tool use, deformable body manipulation, and object rearrangement.
arXiv Detail & Related papers (2024-03-05T18:08:45Z)
- DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent) [73.10899129264375]
This paper explores DoraemonGPT, a comprehensive and conceptually elegant system driven by LLMs to understand dynamic scenes.
Given a video with a question/task, DoraemonGPT begins by converting the input video into a symbolic memory that stores task-related attributes.
We extensively evaluate DoraemonGPT's effectiveness on three benchmarks and several in-the-wild scenarios.
arXiv Detail & Related papers (2024-01-16T14:33:09Z)
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711]
Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
We present an interactive planning technique for partially observable tasks using LLMs.
arXiv Detail & Related papers (2023-12-11T22:54:44Z)
- ConceptGraphs: Open-Vocabulary 3D Scene Graphs for Perception and Planning [125.90002884194838]
ConceptGraphs is an open-vocabulary graph-structured representation for 3D scenes.
It is built by leveraging 2D foundation models and fusing their output to 3D by multi-view association.
We demonstrate the utility of this representation through a number of downstream planning tasks.
arXiv Detail & Related papers (2023-09-28T17:53:38Z)
- SayPlan: Grounding Large Language Models using 3D Scene Graphs for Scalable Robot Task Planning [15.346150968195015]
We introduce SayPlan, a scalable approach to large-scale task planning for robotics using 3D scene graph (3DSG) representations.
We evaluate our approach on two large-scale environments spanning up to 3 floors and 36 rooms with 140 assets and objects.
arXiv Detail & Related papers (2023-07-12T12:37:55Z)
- Embodied Task Planning with Large Language Models [86.63533340293361]
We propose a TAsk Planning Agent (TaPA) for grounded planning in embodied tasks with physical scene constraints.
During inference, we discover the objects in the scene by extending open-vocabulary object detectors to multi-view RGB images collected at different achievable locations.
Experimental results show that the plans generated by our TaPA framework achieve a higher success rate than LLaVA and GPT-3.5 by a sizable margin.
arXiv Detail & Related papers (2023-07-04T17:58:25Z)
- Towards Multimodal Multitask Scene Understanding Models for Indoor Mobile Agents [49.904531485843464]
In this paper, we discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments.
We describe MMISM (Multi-modality input Multi-task output Indoor Scene understanding Model) to tackle the above challenges.
MMISM considers RGB images as well as sparse Lidar points as inputs and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks.
We show that MMISM performs on par or even better than single-task models.
arXiv Detail & Related papers (2022-09-27T04:49:19Z)
- A Unified Query-based Paradigm for Point Cloud Understanding [116.30071021894317]
We present a novel Embedding-Querying paradigm (EQ-Paradigm) for 3D understanding tasks including detection, segmentation and classification.
The input is encoded in the embedding stage with an arbitrary feature extraction architecture, which is independent of tasks and heads.
This is achieved by introducing an intermediate representation, i.e., Q-representation, in the querying stage to serve as a bridge between the embedding stage and task heads.
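A minimal embedding-then-querying pipeline shows how such a Q-representation decouples the backbone from the task heads. The shapes, the random-projection "backbone", and the attention-like pooling rule below are assumptions for illustration only, not the EQ-Paradigm's actual architecture.

```python
import numpy as np

# Toy Embedding-Querying pipeline in the spirit of the EQ-Paradigm.
# Shapes and the pooling rule are illustrative assumptions.
def embed(points: np.ndarray) -> np.ndarray:
    """Task-agnostic embedding stage: per-point features (a fixed random
    projection stands in for an arbitrary backbone)."""
    rng = np.random.default_rng(0)
    w = rng.standard_normal((points.shape[1], 16))
    return points @ w                       # (N, 16) point features

def query(features: np.ndarray, queries: np.ndarray) -> np.ndarray:
    """Querying stage: attention-like pooling yields a Q-representation with
    one vector per query, regardless of the downstream task head."""
    scores = queries @ features.T           # (Q, N) similarities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ features               # (Q, 16) Q-representation

pts = np.random.default_rng(1).standard_normal((128, 3))
feats = embed(pts)
q_repr = query(feats, np.zeros((4, 16)))    # e.g. 4 detection queries
print(q_repr.shape)  # (4, 16)
```

Because every head consumes the same fixed-size Q-representation, detection, segmentation, and classification heads can share one embedding stage.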
arXiv Detail & Related papers (2022-03-02T17:17:03Z)
- Situational Graphs for Robot Navigation in Structured Indoor Environments [9.13466172688693]
We present real-time, online-built Situational Graphs (S-Graphs), composed of a single graph representing the environment.
Our method utilizes odometry readings and planar surfaces extracted from 3D LiDAR scans to construct and optimize, in real time, a three-layered S-Graph.
Our proposal not only demonstrates state-of-the-art results for robot pose estimation, but also contributes a metric-semantic-topological model of the environment.
arXiv Detail & Related papers (2022-02-24T16:59:06Z)
- Core Challenges in Embodied Vision-Language Planning [9.190245973578698]
We discuss Embodied Vision-Language Planning tasks, a family of prominent embodied navigation and manipulation problems.
We propose a taxonomy to unify these tasks and provide an analysis and comparison of the new and current algorithmic approaches.
We advocate for task construction that enables model generalizability and furthers real-world deployment.
arXiv Detail & Related papers (2021-06-26T05:18:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.