Visual Agentic AI for Spatial Reasoning with a Dynamic API
- URL: http://arxiv.org/abs/2502.06787v1
- Date: Mon, 10 Feb 2025 18:59:35 GMT
- Title: Visual Agentic AI for Spatial Reasoning with a Dynamic API
- Authors: Damiano Marsili, Rohun Agrawal, Yisong Yue, Georgia Gkioxari
- Abstract summary: We introduce an agentic program synthesis approach to solve 3D spatial reasoning problems.
Our method overcomes limitations of prior approaches that rely on a static, human-defined API.
We show that our method outperforms prior zero-shot models for visual reasoning in 3D.
- Score: 26.759236329608935
- Abstract: Visual reasoning -- the ability to interpret the visual world -- is crucial for embodied agents that operate within three-dimensional scenes. Progress in AI has led to vision and language models capable of answering questions from images. However, their performance declines when tasked with 3D spatial reasoning. To tackle the complexity of such reasoning problems, we introduce an agentic program synthesis approach where LLM agents collaboratively generate a Pythonic API with new functions to solve common subproblems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API, allowing it to handle a wider range of queries. To assess AI capabilities for 3D understanding, we introduce a new benchmark of queries involving multiple steps of grounding and inference. We show that our method outperforms prior zero-shot models for visual reasoning in 3D and empirically validate the effectiveness of our agentic framework for 3D spatial reasoning tasks. Project website: https://glab-caltech.github.io/vadar/
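The core idea, LLM agents that collaboratively grow a Pythonic API by synthesizing new helper functions for recurring subproblems, can be illustrated with a toy registry. Everything below (the `register` helper, the registry name, the hard-coded function source standing in for LLM output) is an illustrative assumption, not the authors' actual VADAR code:

```python
# Toy sketch of a dynamic, agent-grown API (illustrative assumptions only).
# In the paper, an LLM agent would author the function source; here a
# hard-coded string stands in for that generated code.

api_registry = {}  # name -> callable; grows as agents contribute helpers


def register(fn):
    """Add a synthesized helper to the shared API."""
    api_registry[fn.__name__] = fn
    return fn


# A proposing agent identifies a recurring subproblem (e.g. measuring
# the distance between two grounded 3D objects) and emits a new primitive.
new_function_source = '''
def distance_3d(a, b):
    """Euclidean distance between two 3D points (x, y, z)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
'''

# An implementing agent's generated source is executed and registered,
# extending the API available to future query programs.
namespace = {}
exec(new_function_source, namespace)
register(namespace["distance_3d"])

# A downstream query program can now call the newly synthesized primitive.
chair, table = (0.0, 0.0, 0.0), (3.0, 4.0, 0.0)
print(api_registry["distance_3d"](chair, table))  # 5.0
```

The point of the registry pattern is that, unlike a static human-defined API, the set of callable primitives is not fixed in advance: each solved subproblem can leave behind a reusable function for later queries.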
Related papers
- LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capability but underperform on 3D tasks.
In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and inability to handle camera focal variations.
We employ parameter-efficient fine-tuning for a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph [0.3926357402982764]
We propose a modular approach called BBQ that constructs 3D scene graph representation with metric and semantic edges.
BBQ employs robust DINO-powered associations to construct a 3D object-centric map.
We show that BBQ takes a leading place in open-vocabulary 3D semantic segmentation compared to other zero-shot methods.
arXiv Detail & Related papers (2024-06-11T09:57:04Z)
- OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning [68.45848423501927]
We propose a holistic framework for strong alignment between agent models and 3D driving tasks.
Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D.
We propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model.
arXiv Detail & Related papers (2024-05-02T17:59:24Z)
- Transcrib3D: 3D Referring Expression Resolution through Large Language Models [28.121606686759225]
We introduce Transcrib3D, an approach that brings together 3D detection methods and the emergent reasoning capabilities of large language models.
Transcrib3D achieves state-of-the-art results on 3D reference resolution benchmarks.
We show that our method enables a real robot to perform pick-and-place tasks given queries that contain challenging referring expressions.
arXiv Detail & Related papers (2024-04-30T02:48:20Z)
- Think-Program-reCtify: 3D Situated Reasoning with Large Language Models [68.52240087262825]
This work addresses the 3D situated reasoning task which aims to answer questions given egocentric observations in a 3D environment.
We propose a novel framework that leverages the planning, tool usage, and reflection capabilities of large language models (LLMs) through a Think-Program-reCtify loop.
Experiments and analysis on the SQA3D benchmark demonstrate the effectiveness, interpretability and robustness of our method.
arXiv Detail & Related papers (2024-04-23T03:22:06Z)
- Probing the 3D Awareness of Visual Foundation Models [56.68380136809413]
We analyze the 3D awareness of visual foundation models.
We conduct experiments using task-specific probes and zero-shot inference procedures on frozen features.
arXiv Detail & Related papers (2024-04-12T17:58:04Z)
- ScanERU: Interactive 3D Visual Grounding based on Embodied Reference Understanding [67.21613160846299]
The task of Embodied Reference Understanding (ERU) is first designed for this setting.
New dataset called ScanERU is constructed to evaluate the effectiveness of this idea.
arXiv Detail & Related papers (2023-03-23T11:36:14Z)
- 3D_DEN: Open-ended 3D Object Recognition using Dynamically Expandable Networks [0.0]
We propose a new deep transfer learning approach based on a dynamic architectural method to make robots capable of open-ended learning about new 3D object categories.
Experimental results showed that the proposed model outperformed state-of-the-art approaches in accuracy while substantially reducing computational overhead.
arXiv Detail & Related papers (2020-09-15T16:44:18Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines in both success rate (SR) and success weighted by path length (SPL).
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.