Think-Program-reCtify: 3D Situated Reasoning with Large Language Models
- URL: http://arxiv.org/abs/2404.14705v1
- Date: Tue, 23 Apr 2024 03:22:06 GMT
- Title: Think-Program-reCtify: 3D Situated Reasoning with Large Language Models
- Authors: Qingrong He, Kejun Lin, Shizhe Chen, Anwen Hu, Qin Jin
- Abstract summary: This work addresses the 3D situated reasoning task which aims to answer questions given egocentric observations in a 3D environment.
We propose a novel framework that leverages the planning, tool usage, and reflection capabilities of large language models (LLMs) through a Think-Program-reCtify loop.
Experiments and analysis on the SQA3D benchmark demonstrate the effectiveness, interpretability and robustness of our method.
- Score: 68.52240087262825
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work addresses the 3D situated reasoning task which aims to answer questions given egocentric observations in a 3D environment. The task remains challenging as it requires comprehensive 3D perception and complex reasoning skills. End-to-end models trained on supervised data for 3D situated reasoning suffer from data scarcity and limited generalization ability. Inspired by the recent success of leveraging large language models (LLMs) for visual reasoning, we propose LLM-TPC, a novel framework that leverages the planning, tool usage, and reflection capabilities of LLMs through a Think-Program-reCtify loop. The Think phase first decomposes the compositional question into a sequence of steps, and then the Program phase grounds each step to a piece of code and calls carefully designed 3D visual perception modules. Finally, the Rectify phase adjusts the plan and code if the program fails to execute. Experiments and analysis on the SQA3D benchmark demonstrate the effectiveness, interpretability and robustness of our method. Our code is publicly available at https://qingrongh.github.io/LLM-TPC/.
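The abstract describes a Think-Program-reCtify loop: planning, program generation against 3D perception tools, and error-driven revision. The sketch below is a minimal illustration of how such a loop could be organized in Python; the `llm_tpc` helper, prompt wording, and tool names are illustrative assumptions, not the authors' released implementation (see the project page above for the actual code).

```python
# Minimal sketch of a Think-Program-reCtify loop, assuming an LLM callable
# and a dictionary of 3D perception tools. All names are hypothetical.
from typing import Any, Callable, Dict


def llm_tpc(question: str,
            scene: Dict[str, Any],
            llm: Callable[[str], str],
            tools: Dict[str, Callable],
            max_rounds: int = 3) -> str:
    """Answer a 3D situated question via Think -> Program -> reCtify."""
    feedback = ""
    for _ in range(max_rounds):
        # Think: decompose the compositional question into reasoning steps.
        plan = llm(f"Decompose into steps: {question}\n{feedback}")

        # Program: ground each step into code that calls the 3D perception tools.
        program = llm(f"Write code using tools {list(tools)} for plan:\n{plan}")

        try:
            # Execute the generated program against the 3D scene representation.
            namespace = {"scene": scene, **tools}
            exec(program, namespace)  # sketch only; sandbox this in practice
            return str(namespace.get("answer", ""))
        except Exception as err:
            # Rectify: feed the execution error back so the next round
            # can revise both the plan and the program.
            feedback = f"Previous attempt failed with: {err}. Revise the plan."

    return "unknown"
```

In this reading, the loop terminates as soon as a generated program executes successfully, and otherwise retries with the execution error appended to the prompt, which is one plausible way to realize the reflection behavior the abstract attributes to the Rectify phase.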
Related papers
- LLMI3D: Empowering LLM with 3D Perception from a Single 2D Image [72.14973729674995]
Current 3D perception methods, particularly small models, struggle with logical reasoning, question answering, and handling open scenario categories.
We propose solutions: Spatial-Enhanced Local Feature Mining for better spatial feature extraction, 3D Query Token-Derived Info Decoding for precise geometric regression, and Geometry Projection-Based 3D Reasoning for handling camera focal length variations.
arXiv Detail & Related papers (2024-08-14T10:00:16Z)
- Is a 3D-Tokenized LLM the Key to Reliable Autonomous Driving? [66.6886931183372]
We introduce DETR-style 3D perceptrons as 3D tokenizers, which connect to the LLM through a one-layer linear projector.
Despite its simplicity, Atlas demonstrates superior performance in both 3D detection and ego planning tasks.
arXiv Detail & Related papers (2024-05-28T16:57:44Z)
- Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model [108.35777542298224]
This paper introduces Reason3D, a novel large language model for comprehensive 3D understanding.
We propose a hierarchical mask decoder to locate small objects within expansive scenes.
Experiments validate that Reason3D achieves remarkable results on large-scale ScanNet and Matterport3D datasets.
arXiv Detail & Related papers (2024-05-27T17:59:41Z)
- Language-Image Models with 3D Understanding [59.499585515469974]
We develop a large-scale pre-training dataset for 2D and 3D called LV3D.
Next, we introduce a new MLLM named Cube-LLM and pre-train it on LV3D.
We show that pure data scaling yields strong 3D perception capability without any 3D-specific architectural design or training objective.
arXiv Detail & Related papers (2024-05-06T17:57:27Z)
- OmniDrive: A Holistic LLM-Agent Framework for Autonomous Driving with 3D Perception, Reasoning and Planning [68.45848423501927]
We propose a holistic framework for strong alignment between agent models and 3D driving tasks.
Our framework starts with a novel 3D MLLM architecture that uses sparse queries to lift and compress visual representations into 3D.
We propose OmniDrive-nuScenes, a new visual question-answering dataset challenging the true 3D situational awareness of a model.
arXiv Detail & Related papers (2024-05-02T17:59:24Z)
- LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning [42.61001274381612]
We present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both textual instructions and visual prompts.
Experiments show that LL3DA achieves remarkable results, and surpasses various 3D vision-language models on both 3D Captioning and 3D Question Answering.
arXiv Detail & Related papers (2023-11-30T16:00:23Z)
- CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding [23.885017062031217]
3D visual grounding is the ability to localize objects in 3D scenes conditioned on utterances.
Most existing methods task the referring head with localizing the referred object directly, which causes failures in complex scenarios.
We formulate the 3D visual grounding problem as a sequence-to-sequence (Seq2Seq) task by first predicting a chain of anchors and then the final target.
arXiv Detail & Related papers (2023-10-10T00:07:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.