DeepThink3D: Enhancing Large Language Models with Programmatic Reasoning in Complex 3D Situated Reasoning Tasks
- URL: http://arxiv.org/abs/2508.15548v1
- Date: Thu, 21 Aug 2025 13:28:36 GMT
- Title: DeepThink3D: Enhancing Large Language Models with Programmatic Reasoning in Complex 3D Situated Reasoning Tasks
- Authors: Jiayi Song, Rui Wan, Lipeng Ma, Weidong Yang, Qingyuan Zhou, Yixuan Li, Ben Fei,
- Abstract summary: Recent work has addressed the 3D situated reasoning task by invoking tool usage through large language models. We introduce DeepThink3D to enhance the tool usage of LLMs in complex 3D situated reasoning tasks.
- Score: 16.973343902054257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work enhances the ability of large language models (LLMs) to perform complex reasoning in 3D scenes. Recent work has addressed the 3D situated reasoning task by invoking tool usage through large language models. In this paradigm, LLMs call tools via APIs and integrate the generated programs through a chain of thought, solving problems based on the program results. However, because the questions in the dataset are simple, the generated program reasoning chains are relatively short. To address this challenge, we introduce DeepThink3D, which enhances the tool usage of LLMs in complex 3D situated reasoning tasks. Our work proposes a combinatorial and iterative evolutionary approach on the SQA3D benchmark to generate more complex questions. Building on this foundation, we fine-tune the large language model to make it more proficient in using 3D tools. By employing Direct Preference Optimization (DPO), we directly optimize the toolchain strategies generated by models, thereby enhancing their accuracy in complex tasks.
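As a rough numeric sketch of the DPO objective described above, applied to a pair of candidate toolchains (one correct, one rejected): the function name and the scalar log-probabilities are illustrative assumptions, not the paper's implementation.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each log-prob is the summed token log-likelihood of a full
    toolchain (program) under the trained policy or the frozen
    reference model; `beta` scales the implicit reward margin.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)): minimized when the policy prefers the
    # correct toolchain more strongly than the reference model does
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree, the margin is zero and the loss is log 2; as the policy shifts probability mass toward the correct toolchain, the loss decreases.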
Related papers
- VULCAN: Tool-Augmented Multi Agents for Iterative 3D Object Arrangement [66.13644883379087]
We tackle three key challenges in the 3D object arrangement task using MLLMs. First, to address the weak visual grounding of MLLMs, we introduce an MCP-based API. Second, we augment the MLLM's 3D scene understanding with a suite of specialized visual tools. Third, to manage iterative, error-prone updates, we propose a collaborative multi-agent framework.
arXiv Detail & Related papers (2025-12-26T19:22:39Z) - Abstract 3D Perception for Spatial Intelligence in Vision-Language Models [100.13033631690114]
Vision-language models (VLMs) struggle with 3D-related tasks such as spatial cognition and physical understanding. We introduce SandboxVLM, a framework that leverages abstract bounding boxes to encode geometric structure and physical kinematics for VLMs. Our approach consistently improves spatial intelligence, achieving an 8.3% gain on SAT Real compared with baseline methods.
arXiv Detail & Related papers (2025-11-14T04:16:09Z) - Vid-LLM: A Compact Video-based 3D Multimodal LLM with Reconstruction-Reasoning Synergy [4.1703677379815565]
We propose Vid-LLM, a video-based 3D-MLLM that directly processes video inputs without requiring external 3D data. In our method, geometric priors are used directly to improve scene perception. Experiments across diverse benchmarks verify the effectiveness of our method on 3D Question Answering, 3D Captioning, and 3D Visual Grounding tasks.
arXiv Detail & Related papers (2025-09-29T07:34:18Z) - SeqAffordSplat: Scene-level Sequential Affordance Reasoning on 3D Gaussian Splatting [85.87902260102652]
We introduce the novel task of Sequential 3D Gaussian Affordance Reasoning. We then propose SeqSplatNet, an end-to-end framework that directly maps an instruction to a sequence of 3D affordance masks. Our method sets a new state of the art on our challenging benchmark, effectively advancing affordance reasoning from single-step interactions to complex, sequential tasks at the scene level.
arXiv Detail & Related papers (2025-07-31T17:56:55Z) - SORT3D: Spatial Object-centric Reasoning Toolbox for Zero-Shot 3D Grounding Using Large Language Models [9.279591094901152]
SORT3D is an approach that utilizes rich object attributes from 2D data and combines a spatial reasoning toolbox with the sequential-reasoning ability of large language models (LLMs). We show that SORT3D achieves state-of-the-art zero-shot performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run in real time on two autonomous vehicles and demonstrate that our approach can be used for object-goal navigation in previously unseen real-world environments.
arXiv Detail & Related papers (2025-04-25T20:24:11Z) - Visual Agentic AI for Spatial Reasoning with a Dynamic API [26.759236329608935]
We introduce an agentic program synthesis approach to solve 3D spatial reasoning problems. Our method overcomes limitations of prior approaches that rely on a static, human-defined API. We show that our method outperforms prior zero-shot models for visual reasoning in 3D.
arXiv Detail & Related papers (2025-02-10T18:59:35Z) - Multimodal 3D Reasoning Segmentation with Complex Scenes [92.92045550692765]
We propose a 3D reasoning segmentation task for scenes containing multiple objects. The task produces 3D segmentation masks together with detailed textual explanations enriched with the 3D spatial relations among objects. In addition, we design MORE3D, a novel 3D reasoning network that works with queries over multiple objects.
arXiv Detail & Related papers (2024-11-21T08:22:45Z) - LLMI3D: MLLM-based 3D Perception from a Single 2D Image [77.13869413871028]
Multimodal large language models (MLLMs) excel in general capability but underperform in 3D tasks. In this paper, we propose solutions for weak 3D local spatial object perception, poor text-based geometric numerical output, and the inability to handle camera focal variations. We employ parameter-efficient fine-tuning of a pre-trained MLLM and develop LLMI3D, a powerful 3D perception MLLM.
arXiv Detail & Related papers (2024-08-14T10:00:16Z) - Reason3D: Searching and Reasoning 3D Segmentation via Large Language Model [108.35777542298224]
Reason3D processes point cloud data and text prompts to produce textual responses and segmentation masks. We propose a hierarchical mask decoder that employs a coarse-to-fine approach to segment objects within expansive scenes.
arXiv Detail & Related papers (2024-05-27T17:59:41Z) - Think-Program-reCtify: 3D Situated Reasoning with Large Language Models [68.52240087262825]
This work addresses the 3D situated reasoning task which aims to answer questions given egocentric observations in a 3D environment.
We propose a novel framework that leverages the planning, tool-usage, and reflection capabilities of large language models (LLMs) through a Think-Program-reCtify loop.
Experiments and analysis on the SQA3D benchmark demonstrate the effectiveness, interpretability and robustness of our method.
arXiv Detail & Related papers (2024-04-23T03:22:06Z) - ControlLLM: Augment Language Models with Tools by Searching on Graphs [97.62758830255002]
We present ControlLLM, a novel framework that enables large language models (LLMs) to utilize multi-modal tools for solving real-world tasks.
Our framework comprises three key components: (1) a task decomposer that breaks down a complex task into clear subtasks with well-defined inputs and outputs; (2) a Thoughts-on-Graph (ToG) paradigm that searches for the optimal solution path on a pre-built tool graph; and (3) an execution engine with a rich toolbox that interprets the solution path and runs the tools.
arXiv Detail & Related papers (2023-10-26T21:57:21Z) - 3D-GPT: Procedural 3D Modeling with Large Language Models [47.72968643115063]
We introduce 3D-GPT, a framework utilizing large language models(LLMs) for instruction-driven 3D modeling.
3D-GPT positions LLMs as proficient problem solvers, dissecting the procedural 3D modeling tasks into accessible segments and appointing the apt agent for each task.
Our empirical investigations confirm that 3D-GPT not only interprets and executes instructions, delivering reliable results, but also collaborates effectively with human designers.
arXiv Detail & Related papers (2023-10-19T17:41:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.