Geometrically-Constrained Agent for Spatial Reasoning
- URL: http://arxiv.org/abs/2511.22659v1
- Date: Thu, 27 Nov 2025 17:50:37 GMT
- Title: Geometrically-Constrained Agent for Spatial Reasoning
- Authors: Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, Lu Sheng,
- Abstract summary: Vision Language Models exhibit a fundamental semantic-to-geometric gap in spatial reasoning.<n>Current paradigms fail to bridge this gap.<n>We propose a training-free agentic paradigm that resolves this gap by introducing a formal task constraint.
- Score: 53.93718394870856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference but their reasoning operates within a lossy semantic space, misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an ``oracle paradox,'' learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouples the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into the formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained reasoning strategy successfully resolve the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%. Please see our homepage at https://gca-spatial-reasoning.github.io.
Related papers
- On Multi-Step Theorem Prediction via Non-Parametric Structural Priors [50.16583672681106]
In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL)<n>We propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference.<n>Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models.
arXiv Detail & Related papers (2026-03-05T06:08:50Z) - On the Paradoxical Interference between Instruction-Following and Task Solving [50.75960598434753]
Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed.<n>We reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with LLMs' task-solving capability.<n>We propose a metric, SUSTAINSCORE, to quantify the interference of instruction following with task solving.
arXiv Detail & Related papers (2026-01-29T17:48:56Z) - S$^2$GR: Stepwise Semantic-Guided Reasoning in Latent Space for Generative Recommendation [15.69884243417431]
Generative Recommendation (GR) has emerged as a transformative paradigm with its end-to-end generation advantages.<n>Existing GR methods primarily focus on direct Semantic ID (SID) generation from interaction sequences.<n>We propose stepwise semantic-guided reasoning in latent space (S$2$GR), a novel reasoning enhanced GR framework.
arXiv Detail & Related papers (2026-01-26T16:40:37Z) - TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning [104.66714520975837]
We introduce a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game.<n>We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications.<n>We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints.
arXiv Detail & Related papers (2026-01-23T07:35:05Z) - Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing [8.731693840957716]
Think2Seg-RS is a framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts.<n>The framework achieves state-of-the-art performance on the EarthReason dataset.<n> compact segmenters outperform larger ones under semantic-level supervision, and that negative prompts are ineffective in heterogeneous aerial backgrounds.
arXiv Detail & Related papers (2025-12-22T11:46:42Z) - Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation [30.86820285729615]
Affordance-Guided Coarse-to-Fine Exploration integrates semantic understanding from vision-language models with geometric feasibility.<n>Our system achieves 85% success rate, significantly outperforming classical geometric planners and VLM-based methods.
arXiv Detail & Related papers (2025-11-09T05:52:22Z) - Constraints-of-Thought: A Framework for Constrained Reasoning in Language-Model-Guided Search [3.0130126601831235]
Constraints-of-Thought (Const-o-T) is a framework that enables Monte Carlo Tree Search (MCTS) focus search on semantically meaningful paths.<n>We demonstrate that Const-o-T offers a generalizable foundation for constraint-guided reasoning, enabling more efficient, constraint-aligned, and domain-adaptable planning.
arXiv Detail & Related papers (2025-10-10T04:21:18Z) - Dense Semantic Matching with VGGT Prior [49.42199006453071]
We propose an approach that retains VGGT's intrinsic strengths by reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences.<n>Our approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.
arXiv Detail & Related papers (2025-09-25T14:56:11Z) - OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning [50.45036742963495]
We introduce OmniEVA, an embodied versatile planner that enables advanced embodied reasoning and task planning.<n>A Task-Adaptive 3D Grounding mechanism enables context-aware 3D grounding for diverse embodied tasks.<n>An Embodiment-Aware Reasoning framework incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable.
arXiv Detail & Related papers (2025-09-11T10:32:22Z) - Implicit Reasoning in Large Language Models: A Comprehensive Survey [67.53966514728383]
Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks.<n>Recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning.<n>This survey introduces a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies.
arXiv Detail & Related papers (2025-09-02T14:16:02Z) - Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving [9.550050299909184]
We present a new approach that integrates Chain-of-Thought (CoT) with formal language.<n>The model interleaves natural language reasoning with incremental emission of solver-executable code.<n>Built on Qwen2.5-VL-7B, our new model, GF-Reasoner, achieves up to 15% accuracy improvements on standard GPS benchmarks.
arXiv Detail & Related papers (2025-08-12T17:26:23Z) - PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation [14.311585896189506]
We propose Primitive-Aware Semantic Grounding (PASG) to bridge the gap between task semantics and geometric features.<n>We demonstrate PASG's effectiveness in practical robotic manipulation tasks across diverse scenarios, achieving performance comparable to manual annotations.
arXiv Detail & Related papers (2025-08-08T03:23:33Z) - Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided and Self-Consistent MLLMs for Task Planning in Instruction-Following Manipulation [5.903105418868711]
We introduce textbfQuARC (Quantity, Analysis, Relative positioning, Collision), a new benchmark based on a food preparation scenario.<n>We tackle two major limitations of current MLLMs: cross-modal distraction and geometric infeasibility.<n>Our method achieves a 76.7% success rate on the benchmark, significantly outperforming the ViLa baseline.
arXiv Detail & Related papers (2025-03-17T11:01:02Z) - FLARE: Faithful Logic-Aided Reasoning and Exploration [47.46564769245296]
We introduce a novel approach for traversing the problem space using task decompositions.<n>We use the Large Language Models to plan a solution, soft-formalise the query into facts and predicates using a logic programming code.<n>Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers.
arXiv Detail & Related papers (2024-10-14T19:39:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.