Related papers: Geometrically-Constrained Agent for Spatial Reasoning

URL: http://arxiv.org/abs/2511.22659v1
Date: Thu, 27 Nov 2025 17:50:37 GMT
Title: Geometrically-Constrained Agent for Spatial Reasoning
Authors: Zeren Chen, Xiaoya Lu, Zhijie Zheng, Pengrui Li, Lehan He, Yijin Zhou, Jing Shao, Bohan Zhuang, Lu Sheng
Abstract summary: Vision Language Models exhibit a fundamental semantic-to-geometric gap in spatial reasoning. Current paradigms fail to bridge this gap. We propose a training-free agentic paradigm that resolves this gap by introducing a formal task constraint.
Score: 53.93718394870856
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Language Models (VLMs) exhibit a fundamental semantic-to-geometric gap in spatial reasoning: they excel at qualitative semantic inference, but their reasoning operates within a lossy semantic space that is misaligned with high-fidelity geometry. Current paradigms fail to bridge this gap. Training-based methods suffer from an "oracle paradox," learning flawed spatial logic from imperfect oracles. Tool-integrated methods constrain the final computation but critically leave the VLM's planning process unconstrained, resulting in geometrically flawed plans. In this work, we propose Geometrically-Constrained Agent (GCA), a training-free agentic paradigm that resolves this gap by introducing a formal task constraint. Specifically, we strategically decouple the VLM's role into two stages. First, acting as a semantic analyst, the VLM translates the user's ambiguous query into the formal, verifiable task constraint, which defines the reference frame and objective. Second, acting as a task solver, the VLM generates and executes tool calls strictly within the deterministic bounds defined by the constraint. This geometrically-constrained reasoning strategy successfully resolves the semantic-to-geometric gap, yielding a robust and verifiable reasoning pathway for spatial reasoning. Comprehensive experiments demonstrate that GCA achieves SOTA performance on multiple spatial reasoning benchmarks, surpassing existing training-based and tool-integrated methods by ~27%. Please see our homepage at https://gca-spatial-reasoning.github.io.
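
The two-stage split described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `TaskConstraint`, `ToolCall`, `analyze_query`, `solve`, and the `side_of` tool are all hypothetical names, the VLM is replaced by a hard-coded stub, and geometry is reduced to a 2D ground plane. The sketch only shows the idea that a formal constraint (reference frame plus objective) is fixed first, and that later tool calls are executed only if they stay within it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]  # 2D ground-plane coordinates (an assumption)


@dataclass(frozen=True)
class TaskConstraint:
    """Hypothetical formal constraint: a reference frame plus an objective."""
    origin: Point             # position of the reference object
    forward: Point            # direction the reference frame faces
    objective: str            # what must be computed, e.g. "left_or_right"
    allowed_tools: frozenset  # tools the solver may call for this objective


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical tool call emitted by the solver stage."""
    name: str
    target: Point


def side_of(constraint: TaskConstraint, target: Point) -> str:
    """Deterministic tool: which side of the frame's forward axis is target?"""
    dx = target[0] - constraint.origin[0]
    dy = target[1] - constraint.origin[1]
    fx, fy = constraint.forward
    cross = fx * dy - fy * dx  # sign of the 2D cross product forward x offset
    if cross > 0:
        return "left"
    if cross < 0:
        return "right"
    return "aligned"


TOOLS: Dict[str, Callable[[TaskConstraint, Point], str]] = {"side_of": side_of}


def analyze_query(query: str) -> TaskConstraint:
    """Stage 1 (semantic analyst). A stub standing in for the VLM: it always
    returns a frame at the origin facing +y, regardless of the query text."""
    return TaskConstraint(
        origin=(0.0, 0.0),
        forward=(0.0, 1.0),
        objective="left_or_right",
        allowed_tools=frozenset({"side_of"}),
    )


def solve(constraint: TaskConstraint, plan: List[ToolCall]) -> List[str]:
    """Stage 2 (task solver). Executes a plan, rejecting any tool call that
    falls outside the bounds set by the constraint."""
    results = []
    for call in plan:
        if call.name not in constraint.allowed_tools:
            raise ValueError(f"tool {call.name!r} violates the task constraint")
        results.append(TOOLS[call.name](constraint, call.target))
    return results


constraint = analyze_query("Is the cup to the left of the chair?")
print(solve(constraint, [ToolCall("side_of", (-1.0, 0.0))]))
```

In this sketch the reference frame is decided once in stage 1 and passed unchanged to every tool, so a plan cannot silently switch frames midway; a call to a tool outside `allowed_tools` raises an error instead of being executed.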

Related papers

On Multi-Step Theorem Prediction via Non-Parametric Structural Priors [50.16583672681106]
In this work, we explore training-free theorem prediction through the lens of in-context learning (ICL). We propose Theorem Precedence Graphs, which encode temporal dependencies from historical solution traces as directed graphs, and impose explicit topological constraints that effectively prune the search space during inference. Experiments on the FormalGeo7k benchmark show that our method achieves 89.29% accuracy, substantially outperforming ICL baselines and matching state-of-the-art supervised models.
arXiv Detail & Related papers (2026-03-05T06:08:50Z)
On the Paradoxical Interference between Instruction-Following and Task Solving [50.75960598434753]
Instruction following aims to align Large Language Models (LLMs) with human intent by specifying explicit constraints on how tasks should be performed. We reveal a counterintuitive phenomenon: instruction following can paradoxically interfere with LLMs' task-solving capability. We propose a metric, SUSTAINSCORE, to quantify the interference of instruction following with task solving.
arXiv Detail & Related papers (2026-01-29T17:48:56Z)
S$^2$GR: Stepwise Semantic-Guided Reasoning in Latent Space for Generative Recommendation [15.69884243417431]
Generative Recommendation (GR) has emerged as a transformative paradigm with its end-to-end generation advantages. Existing GR methods primarily focus on direct Semantic ID (SID) generation from interaction sequences. We propose stepwise semantic-guided reasoning in latent space (S$^2$GR), a novel reasoning-enhanced GR framework.
arXiv Detail & Related papers (2026-01-26T16:40:37Z)
TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning [104.66714520975837]
We introduce a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints.
arXiv Detail & Related papers (2026-01-23T07:35:05Z)
Bridging Semantics and Geometry: A Decoupled LVLM-SAM Framework for Reasoning Segmentation in Remote Sensing [8.731693840957716]
Think2Seg-RS is a framework that trains an LVLM prompter to control a frozen Segment Anything Model (SAM) via structured geometric prompts. The framework achieves state-of-the-art performance on the EarthReason dataset. Compact segmenters outperform larger ones under semantic-level supervision, and negative prompts are ineffective in heterogeneous aerial backgrounds.
arXiv Detail & Related papers (2025-12-22T11:46:42Z)
Affordance-Guided Coarse-to-Fine Exploration for Base Placement in Open-Vocabulary Mobile Manipulation [30.86820285729615]
Affordance-Guided Coarse-to-Fine Exploration integrates semantic understanding from vision-language models with geometric feasibility. Our system achieves an 85% success rate, significantly outperforming classical geometric planners and VLM-based methods.
arXiv Detail & Related papers (2025-11-09T05:52:22Z)
Constraints-of-Thought: A Framework for Constrained Reasoning in Language-Model-Guided Search [3.0130126601831235]
Constraints-of-Thought (Const-o-T) is a framework that enables Monte Carlo Tree Search (MCTS) to focus search on semantically meaningful paths. We demonstrate that Const-o-T offers a generalizable foundation for constraint-guided reasoning, enabling more efficient, constraint-aligned, and domain-adaptable planning.
arXiv Detail & Related papers (2025-10-10T04:21:18Z)
Dense Semantic Matching with VGGT Prior [49.42199006453071]
We propose an approach that retains VGGT's intrinsic strengths by reusing early feature stages, fine-tuning later ones, and adding a semantic head for bidirectional correspondences. Our approach achieves superior geometry awareness, matching reliability, and manifold preservation, outperforming previous baselines.
arXiv Detail & Related papers (2025-09-25T14:56:11Z)
OmniEVA: Embodied Versatile Planner via Task-Adaptive 3D-Grounded and Embodiment-aware Reasoning [50.45036742963495]
We introduce OmniEVA, an embodied versatile planner that enables advanced embodied reasoning and task planning. A Task-Adaptive 3D Grounding mechanism enables context-aware 3D grounding for diverse embodied tasks. An Embodiment-Aware Reasoning framework incorporates task goals and embodiment constraints into the reasoning loop, resulting in planning decisions that are both goal-directed and executable.
arXiv Detail & Related papers (2025-09-11T10:32:22Z)
Implicit Reasoning in Large Language Models: A Comprehensive Survey [67.53966514728383]
Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning. This survey introduces a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies.
arXiv Detail & Related papers (2025-09-02T14:16:02Z)
Bridging Formal Language with Chain-of-Thought Reasoning to Geometry Problem Solving [9.550050299909184]
We present a new approach that integrates Chain-of-Thought (CoT) with formal language. The model interleaves natural language reasoning with incremental emission of solver-executable code. Built on Qwen2.5-VL-7B, our new model, GF-Reasoner, achieves up to 15% accuracy improvements on standard GPS benchmarks.
arXiv Detail & Related papers (2025-08-12T17:26:23Z)
PASG: A Closed-Loop Framework for Automated Geometric Primitive Extraction and Semantic Anchoring in Robotic Manipulation [14.311585896189506]
We propose Primitive-Aware Semantic Grounding (PASG) to bridge the gap between task semantics and geometric features. We demonstrate PASG's effectiveness in practical robotic manipulation tasks across diverse scenarios, achieving performance comparable to manual annotations.
arXiv Detail & Related papers (2025-08-08T03:23:33Z)
Mitigating Cross-Modal Distraction and Ensuring Geometric Feasibility via Affordance-Guided and Self-Consistent MLLMs for Task Planning in Instruction-Following Manipulation [5.903105418868711]
We introduce QuARC (Quantity, Analysis, Relative positioning, Collision), a new benchmark based on a food preparation scenario. We tackle two major limitations of current MLLMs: cross-modal distraction and geometric infeasibility. Our method achieves a 76.7% success rate on the benchmark, significantly outperforming the ViLa baseline.
arXiv Detail & Related papers (2025-03-17T11:01:02Z)
FLARE: Faithful Logic-Aided Reasoning and Exploration [47.46564769245296]
We introduce a novel approach for traversing the problem space using task decompositions. We use Large Language Models to plan a solution and soft-formalise the query into facts and predicates using logic programming code. Our method allows us to compute the faithfulness of the reasoning process w.r.t. the generated code and analyse the steps of the multi-hop search without relying on external solvers.
arXiv Detail & Related papers (2024-10-14T19:39:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.