Related papers: Learning Proposes, Geometry Disposes: A Modular Framework for Efficient Spatial Reasoning

URL: http://arxiv.org/abs/2602.14409v1
Date: Mon, 16 Feb 2026 02:26:59 GMT
Title: Learning Proposes, Geometry Disposes: A Modular Framework for Efficient Spatial Reasoning
Authors: Haichao Zhu, Zhaorui Yang, Qian Zhang
Abstract summary: Spatial perception aims to estimate camera motion and scene structure from visual observations. Recent learning-based methods have demonstrated strong representational capacity for geometric perception. In this work, we investigate an end-to-end modular framework for effective spatial reasoning.
Score: 3.5072793256984105
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatial perception aims to estimate camera motion and scene structure from visual observations, a problem traditionally addressed through geometric modeling and physical consistency constraints. Recent learning-based methods have demonstrated strong representational capacity for geometric perception and are increasingly used to augment classical geometry-centric systems in practice. However, whether learning components should directly replace geometric estimation or instead serve as intermediate modules within such pipelines remains an open question. In this work, we address this gap and investigate an end-to-end modular framework for effective spatial reasoning, where learning proposes geometric hypotheses, while geometric algorithms dispose estimation decisions. In particular, we study this principle in the context of relative camera pose estimation on RGB-D sequences. Using VGGT as a representative learning model, we evaluate learning-based pose and depth proposals under varying motion magnitudes and scene dynamics, followed by a classical point-to-plane RGB-D ICP as the geometric backend. Our experiments on the TUM RGB-D benchmark reveal three consistent findings: (1) learning-based pose proposals alone are unreliable; (2) learning-proposed geometry, when improperly aligned with camera intrinsics, can degrade performance; and (3) when learning-proposed depth is geometrically aligned and followed by a geometric disposal stage, consistent improvements emerge in moderately challenging rigid settings. These results demonstrate that geometry is not merely a refinement component, but an essential arbiter that validates and absorbs learning-based geometric observations. Our study highlights the importance of modular, geometry-aware system design for robust spatial perception.
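The two geometric ingredients named in the abstract, lifting a depth map to 3D with camera intrinsics and a point-to-plane alignment step, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `backproject` and `point_to_plane_step` are hypothetical, the pinhole model without distortion is an assumption, and correspondences are assumed to be given already.

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy):
    """Lift an (H, W) depth map to an (H, W, 3) point map (pinhole model, assumed)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u: column index, v: row index
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)


def point_to_plane_step(src, dst, dst_normals):
    """One linearized point-to-plane update for matched (N, 3) arrays.

    Minimizes sum(((R @ p + t - q) . n)^2) with the small-angle
    approximation R ~ I + [omega]_x, and returns (omega, t).
    """
    # Each row is [p x n, n]; it multiplies the unknown vector [omega, t].
    A = np.hstack([np.cross(src, dst_normals), dst_normals])
    # Right-hand side: signed distance from each source point to its target plane.
    b = np.einsum("ij,ij->i", dst - src, dst_normals)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3], x[3:]
```

A full ICP backend would repeat this step with a correspondence search and outlier rejection on every iteration, and would convert `omega` to a proper rotation matrix before applying it. The sketch covers only the inner least-squares solve.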

Related papers

TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning [104.66714520975837]
We introduce a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints.
arXiv Detail & Related papers (2026-01-23T07:35:05Z)
NoReGeo: Non-Reasoning Geometry Benchmark [5.288175082601994]
NoReGeo is a novel benchmark designed to evaluate the intrinsic geometric understanding of large language models (LLMs). Our benchmark comprises 2,500 trivial geometric problems spanning 25 categories, each carefully crafted to be solvable purely through native geometric understanding. We assess a range of state-of-the-art models on NoReGeo, including frontier models like GPT-4, observing that even the most advanced systems achieve an overall maximum of 65% accuracy in binary classification tasks.
arXiv Detail & Related papers (2026-01-15T10:22:55Z)
Physics-Informed Neural Networks for MIMO Beam Map and Environment Reconstruction [67.65578956523403]
Geometry-aware feature extraction from channel state information (CSI) emerges as a pivotal methodology to bridge physical-layer measurements with network intelligence. This paper proposes to explore the received signal strength (RSS) data, without explicit 3D environment knowledge, to jointly construct the radio beam map and environmental geometry. A physics-informed deep learning framework that incorporates the reflective-zone-based geometry model is proposed to learn the blockage, reflection, and scattering components.
arXiv Detail & Related papers (2025-10-24T08:17:14Z)
GIQ: Benchmarking 3D Geometric Reasoning of Vision Foundation Models with Simulated and Real Polyhedra [33.53387523266523]
We introduce GIQ, a benchmark specifically designed to evaluate the geometric reasoning capabilities of vision and vision-language foundation models. GIQ comprises synthetic and real-world images of 224 diverse polyhedra.
arXiv Detail & Related papers (2025-06-09T20:11:21Z)
Position: Beyond Euclidean -- Foundation Models Should Embrace Non-Euclidean Geometries [42.83280708842304]
Euclidean space has been the de facto geometric setting for machine learning architectures. At a large scale, real-world data often exhibit inherently non-Euclidean structures, such as multi-way relationships, hierarchies, symmetries, and non-isotropic scaling. This paper argues that moving beyond Euclidean geometry is not merely an optional enhancement but a necessity to maintain the scaling law for the next generation of foundation models.
arXiv Detail & Related papers (2025-04-11T18:07:33Z)
FLARE: Feed-forward Geometry, Appearance and Camera Estimation from Uncalibrated Sparse Views [100.45129752375658]
We present FLARE, a feed-forward model designed to infer high-quality camera poses and 3D geometry from uncalibrated sparse-view images. Our solution features a cascaded learning paradigm with camera pose serving as the critical bridge, recognizing its essential role in mapping 3D structures onto 2D image planes.
arXiv Detail & Related papers (2025-02-17T18:54:05Z)
Geometric Point Attention Transformer for 3D Shape Reassembly [17.34739330880715]
We present a network specifically designed to address the challenges of reasoning about geometric relationships. We integrate both global shape information and local pairwise geometric features, along with poses represented as rotation and translation vectors for each part. We evaluate our model on both the semantic and geometric assembly tasks, showing that it outperforms previous methods in absolute pose estimation.
arXiv Detail & Related papers (2024-11-26T15:29:38Z)
Geometry-guided Feature Learning and Fusion for Indoor Scene Reconstruction [14.225228781008209]
This paper proposes a novel geometry integration mechanism for 3D scene reconstruction. Our approach incorporates 3D geometry at three levels, i.e. feature learning, feature fusion, and network supervision.
arXiv Detail & Related papers (2024-08-28T08:02:47Z)
Adaptive Surface Normal Constraint for Geometric Estimation from Monocular Images [56.86175251327466]
We introduce a novel approach to learn geometries such as depth and surface normal from images while incorporating geometric context. Our approach extracts geometric context that encodes the geometric variations present in the input image and correlates depth estimation with geometric constraints. Our method unifies depth and surface normal estimations within a cohesive framework, which enables the generation of high-quality 3D geometry from images.
arXiv Detail & Related papers (2024-02-08T17:57:59Z)
DONet: Learning Category-Level 6D Object Pose and Size Estimation from Depth Observation [53.55300278592281]
We propose a method of Category-level 6D Object Pose and Size Estimation (COPSE) from a single depth image. Our framework makes inferences based on the rich geometric information of the object in the depth channel alone. Our framework competes with state-of-the-art approaches that require labeled real-world images.
arXiv Detail & Related papers (2021-06-27T10:41:50Z)
Self-supervised Geometric Perception [96.89966337518854]
Self-supervised geometric perception is a framework to learn a feature descriptor for correspondence matching without any ground-truth geometric model labels. We show that SGP achieves state-of-the-art performance that is on-par or superior to the supervised oracles trained using ground-truth labels.
arXiv Detail & Related papers (2021-03-04T15:34:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.