SpatialMath: Spatial Comprehension-Infused Symbolic Reasoning for Mathematical Problem-Solving
- URL: http://arxiv.org/abs/2601.17489v1
- Date: Sat, 24 Jan 2026 15:31:20 GMT
- Title: SpatialMath: Spatial Comprehension-Infused Symbolic Reasoning for Mathematical Problem-Solving
- Authors: Ashutosh Bajpai, Akshat Bhandari, Akshay Nambi, Tanmoy Chakraborty
- Abstract summary: Multimodal Small-to-Medium sized Language Models (MSLMs) have demonstrated strong capabilities in integrating visual and textual information. Current models struggle to accurately decompose intricate visual inputs and connect geometric perception with structured reasoning. We propose SpatialMath, a novel Spatial Comprehension-Infused Symbolic Reasoning Framework designed to integrate spatial representations into structured symbolic reasoning chains.
- Score: 17.304596904197204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Small-to-Medium sized Language Models (MSLMs) have demonstrated strong capabilities in integrating visual and textual information but still face significant limitations in visual comprehension and mathematical reasoning, particularly in geometric problems with diverse levels of visual infusion. Current models struggle to accurately decompose intricate visual inputs and connect perception with structured reasoning, leading to suboptimal performance. To address these challenges, we propose SpatialMath, a novel Spatial Comprehension-Infused Symbolic Reasoning Framework designed to integrate spatial representations into structured symbolic reasoning chains. SpatialMath employs a specialized perception module to extract spatially-grounded representations from visual diagrams, capturing critical geometric structures and spatial relationships. These representations are then methodically infused into symbolic reasoning chains, facilitating visual comprehension-aware structured reasoning. To this end, we introduce MATHVERSE-PLUS, a novel dataset containing structured visual interpretations and step-by-step reasoning paths for vision-intensive mathematical problems. SpatialMath significantly outperforms strong multimodal baselines, achieving up to 10 percentage points improvement over supervised fine-tuning with data augmentation in vision-intensive settings. Robustness analysis reveals that enhanced spatial representations directly improve reasoning accuracy, reinforcing the need for structured perception-to-reasoning pipelines in MSLMs.
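The abstract describes a two-stage flow: a perception module extracts spatially-grounded facts from the diagram, and those facts are then infused into a symbolic reasoning chain. The sketch below is a minimal illustration of that flow under assumed interfaces; `SpatialRelation`, `perceive_diagram`, and `infuse_into_reasoning` are hypothetical names, not the paper's code.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SpatialRelation:
    """A single spatially-grounded fact extracted from a diagram (illustrative schema)."""
    subject: str   # e.g. "segment AB"
    relation: str  # e.g. "perpendicular_to", "midpoint_of"
    obj: str       # e.g. "segment CD"

def perceive_diagram(image_path: str) -> List[SpatialRelation]:
    """Stand-in for the perception module: extract geometric structures
    and spatial relationships from the diagram."""
    ...  # e.g. a fine-tuned vision encoder plus a relation decoder

def infuse_into_reasoning(question: str, relations: List[SpatialRelation]) -> str:
    """Serialize the spatial facts and prepend them to the symbolic reasoning
    chain so each step can reference grounded geometry."""
    facts = "\n".join(f"- {r.subject} {r.relation} {r.obj}" for r in relations)
    return (
        "Spatial facts extracted from the diagram:\n"
        f"{facts}\n\n"
        f"Problem: {question}\n"
        "Reason step by step, citing the spatial facts used at each step."
    )
```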
Related papers
- TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning [104.66714520975837]
We introduce a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints.
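For intuition, a minimal sketch of what an "exact, machine-verifiable coordinate specification" could look like in practice; the piece representation and the shoelace-based area check below are illustrative assumptions, not the TCE formalism itself.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class TangramPiece:
    name: str              # e.g. "large_triangle_1"
    vertices: List[Point]  # exact coordinates after placement

def assembly_area(pieces: List[TangramPiece]) -> float:
    """Machine-verifiable check: total area of the placed pieces,
    computed with the shoelace formula on each polygon."""
    def polygon_area(poly: List[Point]) -> float:
        s = 0.0
        for (x1, y1), (x2, y2) in zip(poly, poly[1:] + poly[:1]):
            s += x1 * y2 - x2 * y1
        return abs(s) / 2.0
    return sum(polygon_area(p.vertices) for p in pieces)
```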
arXiv Detail & Related papers (2026-01-23T07:35:05Z) - VisTIRA: Closing the Image-Text Modality Gap in Visual Math Reasoning via Structured Tool Integration [2.7403985180660784]
Vision-language models (VLMs) lag behind text-only language models on mathematical reasoning when the same problems are presented as images rather than text. We introduce VisTIRA, a tool-integrated reasoning framework that enables structured problem solving by decomposing a given math problem (as an image) into natural language rationales. We show that tool-integrated supervision improves image-based reasoning, and OCR grounding can further narrow the gap for smaller models.
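A hedged sketch of a tool-integrated loop in the spirit described here, with OCR grounding followed by alternating rationales and tool calls; every helper (`ocr`, `run_tool`, `model_generate`) is a placeholder stub, not the authors' implementation.

```python
MAX_STEPS = 8

def ocr(image_path: str) -> str:
    """Placeholder for the OCR grounding tool."""
    return "If 2x + 3 = 11, find x."

def run_tool(call: str) -> str:
    """Placeholder for a symbolic/numeric tool (e.g. an equation solver)."""
    return "TOOL RESULT: x = 4"

def model_generate(state: str) -> str:
    """Placeholder for the language model's next step."""
    return "ANSWER: x = 4"

def solve_with_tools(image_path: str) -> str:
    """Tool-integrated loop: OCR turns the image into text, then the model
    alternates between rationales and tool calls until a final answer."""
    state = f"Problem (from OCR): {ocr(image_path)}"
    for _ in range(MAX_STEPS):
        step = model_generate(state)
        if step.startswith("CALL:"):
            state += "\n" + step + "\n" + run_tool(step)
        elif step.startswith("ANSWER:"):
            return step
        else:
            state += "\n" + step
    return "no final answer"
```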
arXiv Detail & Related papers (2026-01-20T19:54:49Z) - CogFlow: Bridging Perception and Reasoning through Knowledge Internalization for Visual Mathematical Problem Solving [28.57547723919984]
We present CogFlow, a novel cognitive-inspired three-stage framework that incorporates a knowledge internalization stage. To guarantee faithful integration of extracted visual cues into subsequent reasoning, we introduce a Knowledge Internalization Reward model. We also contribute a new dataset, MathCog, for model training, containing over 120K high-quality perception-reasoning aligned annotations.
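A toy illustration of a reward that scores how faithfully extracted visual cues are integrated into a reasoning chain; this is an assumption about the flavor of such a reward, not the paper's actual Knowledge Internalization Reward model.

```python
from typing import List

def internalization_reward(visual_cues: List[str], reasoning: str) -> float:
    """Toy reward: fraction of extracted visual cues that are explicitly
    referenced in the reasoning chain, so faithful integration of
    perception into the reasoning steps is rewarded."""
    if not visual_cues:
        return 0.0
    used = sum(1 for cue in visual_cues if cue.lower() in reasoning.lower())
    return used / len(visual_cues)

# Example: two of three cues are internalized -> reward 2/3
reward = internalization_reward(
    ["angle ABC = 90 degrees", "AB = 5", "BC = 12"],
    "Since angle ABC = 90 degrees and AB = 5, AC follows by Pythagoras.",
)
```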
arXiv Detail & Related papers (2026-01-05T08:02:18Z) - Figure It Out: Improve the Frontier of Reasoning with Executable Visual States [53.77871196174248]
Complex reasoning problems often involve implicit spatial and geometric relationships that are not explicitly encoded in text. We introduce FIGR, which integrates executable visual construction into multi-turn reasoning via end-to-end reinforcement learning. Experiments on eight challenging mathematical benchmarks demonstrate that FIGR outperforms strong text-only chain-of-thought baselines.
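One way to picture "executable visual construction": the model emits a small construction program, it is executed, and the resulting visual state is fed back into the next reasoning turn. The sketch below is a simplifying assumption (including the crude sandboxing), not FIGR's actual mechanism.

```python
import math

def execute_visual_construction(program: str) -> dict:
    """Run a model-emitted construction snippet in a restricted namespace and
    return the named geometric objects it produced; this 'visual state' would
    be appended to the context for the next reasoning turn."""
    namespace: dict = {"math": math}
    exec(program, {"__builtins__": {}}, namespace)  # sandboxing is deliberately simplified here
    return {k: v for k, v in namespace.items() if not k.startswith("_") and k != "math"}

# A model-written step that drops a perpendicular foot onto the x-axis:
state = execute_visual_construction(
    "A = (3.0, 4.0)\n"
    "foot = (A[0], 0.0)\n"
    "height = A[1]\n"
)
# state -> {'A': (3.0, 4.0), 'foot': (3.0, 0.0), 'height': 4.0}
```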
arXiv Detail & Related papers (2025-12-30T15:39:11Z) - Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z) - Explain with Visual Keypoints Like a Real Mentor! A Benchmark for Multimodal Solution Explanation [19.4261670152456]
We introduce the multimodal solution explanation task, designed to evaluate whether models can identify visual keypoints (such as auxiliary lines, points, and angles) and generate explanations that incorporate these elements essential for understanding. Our empirical results show that, aside from recent large-scale open-source and closed-source models, most generalist open-source models and even math-specialist models struggle with the multimodal solution explanation task. This highlights a significant gap in current LLMs' ability to reason and explain with visual grounding in educational contexts.
arXiv Detail & Related papers (2025-04-04T06:03:13Z) - MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams [65.02628814094639]
Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements. Current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether Multimodal Large Language Models genuinely understand mathematical diagrams beyond superficial pattern recognition. We introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs. We also construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs annotated with geometric primitives and precise spatial relationships.
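For intuition, a hypothetical annotation record showing primitives plus precise spatial relations; the field names below are illustrative guesses, not GeoPeP's actual schema.

```python
# Hypothetical perception-oriented annotation record (illustrative only)
annotation = {
    "image": "geo_000123.png",
    "primitives": [
        {"id": "p1", "type": "point",   "coords": [0.0, 0.0], "label": "A"},
        {"id": "p2", "type": "point",   "coords": [4.0, 0.0], "label": "B"},
        {"id": "s1", "type": "segment", "endpoints": ["p1", "p2"]},
        {"id": "c1", "type": "circle",  "center": "p1", "radius": 4.0},
    ],
    "relations": [
        {"type": "on_circle",     "args": ["p2", "c1"]},
        {"type": "length_equals", "args": ["s1", 4.0]},
    ],
}
```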
arXiv Detail & Related papers (2025-03-26T17:30:41Z) - Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance. We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
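An illustrative feature router: a learned softmax gate that re-weights hierarchical visual feature maps before fusion. This is a sketch of the general idea under assumed tensor shapes, not SVE-Math's implementation.

```python
import torch
import torch.nn as nn

class FeatureRouter(nn.Module):
    """Learn a softmax gate over hierarchical visual feature maps so each
    level's contribution to the fused representation is adjusted per input."""
    def __init__(self, num_levels: int, dim: int):
        super().__init__()
        self.gate = nn.Linear(dim, num_levels)

    def forward(self, features):  # features: list of [B, N, dim] maps with equal N
        pooled = torch.stack([f.mean(dim=1) for f in features], dim=1)  # [B, L, dim]
        weights = torch.softmax(self.gate(pooled.mean(dim=1)), dim=-1)  # [B, L]
        stacked = torch.stack(features, dim=1)                          # [B, L, N, dim]
        return (weights[:, :, None, None] * stacked).sum(dim=1)         # [B, N, dim]

# Usage: fuse three levels of 196-token, 768-dim feature maps
router = FeatureRouter(num_levels=3, dim=768)
fused = router([torch.randn(2, 196, 768) for _ in range(3)])  # -> [2, 196, 768]
```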
arXiv Detail & Related papers (2025-01-11T04:08:44Z) - VISTA: Visual Integrated System for Tailored Automation in Math Problem Generation Using LLM [0.5383910843560784]
This paper introduces a novel multi-agent framework that leverages Large Language Models (LLMs) to automate the creation of complex mathematical visualizations alongside coherent problem text.
Our approach not only simplifies the generation of precise visual aids but also aligns these aids with the problem's core mathematical concepts, improving both problem creation and assessment.
arXiv Detail & Related papers (2024-11-08T09:15:56Z) - Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z) - Machine Number Sense: A Dataset of Visual Arithmetic Problems for Abstract and Relational Reasoning [95.18337034090648]
We propose a dataset, Machine Number Sense (MNS), consisting of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG).
These visual arithmetic problems are in the form of geometric figures.
We benchmark the MNS dataset using four predominant neural network models as baselines in this visual reasoning task.
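For intuition, a tiny grammar-based generator in the spirit of AOG-driven problem generation, where an And-node expands all of its children and an Or-node samples one; the grammar below is illustrative only and far simpler than the actual MNS grammar.

```python
import random

# Illustrative And-Or grammar: And-nodes expand every child, Or-nodes pick one.
GRAMMAR = {
    "problem": ("and", ["layout", "operation"]),
    "layout": ("or", ["triangle_figure", "square_figure"]),
    "operation": ("or", ["sum_of_corners", "product_of_corners"]),
}

def expand(symbol: str) -> list:
    node = GRAMMAR.get(symbol)
    if node is None:                             # terminal symbol
        return [symbol]
    kind, children = node
    if kind == "and":                            # And-node: expand every child
        return [t for c in children for t in expand(c)]
    return expand(random.choice(children))       # Or-node: choose one child

print(expand("problem"))
# e.g. ['triangle_figure', 'sum_of_corners'] -> render the figure, compute the answer
```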
arXiv Detail & Related papers (2020-04-25T17:14:58Z)