ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints
- URL: http://arxiv.org/abs/2511.18450v1
- Date: Sun, 23 Nov 2025 13:42:22 GMT
- Title: ORIGAMISPACE: Benchmarking Multimodal LLMs in Multi-Step Spatial Reasoning with Mathematical Constraints
- Authors: Rui Xu, Dakuan Lu, Zicheng Zhao, Xiaoyu Tan, Xintao Wang, Siyu Yuan, Jiangjie Chen, Yinghui Xu,
- Abstract summary: This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability of MLLMs. We propose four evaluation tasks: Pattern Prediction, Multi-step Spatial Reasoning, Spatial Relationship Prediction, and End-to-End CP Code Generation.
- Score: 42.713620384054146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial reasoning is a key capability in the field of artificial intelligence, especially crucial in areas such as robotics, computer vision, and natural language understanding. However, evaluating the ability of multimodal large language models (MLLMs) in complex spatial reasoning still faces challenges, particularly in scenarios requiring multi-step reasoning and precise mathematical constraints. This paper introduces ORIGAMISPACE, a new dataset and benchmark designed to evaluate the multi-step spatial reasoning ability and the capacity to handle mathematical constraints of MLLMs through origami tasks. The dataset contains 350 data instances, each comprising a strictly formatted crease pattern (CP diagram), the Compiled Flat Pattern, the complete Folding Process, and the final Folded Shape Image. We propose four evaluation tasks: Pattern Prediction, Multi-step Spatial Reasoning, Spatial Relationship Prediction, and End-to-End CP Code Generation. For the CP code generation task, we design an interactive environment and explore the possibility of using reinforcement learning methods to train MLLMs. Through experiments on existing MLLMs, we provide an initial view of the strengths and weaknesses of these models in handling complex spatial reasoning tasks.
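To make the dataset and task structure described above more concrete, here is a minimal Python sketch of how one ORIGAMISPACE instance and its four evaluation tasks might be represented. The class names, field names, and prompt wording are illustrative assumptions drawn only from the abstract, not the benchmark's actual schema or API.

```python
# Hypothetical sketch of an ORIGAMISPACE instance and its four evaluation
# tasks; all names are illustrative assumptions, not the released schema.
from dataclasses import dataclass
from enum import Enum
from typing import List


class Task(Enum):
    PATTERN_PREDICTION = "pattern_prediction"
    MULTI_STEP_SPATIAL_REASONING = "multi_step_spatial_reasoning"
    SPATIAL_RELATIONSHIP_PREDICTION = "spatial_relationship_prediction"
    END_TO_END_CP_CODE_GENERATION = "end_to_end_cp_code_generation"


@dataclass
class OrigamiInstance:
    cp_code: str                  # strictly formatted crease pattern (CP) description
    compiled_flat_pattern: str    # path to the rendered CP diagram image
    folding_process: List[str]    # ordered fold steps (paths to intermediate images)
    folded_shape_image: str       # path to the final folded-shape image


def build_prompt(instance: OrigamiInstance, task: Task) -> str:
    """Assemble a text prompt for one (instance, task) pair; an MLLM would
    additionally receive the referenced images as visual inputs."""
    if task is Task.END_TO_END_CP_CODE_GENERATION:
        return ("Given the final folded shape image, generate the complete "
                "CP code that reproduces it.")
    if task is Task.PATTERN_PREDICTION:
        return "Predict the compiled flat pattern implied by this CP diagram."
    if task is Task.SPATIAL_RELATIONSHIP_PREDICTION:
        return ("Predict the spatial relationship between the marked facets "
                "after folding.")
    return "Reason step by step about how the paper folds through the given sequence."
```

For the End-to-End CP Code Generation task, the abstract's interactive environment would presumably execute the generated CP code and compare the resulting fold against the reference, which is what would make a reinforcement-learning reward signal possible.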
Related papers
- PositionOCR: Augmenting Positional Awareness in Multi-Modal Models via Hybrid Specialist Integration [17.887453138676964]
We introduce PositionOCR, a parameter-efficient hybrid architecture that seamlessly integrates a text spotting model's positional strengths with an LLM's contextual reasoning. This framework demonstrates outstanding multi-modal processing capabilities, particularly excelling in tasks such as text grounding and text spotting.
arXiv Detail & Related papers (2026-02-22T13:36:48Z) - NUMINA: A Natural Understanding Benchmark for Multi-dimensional Intelligence and Numerical Reasoning Abilities [13.229850508079073]
We introduce NUMINA, the first Natural Understanding benchmark for Multi-dimensional Intelligence and Numerical reasoning. We evaluate the performance of various state-of-the-art LLMs on NUMINA following the Chat-Scene framework.
arXiv Detail & Related papers (2025-09-20T12:05:47Z) - MaRVL-QA: A Benchmark for Mathematical Reasoning over Visual Landscapes [1.0799568216202955]
A key frontier for Multimodal Large Language Models (MLLMs) is the ability to perform deep mathematical and spatial reasoning directly from images. MaRVL-QA is a new benchmark designed to quantitatively evaluate these core reasoning skills. MaRVL-QA reveals that even state-of-the-art MLLMs struggle significantly, often resorting to superficial shortcuts instead of robust spatial reasoning.
arXiv Detail & Related papers (2025-08-24T01:24:56Z) - Discrete Tokenization for Multimodal LLMs: A Comprehensive Survey [69.45421620616486]
This work presents the first structured taxonomy and analysis of discrete tokenization methods designed for large language models (LLMs). We categorize 8 representative VQ variants that span classical and modern paradigms and analyze their algorithmic principles, training dynamics, and integration challenges with LLM pipelines. We identify key challenges including codebook collapse, unstable gradient estimation, and modality-specific encoding constraints.
arXiv Detail & Related papers (2025-07-21T10:52:14Z) - Computational Thinking Reasoning in Large Language Models [69.28428524878885]
The Computational Thinking Model (CTM) is a novel framework that incorporates computational thinking paradigms into large language models (LLMs). Live code execution is seamlessly integrated into the reasoning process, allowing CTM to think by computing. CTM outperforms conventional reasoning models and tool-augmented baselines in terms of accuracy, interpretability, and generalizability.
arXiv Detail & Related papers (2025-06-03T09:11:15Z) - SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding [64.15606979785355]
Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities?
arXiv Detail & Related papers (2025-05-22T17:59:03Z) - ET-Plan-Bench: Embodied Task-level Planning Benchmark Towards Spatial-Temporal Cognition with Foundation Models [38.89166693142495]
ET-Plan-Bench is a benchmark for embodied task planning using Large Language Models (LLMs). It features a controllable and diverse set of embodied tasks varying in difficulty and complexity. Our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework.
arXiv Detail & Related papers (2024-10-02T19:56:38Z) - Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering [54.486757407849915]
This study explores Large Language Models' mathematical reasoning on four financial question-answering datasets. We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps. We introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance.
arXiv Detail & Related papers (2024-02-17T05:10:18Z) - MuSR: Testing the Limits of Chain-of-thought with Multistep Soft Reasoning [63.80739044622555]
We introduce MuSR, a dataset for evaluating language models on soft reasoning tasks specified in a natural language narrative.
This dataset has two crucial features. First, it is created through a novel neurosymbolic synthetic-to-natural generation algorithm.
Second, our dataset instances are free text narratives corresponding to real-world domains of reasoning.
arXiv Detail & Related papers (2023-10-24T17:59:20Z)