TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning
- URL: http://arxiv.org/abs/2601.16520v1
- Date: Fri, 23 Jan 2026 07:35:05 GMT
- Title: TangramPuzzle: Evaluating Multimodal Large Language Models with Compositional Spatial Reasoning
- Authors: Daixian Liu, Jiayi Kuang, Yinghui Li, Yangning Li, Di Yin, Haoyu Cao, Xing Sun, Ying Shen, Hai-Tao Zheng, Liang Lin, Philip S. Yu
- Abstract summary: We introduce a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in visual recognition and semantic understanding. Nevertheless, their ability to perform precise compositional spatial reasoning remains largely unexplored. Existing benchmarks often involve relatively simple tasks and rely on semantic approximations or coarse relative positioning, while their evaluation metrics are typically limited and lack rigorous mathematical formulations. To bridge this gap, we introduce TangramPuzzle, a geometry-grounded benchmark designed to evaluate compositional spatial reasoning through the lens of the classic Tangram game. We propose the Tangram Construction Expression (TCE), a symbolic geometric framework that grounds tangram assemblies in exact, machine-verifiable coordinate specifications, to mitigate the ambiguity of visual approximation. We design two complementary tasks: Outline Prediction, which demands inferring global shapes from local components, and End-to-End Code Generation, which requires solving inverse geometric assembly problems. We conduct extensive evaluation experiments on advanced open-source and proprietary models, revealing an interesting insight: MLLMs tend to prioritize matching the target silhouette while neglecting geometric constraints, leading to distortions or deformations of the pieces.
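The abstract does not give the concrete TCE syntax, so the sketch below is only a hypothetical illustration of what "exact, machine-verifiable coordinate specifications" can buy: with pieces pinned to coordinates, a verifier can mechanically reject the very failure mode the paper reports (pieces distorted to match a silhouette). The piece coordinates, function names, and checks are all assumptions, not the benchmark's actual implementation.

```python
# Hypothetical sketch of machine-verifiable assembly checks in the spirit of
# TCE (the abstract does not specify its actual syntax or API). A candidate
# assembly maps each piece name to its placed vertex coordinates.
import math

# The seven tangram pieces from the classic unit-square dissection.
PIECES = {
    "large_tri_1":   [(0, 0), (1, 0), (0.5, 0.5)],
    "large_tri_2":   [(0, 0), (0.5, 0.5), (0, 1)],
    "medium_tri":    [(0.5, 1), (1, 1), (1, 0.5)],
    "square":        [(0.5, 0.5), (0.75, 0.25), (1, 0.5), (0.75, 0.75)],
    "small_tri_1":   [(0.75, 0.25), (1, 0), (1, 0.5)],
    "small_tri_2":   [(0, 1), (0.25, 0.75), (0.5, 1)],
    "parallelogram": [(0.25, 0.75), (0.5, 0.5), (0.75, 0.75), (0.5, 1)],
}

def area(verts):
    """Unsigned polygon area via the shoelace formula."""
    s = sum(x1 * y2 - x2 * y1
            for (x1, y1), (x2, y2) in zip(verts, verts[1:] + verts[:1]))
    return abs(s) / 2.0

def edge_lengths(verts):
    """Sorted multiset of side lengths of a polygon."""
    return sorted(math.dist(a, b) for a, b in zip(verts, verts[1:] + verts[:1]))

def is_undistorted(placed, canonical, tol=1e-9):
    """Reject scaled or sheared copies: a rigid motion (plus reflection)
    preserves both the side lengths and the area of a piece."""
    if len(placed) != len(canonical):
        return False
    sides_ok = all(abs(a - b) <= tol
                   for a, b in zip(edge_lengths(placed), edge_lengths(canonical)))
    return sides_ok and abs(area(placed) - area(canonical)) <= tol

def verify_assembly(placed_pieces, target_area, tol=1e-9):
    """Necessary (not sufficient) validity checks: every piece must be an
    undistorted copy of its canonical shape, and the piece areas must sum
    to the silhouette's area. Pairwise-overlap tests are omitted here."""
    if not all(is_undistorted(v, PIECES[k]) for k, v in placed_pieces.items()):
        return False
    total = sum(area(v) for v in placed_pieces.values())
    return abs(total - target_area) <= tol
```

Under these assumptions, placing every piece at its canonical position reassembles the unit square, so `verify_assembly(dict(PIECES), 1.0)` holds, while stretching any single piece to better fit a silhouette fails the congruence check, which is exactly the kind of deformation a purely visual comparison can miss.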
Related papers
- PointCoT: A Multi-modal Benchmark for Explicit 3D Geometric Reasoning [82.55361351483005]
We present PointCoT, a novel framework that empowers MLLMs with explicit Chain-of-Thought (CoT) reasoning for 3D data. By leveraging a dual-stream multi-modal architecture, our method synergizes semantic appearance with geometric truth.
arXiv Detail & Related papers (2026-02-27T11:47:45Z) - Thinking with Geometry: Active Geometry Integration for Spatial Reasoning [68.59084007360615]
We propose GeoThinker, a framework that shifts the paradigm from passive fusion to active perception. Instead of feature mixing, GeoThinker enables the model to selectively retrieve geometric evidence conditioned on its internal reasoning demands. Our results indicate that the ability to actively integrate spatial structures is essential for next-generation spatial intelligence.
arXiv Detail & Related papers (2026-02-05T18:59:32Z) - Geometry of Decision Making in Language Models [19.74354232642455]
Large Language Models (LLMs) show strong generalization across diverse tasks, yet the internal decision-making processes behind their predictions remain opaque. We study the geometry of hidden representations in LLMs through the lens of intrinsic dimension (ID). We perform a large-scale study with 28 open-weight transformer models and estimate ID across layers using multiple estimators.
arXiv Detail & Related papers (2025-11-25T13:52:46Z) - GeoRef: Referring Expressions in Geometry via Task Formulation, Synthetic Supervision, and Reinforced MLLM-based Solutions [45.70578816057097]
We introduce the task of Referring Expression Comprehension (REC) for geometric problems. REC evaluates whether models can localize points, shapes, and spatial relations in diagrams in response to textual prompts. We generate a large-scale synthetic training dataset using a structured geometric formal language.
arXiv Detail & Related papers (2025-09-25T12:00:52Z) - Geometry-Editable and Appearance-Preserving Object Composition [67.98806888489385]
General object composition (GOC) aims to seamlessly integrate a target object into a background scene with desired geometric properties. Recent approaches derive semantic embeddings and integrate them into advanced diffusion models to enable geometry-editable generation. We introduce a Disentangled Geometry-editable and Appearance-preserving Diffusion model that first leverages semantic embeddings to implicitly capture desired geometric transformations.
arXiv Detail & Related papers (2025-05-27T09:05:28Z) - GeoGramBench: Benchmarking the Geometric Program Reasoning in Modern LLMs [7.605833826892782]
We present a benchmark of 500 carefully refined problems organized by a tailored three-level taxonomy that considers geometric complexity rather than traditional mathematical reasoning complexity. Our comprehensive evaluation of 17 frontier LLMs reveals consistent and pronounced deficiencies. These results highlight the unique challenges posed by program-driven spatial reasoning and establish GeoGramBench as a valuable resource for advancing research in symbolic-to-spatial geometric reasoning.
arXiv Detail & Related papers (2025-05-23T09:17:07Z) - MATHGLANCE: Multimodal Large Language Models Do Not Know Where to Look in Mathematical Diagrams [65.02628814094639]
Diagrams serve as a fundamental form of visual language, representing complex concepts and their inter-relationships through structured symbols, shapes, and spatial arrangements. Current benchmarks conflate perceptual and reasoning tasks, making it difficult to assess whether Multimodal Large Language Models genuinely understand mathematical diagrams beyond superficial pattern recognition. We introduce MATHGLANCE, a benchmark specifically designed to isolate and evaluate mathematical perception in MLLMs. We construct GeoPeP, a perception-oriented dataset of 200K structured geometry image-text pairs annotated with geometric primitives and precise spatial relationships.
arXiv Detail & Related papers (2025-03-26T17:30:41Z) - GePBench: Evaluating Fundamental Geometric Perception for Multimodal Large Language Models [34.647839550142834]
We introduce GePBench, a novel benchmark designed to assess the geometric perception abilities of MLLMs. Our evaluations reveal that current state-of-the-art MLLMs exhibit significant deficiencies in geometric perception tasks. We show that models trained with GePBench data demonstrate substantial improvements on a wide range of benchmark tasks.
arXiv Detail & Related papers (2024-12-30T16:01:43Z) - RGM: A Robust Generalizable Matching Model [49.60975442871967]
We propose a deep model for sparse and dense matching, termed RGM (Robust Generalist Matching).
To narrow the gap between synthetic training samples and real-world scenarios, we build a new, large-scale dataset with sparse correspondence ground truth.
We are able to mix up various dense and sparse matching datasets, significantly improving the training diversity.
arXiv Detail & Related papers (2023-10-18T07:30:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.