SPHINX: A Synthetic Environment for Visual Perception and Reasoning
- URL: http://arxiv.org/abs/2511.20814v1
- Date: Tue, 25 Nov 2025 20:00:47 GMT
- Title: SPHINX: A Synthetic Environment for Visual Perception and Reasoning
- Authors: Md Tanvirul Alam, Saksham Aggarwal, Justin Yang Chae, Nidhi Rastogi,
- Abstract summary: We present Sphinx, a synthetic environment for visual perception and reasoning.<n>It generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions.<n>The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction.
- Score: 4.245676108236535
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present Sphinx, a synthetic environment for visual perception and reasoning that targets core cognitive primitives. Sphinx procedurally generates puzzles using motifs, tiles, charts, icons, and geometric primitives, each paired with verifiable ground-truth solutions, enabling both precise evaluation and large-scale dataset construction. The benchmark covers 25 task types spanning symmetry detection, geometric transformations, spatial reasoning, chart interpretation, and sequence prediction. Evaluating recent large vision-language models (LVLMs) shows that even state-of-the-art GPT-5 attains only 51.1% accuracy, well below human performance. Finally, we demonstrate that reinforcement learning with verifiable rewards (RLVR) substantially improves model accuracy on these tasks and yields gains on external visual reasoning benchmarks, highlighting its promise for advancing multimodal reasoning.
Related papers
- MentalBlackboard: Evaluating Spatial Visualization via Mathematical Transformations [33.000090283250934]
We develop MentalBlackboard, an open-ended spatial visualization benchmark for Paper Folding and Hole Punching tests.<n>Prediction experiments reveal that models struggle with applying symmetrical transformations, even when they predict the sequence of unfolding steps correctly.<n>The planning task reveals limitations of models in analyzing symmetrical relationships and in implementing the multi-stage symmetry process.
arXiv Detail & Related papers (2026-02-22T22:05:11Z) - TangramSR: Can Vision-Language Models Reason in Continuous Geometric Space? [11.222572150508332]
Humans excel at spatial reasoning tasks like Tangram puzzle assembly through cognitive processes involving mental rotation, iterative refinement, and visual feedback.<n>However, comprehensive experiments across five representative Vision-Language Models (VLMs) reveal systematic failures in continuous geometric reasoning.<n>We introduce a test-time self-refinement framework that combines in-context learning (ICL) with reward-guided feedback loops, inspired by human cognitive processes.
arXiv Detail & Related papers (2026-02-05T11:49:30Z) - An Analysis of Architectural Impact on LLM-based Abstract Visual Reasoning: A Systematic Benchmark on RAVEN-FAIR [0.0]
GPT-4.1-Mini consistently achieved the highest overall accuracy across all architectures.<n>Each model exhibited distinct sensitivity patterns to architectural design, underscoring that reasoning effectiveness remains model-specific.
arXiv Detail & Related papers (2025-11-14T22:50:22Z) - Explain Before You Answer: A Survey on Compositional Visual Reasoning [74.27548620675748]
Compositional visual reasoning has emerged as a key research frontier in multimodal AI.<n>This survey systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.)<n>We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception.
arXiv Detail & Related papers (2025-08-24T11:01:51Z) - Oedipus and the Sphinx: Benchmarking and Improving Visual Language Models for Complex Graphic Reasoning [14.984593408786045]
We propose ReasonBench to evaluate the performance of visual language models (VLMs) in graphic reasoning tasks.<n>ReasonBench includes 1,613 questions from real-world intelligence tests.<n>We benchmark 11 mainstream VLMs and reveal significant limitations of current models.
arXiv Detail & Related papers (2025-08-01T05:12:38Z) - From Ideal to Real: Unified and Data-Efficient Dense Prediction for Real-World Scenarios [66.57089888022414]
We introduce DenseWorld, a benchmark spanning a broad set of 25 dense prediction tasks that correspond to urgent real-world applications.<n>We then propose DenseDiT, which exploits generative models' visual priors to perform diverse real-world dense prediction tasks through a unified strategy.<n>DenseDiT achieves superior results using less than 0.01% training data of baselines, underscoring its practical value for real-world deployment.
arXiv Detail & Related papers (2025-06-25T09:40:50Z) - Unfolding Spatial Cognition: Evaluating Multimodal Models on Visual Simulations [61.235500325327585]
Existing AI benchmarks primarily assess verbal reasoning, neglecting the complexities of non-verbal, multi-step visual simulation.<n>We introduce STARE, a benchmark designed to rigorously evaluate multimodal large language models on tasks better solved through visual simulation.<n>Our evaluations show that models excel at reasoning over simpler 2D transformations, but perform close to random chance on more complex tasks.
arXiv Detail & Related papers (2025-06-05T05:09:46Z) - VisuRiddles: Fine-grained Perception is a Primary Bottleneck for Multimodal Large Language Models in Abstract Visual Reasoning [70.44416154144001]
Recent strides in multimodal large language models (MLLMs) have significantly advanced their performance in many reasoning tasks.<n> Abstract Visual Reasoning (AVR) remains a critical challenge, primarily due to limitations in perceiving abstract graphics.<n>We propose VisuRiddles, a benchmark for PRS, featuring tasks meticulously constructed to assess models' reasoning capacities.<n>Second, we introduce the Perceptual Riddle Synthesizer (PRS), an automated framework for generating riddles with fine-grained perceptual descriptions.
arXiv Detail & Related papers (2025-06-03T07:24:00Z) - Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE)<n>RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning.<n>We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z) - Human Cognitive Benchmarks Reveal Foundational Visual Gaps in MLLMs [65.93003087656754]
VisFactor is a benchmark that digitizes 20 vision-centric subtests from a well-established cognitive psychology assessment.<n>We evaluate 20 frontier Multimodal Large Language Models (MLLMs) from GPT, Gemini, Claude, LLaMA, Qwen, and SEED families.<n>The best-performing model achieves a score of only 25.19 out of 100, with consistent failures on tasks such as mental rotation, spatial relation inference, and figure-ground discrimination.
arXiv Detail & Related papers (2025-02-23T04:21:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.