MMGR: Multi-Modal Generative Reasoning
- URL: http://arxiv.org/abs/2512.14691v2
- Date: Wed, 17 Dec 2025 18:42:37 GMT
- Title: MMGR: Multi-Modal Generative Reasoning
- Authors: Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Wen Xiao, Jiuxiang Gu, Nanyun Peng, Junjie Hu
- Abstract summary: We introduce MMGR, a principled evaluation framework based on five reasoning abilities. MMGR evaluates generative reasoning across three domains: Abstract Reasoning, Embodied Navigation, and Physical Commonsense. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image).
- Score: 97.44203203196481
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Fréchet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing strong performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
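To make the abstract's contrast concrete, the sketch below shows why a distributional metric can miss reasoning failures: FVD is the Fréchet distance between Gaussian fits of real and generated video features (typically from an I3D network, omitted here), so a generated video that violates causality can still score well, whereas a holistic check in the spirit of MMGR's fine-grained metrics credits an output only if it is correct in its entirety. This is a minimal illustration, not MMGR's released evaluation code; both function names are hypothetical.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussian fits of two (N, D) feature sets.

    FVD applies this to video embeddings; the feature extractor is
    omitted, so this measures distributional similarity only and says
    nothing about whether any individual sample is causally coherent.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # sqrtm can return tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

def holistic_grid_match(pred: np.ndarray, target: np.ndarray) -> bool:
    """All-or-nothing correctness for an ARC-AGI-style output grid:
    the sample counts only if every cell matches the target."""
    return bool(np.array_equal(pred, target))
```

A low FVD certifies only that generated features look statistically like real ones; the holistic check is pass/fail per sample, which is what exposes the sub-10-percent ARC-AGI accuracy reported above.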
Related papers
- RAYNOVA: Scale-Temporal Autoregressive World Modeling in Ray Space [51.441415833480505]
RAYNOVA is a multiview world model for driving scenarios that employs a dual-causal autoregressive framework. It constructs an isotropic-temporal representation across views, frames, and scales based on relative Plücker-ray positional encoding.
arXiv Detail & Related papers (2026-02-24T08:41:40Z)
- Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark [48.02995109011304]
Video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning. Existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning. We introduce Gen-ViRe, a framework grounded in cognitive science and real-world AI applications.
arXiv Detail & Related papers (2025-11-17T19:11:39Z)
- Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark [124.00111584020834]
We conduct an empirical study to investigate whether video models are ready to serve as zero-shot reasoners, focusing on Veo-3 as a leading and popular model. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic.
arXiv Detail & Related papers (2025-10-30T17:59:55Z)
- Clone Deterministic 3D Worlds with Geometrically-Regularized World Models [16.494281967592745]
World models are essential for enabling agents to think, plan, and reason effectively in complex, dynamic settings. Despite rapid progress, current world models remain brittle and degrade over long horizons. We propose Geometrically-Regularized World Models (GRWM), which enforce that consecutive points along a natural sensory trajectory remain close in latent representation space (a minimal sketch of such a regularizer follows this list).
arXiv Detail & Related papers (2025-10-30T17:56:43Z)
- DSG-World: Learning a 3D Gaussian World Model from Dual State Videos [14.213608866611784]
We present DSG-World, a novel end-to-end framework that explicitly constructs a 3D Gaussian world model from dual-state observations. Our approach builds dual segmentation-aware Gaussian fields and enforces bidirectional photometric and semantic consistency.
arXiv Detail & Related papers (2025-06-05T16:33:32Z)
- E3D-Bench: A Benchmark for End-to-End 3D Geometric Foundation Models [78.1674905950243]
We present the first comprehensive benchmark for 3D geometric foundation models (GFMs). GFMs directly predict dense 3D representations in a single feed-forward pass, eliminating the need for slow or unavailable precomputed camera parameters. We evaluate 16 state-of-the-art GFMs, revealing their strengths and limitations across tasks and domains. All code, evaluation scripts, and processed data will be publicly released to accelerate research in 3D spatial intelligence.
arXiv Detail & Related papers (2025-06-02T17:53:09Z)
- VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an automated framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios. We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
- TOMATO: Assessing Visual Temporal Reasoning Capabilities in Multimodal Foundation Models [55.48403691519395]
TOMATO is a novel benchmark crafted to rigorously assess MFMs' temporal reasoning capabilities in video understanding. TOMATO comprises 1,484 carefully curated, human-annotated questions spanning six tasks. Our comprehensive evaluation reveals a human-model performance gap of 57.3% with the best-performing model.
arXiv Detail & Related papers (2024-10-30T17:50:23Z)
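The geometric regularization idea from the GRWM entry above lends itself to a compact illustration. The sketch below, assuming a PyTorch setting and a (T, D) tensor of latents encoded from one sensory trajectory in temporal order, penalizes the gap between consecutive latent codes; the paper's actual loss may differ in form and weighting, and the function name is hypothetical.

```python
import torch

def trajectory_smoothness_loss(latents: torch.Tensor) -> torch.Tensor:
    """Keep z_t and z_{t+1} close in latent space.

    latents: (T, D) encodings of consecutive observations along one
    natural sensory trajectory. Returns the mean squared step length,
    which an encoder can minimize alongside its main objective.
    """
    diffs = latents[1:] - latents[:-1]      # (T-1, D) consecutive gaps
    return diffs.pow(2).sum(dim=-1).mean()  # scalar regularization term
```

Added to a world model's training objective with a small weight, a term like this discourages latent trajectories from drifting over long horizons, which is the brittleness the GRWM summary describes.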
This list is automatically generated from the titles and abstracts of the papers on this site.