PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation
- URL: http://arxiv.org/abs/2512.05930v1
- Date: Fri, 05 Dec 2025 18:14:55 GMT
- Title: PRiSM: An Agentic Multimodal Benchmark for Scientific Reasoning via Python-Grounded Evaluation
- Authors: Shima Imani, Seungwhan Moon, Adel Ahmadyan, Lu Zhang, Kirmani Ahmed, Babak Damavandi
- Abstract summary: PRiSM is a synthetic, fully dynamic, and multimodal benchmark for evaluating scientific reasoning via grounded Python code. PRiSM includes over 24,750 university-level physics and math problems, and it leverages our scalable agent-based pipeline, PrismAgent. We propose five targeted evaluation tasks covering generalization, symbolic program synthesis, perturbation robustness, reasoning correction, and ambiguity resolution.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Evaluating vision-language models (VLMs) in scientific domains like mathematics and physics poses unique challenges that go far beyond predicting final answers. These domains demand conceptual understanding, symbolic reasoning, and adherence to formal laws, requirements that most existing benchmarks fail to address. In particular, current datasets tend to be static, lacking intermediate reasoning steps, robustness to variations, or mechanisms for verifying scientific correctness. To address these limitations, we introduce PRiSM, a synthetic, fully dynamic, and multimodal benchmark for evaluating scientific reasoning via grounded Python code. PRiSM includes over 24,750 university-level physics and math problems, and it leverages our scalable agent-based pipeline, PrismAgent, to generate well-structured problem instances. Each problem contains dynamic textual and visual input, a generated figure, alongside rich structured outputs: executable Python code for ground truth generation and verification, and detailed step-by-step reasoning. The dynamic nature and Python-powered automated ground truth generation of our benchmark allow for fine-grained experimental auditing of multimodal VLMs, revealing failure modes, uncertainty behaviors, and limitations in scientific reasoning. To this end, we propose five targeted evaluation tasks covering generalization, symbolic program synthesis, perturbation robustness, reasoning correction, and ambiguity resolution. Through comprehensive evaluation of existing VLMs, we highlight their limitations and showcase how PRiSM enables deeper insights into their scientific reasoning capabilities.
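The abstract describes problems whose ground truth is generated and verified by executable Python rather than stored as static answers. A minimal sketch of what such a dynamic, code-grounded problem instance could look like is below; all names (`make_projectile_problem`, `verify`) and the projectile-motion scenario are illustrative assumptions, not PRiSM's actual API.

```python
# Hypothetical sketch of a dynamic, Python-grounded problem instance:
# parameters are sampled fresh per instance, and the ground truth is
# computed and checked by code rather than hand-labeled.
import math
import random

def make_projectile_problem(seed: int):
    """Sample a projectile-motion instance; return (prompt, ground_truth)."""
    rng = random.Random(seed)
    v0 = rng.uniform(5.0, 50.0)      # launch speed, m/s
    theta = rng.uniform(15.0, 75.0)  # launch angle, degrees
    g = 9.81                         # gravitational acceleration, m/s^2

    prompt = (f"A projectile is launched at {v0:.1f} m/s at {theta:.1f} "
              f"degrees above the horizontal. What is its range (in m)?")

    # Ground truth generated by executable code: R = v0^2 * sin(2*theta) / g
    ground_truth = v0 ** 2 * math.sin(2 * math.radians(theta)) / g
    return prompt, ground_truth

def verify(model_answer: float, ground_truth: float,
           rel_tol: float = 1e-2) -> bool:
    """Score a candidate answer against the code-generated ground truth."""
    return math.isclose(model_answer, ground_truth, rel_tol=rel_tol)

prompt, gt = make_projectile_problem(seed=0)
print(verify(gt, gt))  # the generated answer verifies against itself
```

Because the parameters are re-sampled per seed, the same template yields unlimited perturbed variants, which is what makes the fine-grained robustness auditing described above possible.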
Related papers
- Grounding LLMs in Scientific Discovery via Embodied Actions [84.11877211907647]
Large Language Models (LLMs) have shown significant potential in scientific discovery but struggle to bridge the gap between theoretical reasoning and physical simulation. We propose EmbodiedAct, a framework that transforms established scientific software into active embodied agents by grounding them in embodied actions with a tight perception-execution loop.
arXiv Detail & Related papers (2026-02-24T07:37:18Z) - FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs [2.3052479658146323]
We introduce FEM-Bench, a benchmark to evaluate the ability of LLMs to generate correct finite element method (FEM) and related code. These tasks capture essential numerical and physical modeling challenges while representing only a small fraction of the complexity present in the discipline. The best performing model at function writing, Gemini 3 Pro, completed 30/33 tasks at least once and 26/33 tasks all five times.
arXiv Detail & Related papers (2025-12-23T19:40:51Z) - PENDULUM: A Benchmark for Assessing Sycophancy in Multimodal Large Language Models [43.767942065379366]
Sycophancy is a tendency of AI models to agree with user input at the expense of factual accuracy or in contradiction of visual evidence. We introduce a comprehensive evaluation benchmark, PENDULUM, comprising approximately 2,000 human-curated Visual Question Answering pairs. We observe substantial variability in model robustness and a pronounced susceptibility to sycophantic and hallucinatory behavior.
arXiv Detail & Related papers (2025-12-22T12:49:12Z) - Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark [48.02995109011304]
Video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning. Existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning. We introduce Gen-ViRe, a framework grounded in cognitive science and real-world AI applications.
arXiv Detail & Related papers (2025-11-17T19:11:39Z) - DeceptionBench: A Comprehensive Benchmark for AI Deception Behaviors in Real-world Scenarios [57.327907850766785]
The characterization of deception across realistic real-world scenarios remains underexplored. We establish DeceptionBench, the first benchmark that systematically evaluates how deceptive tendencies manifest across different domains. On the intrinsic dimension, we explore whether models exhibit self-interested egoistic tendencies or sycophantic behaviors that prioritize user appeasement. We incorporate sustained multi-turn interaction loops to construct a more realistic simulation of real-world feedback dynamics.
arXiv Detail & Related papers (2025-10-17T10:14:26Z) - A Study of Rule Omission in Raven's Progressive Matrices [0.0]
Analogical reasoning lies at the core of human cognition and remains a fundamental challenge for artificial intelligence. This study investigates the generalization capacity of modern AI systems under conditions of incomplete training. Experiments reveal that although transformers demonstrate strong performance on familiar rules, their accuracy declines sharply when faced with novel or omitted rules.
arXiv Detail & Related papers (2025-10-03T15:53:28Z) - Multi-Physics: A Comprehensive Benchmark for Multimodal LLMs Reasoning on Chinese Multi-Subject Physics Problems [15.023749693065406]
We introduce Multi-Physics, a comprehensive benchmark for Chinese physics reasoning that includes 5 difficulty levels. We employ a dual evaluation framework to evaluate 20 different MLLMs, analyzing both final answer accuracy and the step-by-step integrity of their chain-of-thought.
arXiv Detail & Related papers (2025-09-19T10:18:48Z) - Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models [0.523693719989689]
We introduce a novel framework designed to rigorously evaluate Vision-Language Models (VLMs) on their understanding of 2D physics. Our framework features a pragmatic scenario generator that creates a diverse testbed of over 400 problems across four core domains: Projectile Motion, Collision Dynamics, Mechanics, and Fluid Dynamics. We demonstrate a strong correlation between model scale and reasoning ability, with our top-performing model, Qwen2.5-VL-7B, achieving an overall score of 0.815.
arXiv Detail & Related papers (2025-09-10T04:15:01Z) - PhysGym: Benchmarking LLMs in Interactive Physics Discovery with Controlled Priors [29.988641224102164]
PhysGym is a novel benchmark suite and simulation platform for rigorously assessing LLM-based scientific reasoning. PhysGym's primary contribution lies in its sophisticated control over the level of prior knowledge provided to the agent.
arXiv Detail & Related papers (2025-07-21T12:28:10Z) - Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o, and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z) - A Survey on Post-training of Large Language Models [185.51013463503946]
Large Language Models (LLMs) have fundamentally transformed natural language processing, making them indispensable across domains ranging from conversational systems to scientific exploration. These challenges necessitate advanced post-training language models (PoLMs) to address shortcomings such as restricted reasoning capacities, ethical uncertainties, and suboptimal domain-specific performance. This paper presents the first comprehensive survey of PoLMs, systematically tracing their evolution across five core paradigms: Fine-tuning, which enhances task-specific accuracy; Alignment, which ensures ethical coherence and alignment with human preferences; Reasoning, which advances multi-step inference despite challenges in reward design; and Integration and Adaptation.
arXiv Detail & Related papers (2025-03-08T05:41:42Z) - MAPS: Advancing Multi-Modal Reasoning in Expert-Level Physical Science [62.96434290874878]
Current Multi-Modal Large Language Models (MLLMs) have shown strong capabilities in general visual reasoning tasks. We develop a new framework, named Multi-Modal Scientific Reasoning with Physics Perception and Simulation (MAPS), based on an MLLM. MAPS decomposes an expert-level multi-modal reasoning task into physical diagram understanding via a Physical Perception Model (PPM) and reasoning with physical knowledge via a simulator.
arXiv Detail & Related papers (2025-01-18T13:54:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.