PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning
- URL: http://arxiv.org/abs/2510.03185v2
- Date: Thu, 30 Oct 2025 18:40:41 GMT
- Title: PRISM-Physics: Causal DAG-Based Process Evaluation for Physics Reasoning
- Authors: Wanjia Zhao, Qinwei Ma, Jingzhe Shi, Shirley Wu, Jiaqi Han, Yijia Xiao, Si-Yuan Chen, Xiao Luo, Ludwig Schmidt, James Zou
- Abstract summary: PRISM-Physics is a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas. Results show that our evaluation framework is aligned with human experts' scoring.
- Score: 57.868248683256574
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Benchmarks for competition-style reasoning have advanced evaluation in mathematics and programming, yet physics remains comparatively underexplored. Most existing physics benchmarks evaluate only final answers, an approach that fails to capture reasoning processes, while recent stepwise methods rely on heuristic LLM-as-judge scoring or restrictive linear assumptions, limiting reliability and diagnostic validity. We introduce PRISM-Physics, a process-level evaluation framework and benchmark for complex physics reasoning problems. Solutions are represented as directed acyclic graphs (DAGs) of formulas, explicitly encoding causal dependencies among intermediate steps to enable fine-grained, interpretable, and theoretically grounded scoring. We prove the optimality of the DAG representation and the corresponding scoring policy. Combined with a fully rule-based method that we developed for symbolic formula equivalence matching, this ensures consistent validation across diverse formulations without heuristic judgments. Results show that our evaluation framework is more closely aligned with human experts' scoring. Experiments on state-of-the-art LLMs reveal persistent reasoning failures in physics, while step-level scoring offers both diagnostic insight and rich signals for downstream training. By combining structural rigor, theoretical guarantees, and symbolic validation, PRISM-Physics provides a principled foundation for advancing process-level evaluation and guiding the development of models with deeper scientific reasoning capabilities.
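To make the scoring scheme concrete, here is a minimal Python sketch of the two ingredients the abstract describes: rule-based symbolic equivalence (reduced here to a sympy simplify-to-zero check) and DAG-aware step credit, in which a step earns credit only when its causal prerequisites are also matched. All names, the data layout, and the credit rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: a reference solution as a DAG of formulas, with step credit
# gated on causal prerequisites. Illustrative only; the paper's matcher and
# scoring policy are more elaborate and come with optimality guarantees.
import sympy as sp

class FormulaDAG:
    def __init__(self):
        self.formulas = {}  # step id -> sympy expression that equals zero when the step holds
        self.parents = {}   # step id -> ids of the steps this step causally depends on

    def add_step(self, step_id, expr, parents=()):
        self.formulas[step_id] = expr
        self.parents[step_id] = list(parents)

def equivalent(e1, e2):
    # Rule-based symbolic check: two formulas match if their difference
    # simplifies to zero (a stand-in for the paper's full matching rules).
    return sp.simplify(e1 - e2) == 0

def score(reference, candidate_exprs):
    """Credit a reference step only if some candidate formula matches it AND
    all of its DAG ancestors are already credited."""
    credited = set()
    for sid in sorted(reference.formulas):  # assumes ids follow a topological order
        prereqs_ok = all(p in credited for p in reference.parents[sid])
        matched = any(equivalent(reference.formulas[sid], c) for c in candidate_exprs)
        if prereqs_ok and matched:
            credited.add(sid)
    return len(credited) / len(reference.formulas)
```

Under this rule, a correct final formula reached without the intermediate steps it causally depends on earns no credit, which is what makes the score diagnostic of the process rather than of the outcome alone.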
Related papers
- EvalQReason: A Framework for Step-Level Reasoning Evaluation in Large Language Models [0.8399688944263844]
We present EvalQReason, a framework that quantifies LLM reasoning quality through step-level probability distribution analysis. The framework introduces two complementary algorithms: Consecutive Step Divergence (CSD), which measures local coherence between adjacent reasoning steps, and Step-to-Final Convergence (SFC), which assesses global alignment with final answers.
arXiv Detail & Related papers (2026-02-02T16:32:40Z)
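Since both metrics are described as step-level probability distribution analyses, here is a hedged sketch of how they might be computed, using Jensen-Shannon divergence over hypothetical per-step answer distributions; EvalQReason's actual definitions and choice of divergence may differ.

```python
# Illustrative sketch of CSD/SFC-style metrics. `step_dists` is assumed to be
# a list of probability vectors, one per reasoning step (e.g., the model's
# answer distribution after each step); the paper's definitions may differ.
import numpy as np
from scipy.spatial.distance import jensenshannon

def csd(step_dists):
    """Consecutive Step Divergence: mean divergence between adjacent steps'
    distributions; lower values suggest more locally coherent reasoning."""
    return float(np.mean([jensenshannon(p, q) ** 2
                          for p, q in zip(step_dists, step_dists[1:])]))

def sfc(step_dists):
    """Step-to-Final Convergence: mean divergence of each step's distribution
    from the final one; lower values suggest earlier alignment with the answer."""
    final = step_dists[-1]
    return float(np.mean([jensenshannon(p, final) ** 2
                          for p in step_dists[:-1]]))
```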
- Unmasking Reasoning Processes: A Process-aware Benchmark for Evaluating Structural Mathematical Reasoning in LLMs [20.82580343824728]
Recent large language models (LLMs) achieve near-saturation accuracy on many established mathematical reasoning benchmarks. This saturation stems from the dominance of template-based computation and shallow arithmetic decomposition. We introduce ReasoningMath-Plus, a benchmark of 150 carefully curated problems explicitly designed to evaluate structural reasoning.
arXiv Detail & Related papers (2026-01-31T07:09:17Z)
- SciIF: Benchmarking Scientific Instruction Following Towards Rigorous Scientific Intelligence [60.202862987441684]
We introduce scientific instruction following: the capability to solve problems while strictly adhering to the constraints that establish scientific validity. Specifically, we introduce SciIF, a multi-discipline benchmark that evaluates this capability by pairing university-level problems with a fixed catalog of constraints. By measuring both solution correctness and multi-constraint adherence, SciIF enables fine-grained diagnosis of compositional reasoning failures.
arXiv Detail & Related papers (2026-01-08T09:45:58Z)
- FEM-Bench: A Structured Scientific Reasoning Benchmark for Evaluating Code-Generating LLMs [2.3052479658146323]
We introduce FEM-Bench, a benchmark to evaluate the ability of LLMs to generate correct finite element method (FEM) and related code. These tasks capture essential numerical and physical modeling challenges while representing only a small fraction of the complexity present in the discipline. The best-performing model at function writing, Gemini 3 Pro, completed 30/33 tasks at least once and 26/33 tasks all five times.
arXiv Detail & Related papers (2025-12-23T19:40:51Z)
- Step-Aware Policy Optimization for Reasoning in Diffusion Large Language Models [57.42778606399764]
Diffusion language models (dLLMs) offer a promising, non-autoregressive paradigm for text generation. Current reinforcement learning approaches often rely on sparse, outcome-based rewards. We argue that this reliance is fundamentally mismatched with the natural structure of reasoning.
arXiv Detail & Related papers (2025-10-02T00:34:15Z)
- PKG-DPO: Optimizing Domain-Specific AI systems with Physics Knowledge Graphs and Direct Preference Optimization [0.0]
We introduce PKG-DPO, a novel framework that integrates Physics Knowledge Graphs (PKGs) with Direct Preference Optimization (DPO). PKG-DPO achieves 17% fewer constraint violations and an 11% higher Physics Score compared to KG-DPO (knowledge graph-based DPO). While our primary focus is on metal joining, the framework is broadly applicable to other multi-scale, physics-driven domains.
arXiv Detail & Related papers (2025-08-25T18:31:03Z)
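For context, the standard DPO objective that such frameworks build on (Rafailov et al., 2023) is shown below; how PKG-DPO derives physics-grounded preference pairs from the knowledge graph is specific to that paper and not reproduced here:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}})
= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}
\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}
- \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

where $y_w$ is the preferred (here, physics-consistent) response, $y_l$ the rejected one, $\sigma$ the logistic function, and $\beta$ controls how far the policy $\pi_\theta$ may drift from the reference $\pi_{\mathrm{ref}}$.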
- CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics [71.42168240638462]
CMPhysBench is designed to assess the proficiency of Large Language Models in Condensed Matter Physics. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench.
arXiv Detail & Related papers (2025-08-25T15:32:22Z)
- Test-time Scaling Techniques in Theoretical Physics -- A Comparison of Methods on the TPBench Dataset [13.530403536762064]
We evaluate a range of common test-time scaling methods on the TPBench physics dataset. We develop a novel, symbolic weak-verifier framework to improve parallel scaling results. Our findings highlight the power of step-wise symbolic verification for tackling complex scientific problems.
arXiv Detail & Related papers (2025-06-25T18:00:18Z)
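As a rough illustration of the parallel-scaling pattern the abstract refers to, here is a short Python sketch in which `generate` and `verify` are hypothetical callables standing in for the model sampler and the paper's symbolic weak verifier:

```python
# Best-of-n with a weak verifier: sample candidates in parallel, keep those the
# (possibly imperfect) verifier accepts, then majority-vote over the pool.
# `generate` and `verify` are assumed interfaces, not the paper's actual API.
from collections import Counter

def weak_verified_best_of_n(generate, verify, problem, n=8):
    candidates = [generate(problem) for _ in range(n)]  # assumed to return hashable answers
    accepted = [c for c in candidates if verify(problem, c)]
    pool = accepted or candidates  # fall back if the weak verifier rejects everything
    return Counter(pool).most_common(1)[0][0]
```

With `verify` always returning True this reduces to plain majority voting, the baseline that weak-verifier reranking aims to improve on.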
- PhySense: Principle-Based Physics Reasoning Benchmarking for Large Language Models [9.097623284579836]
Large language models (LLMs) have rapidly advanced and are increasingly capable of tackling complex scientific problems. This discrepancy highlights a crucial gap in their ability to apply core physical principles for efficient and interpretable problem solving. We introduce PhySense, a novel principle-based physics reasoning benchmark designed to be easily solvable by experts using guiding principles.
arXiv Detail & Related papers (2025-05-30T17:25:20Z)
- PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models [33.45006997591683]
PHYBench is a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models.
arXiv Detail & Related papers (2025-04-22T17:53:29Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Evaluating Mathematical Reasoning Beyond Accuracy [50.09931172314218]
We introduce ReasonEval, a new methodology for evaluating the quality of reasoning steps. We show that ReasonEval consistently outperforms baseline methods on the meta-evaluation datasets. We observe that ReasonEval can play a significant role in data selection.
arXiv Detail & Related papers (2024-04-08T17:18:04Z)