PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
- URL: http://arxiv.org/abs/2505.15929v2
- Date: Thu, 29 May 2025 17:59:14 GMT
- Title: PhyX: Does Your Model Have the "Wits" for Physical Reasoning?
- Authors: Hui Shen, Taiqiang Wu, Qi Han, Yunta Hsieh, Jizhou Wang, Yuyue Zhang, Yuxin Cheng, Zijian Hao, Yuansheng Ni, Xin Wang, Zhongwei Wan, Kai Zhang, Wendong Xu, Jing Xiong, Ping Luo, Wenhu Chen, Chaofan Tao, Zhuoqing Mao, Ngai Wong
- Abstract summary: Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning. We introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios.
- Score: 49.083544963243206
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning, the integrated ability to combine domain knowledge, symbolic reasoning, and understanding of real-world constraints. To address this gap, we introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios. PhyX includes 3K meticulously curated multimodal questions spanning 6 reasoning types across 25 sub-domains and 6 core physics domains: thermodynamics, electromagnetism, mechanics, modern physics, optics, and wave & acoustics. In our comprehensive evaluation, even state-of-the-art models struggle significantly with physical reasoning. GPT-4o, Claude3.7-Sonnet, and GPT-o4-mini achieve only 32.5%, 42.2%, and 45.8% accuracy respectively, with performance gaps exceeding 29% compared to human experts. Our analysis exposes critical limitations in current models: over-reliance on memorized disciplinary knowledge, excessive dependence on mathematical formulations, and surface-level visual pattern matching rather than genuine physical understanding. We provide in-depth analysis through fine-grained statistics, detailed case studies, and multiple evaluation paradigms to thoroughly examine physical reasoning capabilities. To ensure reproducibility, we implement a compatible evaluation protocol based on widely-used toolkits such as VLMEvalKit, enabling one-click evaluation. More details are available on our project page: https://phyx-bench.github.io/.
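The abstract notes that the one-click evaluation protocol builds on VLMEvalKit. As a rough illustration of the answer-level scoring such a protocol performs, the Python sketch below computes multiple-choice accuracy from a predictions file and a gold-answer file. The file names and JSON field names are assumptions made for this example, not details taken from the paper or from VLMEvalKit.

```python
# Minimal accuracy scorer for a PhyX-style multiple-choice benchmark.
# Assumption (not from the paper): predictions and gold answers are JSON lists
# of {"id": ..., "answer": ...} records, where "answer" is an option letter.
import json


def load_answers(path: str) -> dict:
    """Map question id -> normalized answer letter (e.g. "A")."""
    with open(path, encoding="utf-8") as f:
        return {item["id"]: str(item["answer"]).strip().upper() for item in json.load(f)}


def accuracy(pred_path: str, gold_path: str) -> float:
    """Fraction of gold questions whose prediction matches exactly."""
    preds = load_answers(pred_path)
    gold = load_answers(gold_path)
    correct = sum(1 for qid, ans in gold.items() if preds.get(qid) == ans)
    return correct / len(gold) if gold else 0.0


if __name__ == "__main__":
    # Hypothetical file names; in practice the paper's VLMEvalKit integration
    # handles prediction generation and scoring end to end.
    print(f"accuracy = {accuracy('predictions.json', 'phyx_gold.json'):.1%}")
```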
Related papers
- PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models [69.73115077227969]
We present PhysUniBench, a large-scale benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs). PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels.
arXiv Detail & Related papers (2025-06-21T09:55:42Z) - IntPhys 2: Benchmarking Intuitive Physics Understanding In Complex Synthetic Environments [26.02187269408895]
IntPhys 2 is a video benchmark designed to evaluate the intuitive physics understanding of deep learning models. IntPhys 2 focuses on four core principles related to macroscopic objects: Permanence, Immutability, Spatio-Temporal Continuity, and Solidity.
arXiv Detail & Related papers (2025-06-11T15:21:16Z) - SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning [89.48883747910448]
We present SeePhys, a large-scale multimodal benchmark for reasoning grounded in physics questions. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. We observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark.
arXiv Detail & Related papers (2025-05-25T11:28:34Z) - ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models [37.54872845368151]
We conduct a case study using a synthetic dataset solvable only through visual reasoning. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions. Although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%.
arXiv Detail & Related papers (2025-05-19T17:59:27Z) - PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models [33.45006997591683]
PHYBench is a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models.
arXiv Detail & Related papers (2025-04-22T17:53:29Z) - PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning [36.193595420239845]
We present PhysReason, a 1,200-problem benchmark for evaluating large language models. Problems require an average of 8.1 solution steps, with hard problems requiring 15.6. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation.
arXiv Detail & Related papers (2025-02-17T17:24:14Z) - Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z) - PhyBench: A Physical Commonsense Benchmark for Evaluating Text-to-Image Models [50.33699462106502]
Text-to-image (T2I) models frequently fail to produce images consistent with physical commonsense.
Current T2I evaluation benchmarks focus on metrics such as accuracy, bias, and safety, neglecting the evaluation of models' internal knowledge.
We introduce PhyBench, a comprehensive T2I evaluation dataset comprising 700 prompts across 4 primary categories: mechanics, optics, thermodynamics, and material properties.
arXiv Detail & Related papers (2024-06-17T17:49:01Z) - ContPhy: Continuum Physical Concept Learning and Reasoning from Videos [86.63174804149216]
ContPhy is a novel benchmark for assessing machine physical commonsense.
We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy.
We also introduce an oracle model (ContPRO) that marries particle-based physical dynamics models with recent large language models.
arXiv Detail & Related papers (2024-02-09T01:09:21Z)