PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2504.16074v2
- Date: Sun, 18 May 2025 14:13:34 GMT
- Title: PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models
- Authors: Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin, Yutong Ren, Zizhuo Fu, Jiaming Ji, Weike Wang, Xudong Tian, Anqi Lv, Laifu Man, Jianxiang Li, Feiyu Tao, Qihua Sun, Zhou Liang, Yushu Mu, Zhongxuan Li, Jing-Jun Zhang, Shutao Zhang, Xiaotian Li, Xingqi Xia, Jiawei Lin, Zheyu Shen, Jiahang Chen, Qiuhao Xiong, Binran Wang, Fengyuan Wang, Ziyang Ni, Bohan Zhang, Fan Cui, Changkun Shao, Qing-Hong Cao, Ming-xing Luo, Yaodong Yang, Muhan Zhang, Hua Xing Zhu,
- Abstract summary: PHYBench is a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models.
- Score: 33.45006997591683
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current benchmarks for evaluating the reasoning capabilities of Large Language Models (LLMs) face significant limitations: task oversimplification, data contamination, and flawed evaluation items. These deficiencies necessitate more rigorous assessment methods. To address these limitations, we introduce PHYBench, a benchmark of 500 original physics problems ranging from high school to Physics Olympiad difficulty. PHYBench addresses data contamination through original content and employs a systematic curation pipeline to eliminate flawed items. Evaluations show that PHYBench activates more tokens and provides stronger differentiation between reasoning models compared to other baselines like AIME 2024, OlympiadBench and GPQA. Even the best-performing model, Gemini 2.5 Pro, achieves only 36.9% accuracy compared to human experts' 61.9%. To further enhance evaluation precision, we introduce the Expression Edit Distance (EED) Score for mathematical expression assessment, which improves sample efficiency by 204% over binary scoring. Moreover, PHYBench effectively elicits multi-step and multi-condition reasoning, providing a platform for examining models' reasoning robustness, preferences, and deficiencies. The benchmark results and dataset are publicly available at https://www.phybench.cn/.
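The partial-credit idea behind the Expression Edit Distance (EED) Score can be sketched as follows. This is an illustrative stand-in only, not the paper's exact algorithm: it uses a token-level Levenshtein distance mapped to a [0, 1] score, whereas the paper computes an edit distance over parsed expression trees, and the function names here are hypothetical.

```python
import re

# Illustrative sketch of an edit-distance-based expression score.
# NOT the paper's exact EED algorithm: it substitutes token-level
# Levenshtein distance for the tree edit distance on parsed expressions.

def tokenize(expr: str):
    """Split a math expression into coarse tokens: names, numbers, operators."""
    return re.findall(r"[A-Za-z]+|\d+\.?\d*|\S", expr)

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between token sequences."""
    dp = list(range(len(b) + 1))          # row for the empty prefix of a
    for i, ta in enumerate(a, 1):
        prev, dp[0] = dp[0], i            # prev holds dp[i-1][j-1]
        for j, tb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,                # insertion
                dp[j - 1] + 1,            # deletion
                prev + (ta != tb),        # substitution (0 if tokens match)
            )
    return dp[len(b)]

def eed_score(answer: str, reference: str) -> float:
    """Map edit distance to a partial-credit score in [0, 1]:
    1.0 for an exact match, decreasing with each edit needed."""
    a, b = tokenize(answer), tokenize(reference)
    d = levenshtein(a, b)
    return max(0.0, 1.0 - d / max(len(a), len(b), 1))
```

Unlike binary scoring, a near-miss answer such as `m*g*l` against reference `m*g*h` earns substantial partial credit rather than zero, which is the property the paper credits for the gain in sample efficiency.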
Related papers
- Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination [67.67725938962798]
Pre-training on massive web-scale corpora leaves Qwen2.5 susceptible to data contamination in widely used benchmarks. We introduce a generator that creates fully clean arithmetic problems of arbitrary length and difficulty, dubbed RandomCalculation. We show that only accurate reward signals yield steady improvements that surpass the base model's performance boundary.
arXiv Detail & Related papers (2025-07-14T17:55:15Z) - Reasoning Models Are More Easily Gaslighted Than You Think [85.84943447589511]
We evaluate three state-of-the-art reasoning models: OpenAI's o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash. Our evaluation reveals significant accuracy drops following gaslighting negation prompts. We introduce GaslightingBench-R, a new diagnostic benchmark designed to evaluate reasoning models' ability to defend their beliefs under gaslighting.
arXiv Detail & Related papers (2025-06-11T12:52:25Z) - PhyX: Does Your Model Have the "Wits" for Physical Reasoning? [49.083544963243206]
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning. We introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios.
arXiv Detail & Related papers (2025-05-21T18:33:50Z) - Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad [0.8030359871216614]
We evaluate state-of-the-art reasoning models on six problems from the 2025 USAMO. Only Gemini-2.5-Pro achieves a non-trivial score of 25%. Our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks.
arXiv Detail & Related papers (2025-03-27T19:21:05Z) - Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
This paper introduces Math-RoB, a novel benchmark that exploits hallucinations triggered by missing information to expose reasoning gaps. We find significant performance degradation on novel or incomplete data, highlighting models' reliance on recall over rigorous logical inference.
arXiv Detail & Related papers (2025-03-06T15:36:06Z) - PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning [36.193595420239845]
We present PhysReason, a 1,200-problem benchmark for evaluating large language models. Problems require an average of 8.1 solution steps, with hard problems requiring 15.6. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation.
arXiv Detail & Related papers (2025-02-17T17:24:14Z) - UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models [39.917074900737575]
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks. The domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs' abilities on the breadth and depth of undergraduate-level physics.
arXiv Detail & Related papers (2025-02-01T06:42:02Z) - LLMPhy: Complex Physical Reasoning Using Large Language Models and World Models [35.01842161084472]
We propose a new physical reasoning task and a dataset, dubbed TraySim.
Our task involves predicting the dynamics of several objects on a tray that is given an external impact.
We present LLMPhy, a zero-shot black-box optimization framework that leverages the physics knowledge and program synthesis abilities of LLMs.
Our results show that the combination of the LLM and the physics engine leads to state-of-the-art zero-shot physical reasoning performance.
arXiv Detail & Related papers (2024-11-12T18:56:58Z) - Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z) - Exploring the Limits of Fine-grained LLM-based Physics Inference via Premise Removal Interventions [0.0]
Language models (LMs) can hallucinate when performing complex mathematical reasoning.
Physical context requires that any symbolic manipulation satisfies complex semantics.
We show that LMs' mathematical reasoning is not physics-informed in this setting.
arXiv Detail & Related papers (2024-04-29T02:43:23Z) - PhyRecon: Physically Plausible Neural Scene Reconstruction [81.73129450090684]
We introduce PHYRECON, the first approach to leverage both differentiable rendering and differentiable physics simulation to learn implicit surface representations.
Central to this design is an efficient transformation between SDF-based implicit representations and explicit surface points.
Our results also exhibit superior physical stability in physical simulators, with at least a 40% improvement across all datasets.
arXiv Detail & Related papers (2024-04-25T15:06:58Z) - OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems [62.06169250463104]
We present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions.
The best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics.
Our analysis of GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies.
arXiv Detail & Related papers (2024-02-21T18:49:26Z) - How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark that contains 1000 test samples divided into 5 categories, such as non-existent objects, count of objects, and spatial relationship.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4v, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, the accuracy of every other model in our experiments ranges from 9% to 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z) - ContPhy: Continuum Physical Concept Learning and Reasoning from Videos [86.63174804149216]
ContPhy is a novel benchmark for assessing machine physical commonsense.
We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy.
We also introduce an oracle model (ContPRO) that marries the particle-based physical dynamic models with the recent large language models.
arXiv Detail & Related papers (2024-02-09T01:09:21Z) - GRASP: A novel benchmark for evaluating language GRounding And Situated Physics understanding in multimodal language models [4.354672867211922]
This paper presents GRASP, a novel benchmark to evaluate the language grounding and physical understanding capabilities of video-based multimodal large language models (LLMs).
We use it to evaluate several state-of-the-art multimodal LLMs.
Our evaluation reveals significant shortcomings in the language grounding and intuitive physics capabilities of these models.
arXiv Detail & Related papers (2023-11-15T15:38:28Z) - Physion++: Evaluating Physical Scene Understanding that Requires Online Inference of Different Physical Properties [100.19685489335828]
This work proposes a novel dataset and benchmark, termed Physion++, to rigorously evaluate visual physical prediction in artificial systems.
We test scenarios where accurate prediction relies on estimates of properties such as mass, friction, elasticity, and deformability.
We evaluate the performance of a number of state-of-the-art prediction models that span a variety of levels of learning vs. built-in knowledge, and compare that performance to a set of human predictions.
arXiv Detail & Related papers (2023-06-27T17:59:33Z) - Dynamic Visual Reasoning by Learning Differentiable Physics Models from Video and Language [92.7638697243969]
We propose a unified framework that can jointly learn visual concepts and infer physics models of objects from videos and language.
This is achieved by seamlessly integrating three components: a visual perception module, a concept learner, and a differentiable physics engine.
arXiv Detail & Related papers (2021-10-28T17:59:13Z) - Visual Grounding of Learned Physical Models [66.04898704928517]
Humans intuitively recognize objects' physical properties and predict their motion, even when the objects are engaged in complicated interactions.
We present a neural model that simultaneously reasons about physics and makes future predictions based on visual and dynamics priors.
Experiments show that our model can infer the physical properties within a few observations, which allows the model to quickly adapt to unseen scenarios and make accurate predictions into the future.
arXiv Detail & Related papers (2020-04-28T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.