Scaling Physical Reasoning with the PHYSICS Dataset
- URL: http://arxiv.org/abs/2506.00022v3
- Date: Mon, 28 Jul 2025 16:50:18 GMT
- Title: Scaling Physical Reasoning with the PHYSICS Dataset
- Authors: Shenghe Zheng, Qianjia Cheng, Junchi Yao, Mengsong Wu, Haonan He, Ning Ding, Yu Cheng, Shuyue Hu, Lei Bai, Dongzhan Zhou, Ganqu Cui, Peng Ye
- Abstract summary: PHYSICS is a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses.
- Score: 32.956687630330116
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have achieved remarkable progress on advanced reasoning tasks such as mathematics and coding competitions. Meanwhile, physics, despite being both reasoning-intensive and essential to real-world understanding, has received limited academic and industrial attention. This paper introduces PHYSICS, a dataset containing 16,568 high-quality physics problems spanning subjects and difficulty levels, to help address this gap. Specifically, PHYSICS is curated from exercises in over 100 textbooks through a carefully designed quality-control pipeline. It covers five major physics domains: Mechanics, Electromagnetism, Thermodynamics, Optics, and Modern Physics. It also spans a wide range of difficulty levels, from high school to graduate-level physics courses. To make the data useful for improving and evaluating models' physical reasoning capabilities, we split the dataset into training and test sets, and provide reasoning paths generated by powerful reasoning models for the training data to facilitate model training. In addition, for evaluation, we find that existing evaluation frameworks exhibit biases in aspects such as units, simplification, and precision in the physics domain. To balance efficiency and accuracy, we introduce a Rule+Model evaluation framework tailored to physics problems. Our evaluations of current state-of-the-art open-source and proprietary models highlight their limitations in handling physics-related tasks. We hope that our dataset and evaluation methodology will jointly advance the development of LLMs in the field of physics.
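The abstract does not spell out how the Rule+Model framework is implemented, so the sketch below only illustrates the general idea: cheap rule-based checks (unit normalization plus a tolerant numeric comparison) decide clear-cut cases, and a model-based judge handles answers the rules cannot parse. The unit table and the functions parse_quantity, rule_check, and model_judge are hypothetical names for this illustration, not the paper's released code, and dimensional consistency between units is not checked.

```python
import math
import re

# Hypothetical conversion factors that bring a recognized unit to its SI base unit.
UNIT_TO_SI = {
    "m": 1.0, "cm": 1e-2, "mm": 1e-3, "km": 1e3,
    "s": 1.0, "ms": 1e-3,
    "J": 1.0, "kJ": 1e3, "eV": 1.602176634e-19,
}

def parse_quantity(text):
    """Parse strings like '3.2e4 J' or '0.5 cm' into a value expressed in SI units."""
    match = re.fullmatch(r"\s*([-+0-9.eE]+)\s*([A-Za-z]*)\s*", text)
    if not match:
        return None
    value, unit = float(match.group(1)), match.group(2)
    if unit and unit not in UNIT_TO_SI:
        return None  # unknown unit: defer to the model judge
    return value * UNIT_TO_SI.get(unit, 1.0)

def rule_check(prediction, reference, rel_tol=1e-2):
    """Return True/False when the rules can decide, or None to defer to the model judge."""
    pred, ref = parse_quantity(prediction), parse_quantity(reference)
    if pred is None or ref is None:
        return None  # symbolic or unparsable answers are deferred
    return math.isclose(pred, ref, rel_tol=rel_tol)

def model_judge(prediction, reference):
    """Placeholder for an LLM-based equivalence judgment (e.g. checking symbolic simplification)."""
    raise NotImplementedError("call a judging model here")

def evaluate(prediction, reference):
    verdict = rule_check(prediction, reference)
    if verdict is not None:
        return verdict, "rule"
    return model_judge(prediction, reference), "model"

if __name__ == "__main__":
    print(evaluate("50 cm", "0.5 m"))    # (True, 'rule'): unit-normalized match
    print(evaluate("3.0e3 J", "3 kJ"))   # (True, 'rule'): tolerant numeric comparison
```

In this reading, the "Rule" stage keeps evaluation fast and deterministic for numeric answers, while the "Model" stage absorbs the cases that trip up purely rule-based graders (units, simplification, precision) that the abstract identifies as sources of bias.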
Related papers
- PhysicsEval: Inference-Time Techniques to Improve the Reasoning Proficiency of Large Language Models on Physics Problems [3.0901186959880977]
We evaluate the performance of frontier LLMs in solving physics problems, both mathematical and descriptive. We introduce a new evaluation benchmark for physics problems, PhysicsEval, consisting of 19,609 problems sourced from various physics textbooks.
arXiv Detail & Related papers (2025-07-31T18:12:51Z) - ABench-Physics: Benchmarking Physical Reasoning in LLMs via High-Difficulty and Dynamic Physics Problems [21.278539804482012]
Large Language Models (LLMs) have shown impressive performance in domains such as mathematics and programming. Physics poses unique challenges that demand not only precise computation but also deep conceptual understanding and physical modeling skills. Existing benchmarks often fall short due to limited difficulty, multiple-choice formats, and static evaluation settings.
arXiv Detail & Related papers (2025-07-07T08:43:56Z) - PhysUniBench: An Undergraduate-Level Physics Reasoning Benchmark for Multimodal Models [69.73115077227969]
We present PhysUniBench, a large-scale benchmark designed to evaluate and improve the reasoning capabilities of multimodal large language models (MLLMs). PhysUniBench consists of 3,304 physics questions spanning 8 major sub-disciplines of physics, each accompanied by one visual diagram. The benchmark's construction involved a rigorous multi-stage process, including multiple roll-outs, expert-level evaluation, automated filtering of easily solved problems, and a nuanced difficulty grading system with five levels.
arXiv Detail & Related papers (2025-06-21T09:55:42Z) - Can Theoretical Physics Research Benefit from Language Agents? [50.57057488167844]
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature. This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and toolboxes. We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments.
arXiv Detail & Related papers (2025-06-06T16:20:06Z) - PhysGaia: A Physics-Aware Dataset of Multi-Body Interactions for Dynamic Novel View Synthesis [62.283499219361595]
PhysGaia is a physics-aware dataset specifically designed for Dynamic Novel View Synthesis (DyNVS). Our dataset provides complex dynamic scenarios with rich interactions among multiple objects. PhysGaia will significantly advance research in dynamic view synthesis, physics-based scene understanding, and deep learning models integrated with physical simulation.
arXiv Detail & Related papers (2025-06-03T12:19:18Z) - PhyX: Does Your Model Have the "Wits" for Physical Reasoning? [49.083544963243206]
Existing benchmarks fail to capture a crucial aspect of intelligence: physical reasoning. We introduce PhyX: the first large-scale benchmark designed to assess models' capacity for physics-grounded reasoning in visual scenarios.
arXiv Detail & Related papers (2025-05-21T18:33:50Z) - PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning [36.193595420239845]
We present PhysReason, a 1,200-problem benchmark for evaluating large language models. Problems require an average of 8.1 solution steps, with hard problems requiring 15.6. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation.
arXiv Detail & Related papers (2025-02-17T17:24:14Z) - UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models [39.917074900737575]
Large language models (LLMs) have demonstrated remarkable capabilities in solving complex reasoning tasks. The domain of physics reasoning presents unique challenges that have received significantly less attention. Existing benchmarks often fall short in evaluating LLMs' abilities on the breadth and depth of undergraduate-level physics.
arXiv Detail & Related papers (2025-02-01T06:42:02Z) - Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z) - ContPhy: Continuum Physical Concept Learning and Reasoning from Videos [86.63174804149216]
ContPhy is a novel benchmark for assessing machine physical commonsense.
We evaluated a range of AI models and found that they still struggle to achieve satisfactory performance on ContPhy.
We also introduce an oracle model (ContPRO) that marries particle-based physical dynamics models with recent large language models.
arXiv Detail & Related papers (2024-02-09T01:09:21Z) - Physics-Integrated Variational Autoencoders for Robust and Interpretable Generative Modeling [86.9726984929758]
We focus on the integration of incomplete physics models into deep generative models.
We propose a VAE architecture in which a part of the latent space is grounded by physics.
We demonstrate generative performance improvements over a set of synthetic and real-world datasets.
arXiv Detail & Related papers (2021-02-25T20:28:52Z)