PhysProver: Advancing Automatic Theorem Proving for Physics
- URL: http://arxiv.org/abs/2601.15737v1
- Date: Thu, 22 Jan 2026 08:05:32 GMT
- Title: PhysProver: Advancing Automatic Theorem Proving for Physics
- Authors: Hanning Zhang, Ruida Wang, Rui Pan, Wenyuan Wang, Bingxu Meng, Tong Zhang,
- Abstract summary: This paper presents the first approach to enhance formal theorem proving in the physics domain.<n>It is composed of theorems sampled from PhysLean and data generated by a conjecture-based formal data generation pipeline.<n>In the training pipeline, we leverage DeepSeek-Prover-V2-7B, a strong open-source mathematical theorem prover, and apply Reinforcement Learning with Verifiable Rewards (RLVR) to train our model PhysProver.
- Score: 17.074001092418793
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The combination of verifiable languages and LLMs has significantly influenced both the mathematical and computer science communities because it provides a rigorous foundation for theorem proving. Recent advancements in the field provide foundation models and sophisticated agentic systems pushing the boundaries of formal mathematical reasoning to approach the natural language capability of LLMs. However, little attention has been given to the formal physics reasoning, which also heavily relies on similar problem-solving and theorem-proving frameworks. To solve this problem, this paper presents, to the best of our knowledge, the first approach to enhance formal theorem proving in the physics domain. We compose a dedicated dataset PhysLeanData for the task. It is composed of theorems sampled from PhysLean and data generated by a conjecture-based formal data generation pipeline. In the training pipeline, we leverage DeepSeek-Prover-V2-7B, a strong open-source mathematical theorem prover, and apply Reinforcement Learning with Verifiable Rewards (RLVR) to train our model PhysProver. Comprehensive experiments demonstrate that, using only $\sim$5K training samples, PhysProver achieves an overall 2.4\% improvement in multiple sub-domains. Furthermore, after formal physics training, we observe 1.3\% gains on the MiniF2F-Test benchmark, which indicates non-trivial generalization beyond physics domains and enhancement for formal math capability as well. The results highlight the effectiveness and efficiency of our approach, which provides a paradigm for extending formal provers outside mathematical domains. To foster further research, we will release both our dataset and model to the community.
Related papers
- Seed-Prover 1.5: Mastering Undergraduate-Level Theorem Proving via Learning from Experience [38.07918040429112]
We present textbfSeed-Prover 1.5, a formal theorem-proving model trained via large-scale agentic reinforcement learning.<n>Using our system, we solved textbf11 out of 12 problems from Putnam 2025 within 9 hours.
arXiv Detail & Related papers (2025-12-19T06:19:55Z) - Evaluating Frontier LLMs on PhD-Level Mathematical Reasoning: A Benchmark on a Textbook in Theoretical Computer Science about Randomized Algorithms [14.853721511192736]
Large language models (LLMs) have led to breakthroughs in automated mathematical reasoning and scientific discovery.<n>We present a benchmark of four frontier models: GP-5-Thinking, Gemini-3-Pro, Claude-Sonnet-4.5-Thinking, and Grok-4.<n>We find that while the top-tier models achieve a high accuracy rate, other models lag significantly in consistency.
arXiv Detail & Related papers (2025-12-16T00:34:55Z) - Benchmarking Foundation Models with Retrieval-Augmented Generation in Olympic-Level Physics Problem Solving [56.119382216818195]
Retrieval-augmented generation (RAG) with foundation models has achieved strong performance across diverse tasks.<n>But their capacity for expert-level reasoning-such as solving Olympiad-level physics problems-remains largely unexplored.<n>We introduce PhoPile, a high-quality multimodal dataset specifically designed for Olympiad-level physics.<n>Using PhoPile, we benchmark RAG-augmented foundation models, covering both large language models (LLMs) and large multimodal models (LMMs) with multiple retrievers.
arXiv Detail & Related papers (2025-10-01T13:57:53Z) - CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics [71.42168240638462]
CMPhysBench is designed to assess the proficiency of Large Language Models in Condensed Matter Physics.<n>Our results show that even the best models, Grok-4, reach only 36 average SEED score and 28% accuracy on CMPhysBench.
arXiv Detail & Related papers (2025-08-25T15:32:22Z) - Can Theoretical Physics Research Benefit from Language Agents? [50.57057488167844]
Large Language Models (LLMs) are rapidly advancing across diverse domains, yet their application in theoretical physics research is not yet mature.<n>This position paper argues that LLM agents can potentially help accelerate theoretical, computational, and applied physics when properly integrated with domain knowledge and toolbox.<n>We envision future physics-specialized LLMs that could handle multimodal data, propose testable hypotheses, and design experiments.
arXiv Detail & Related papers (2025-06-06T16:20:06Z) - DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning [67.93945726549289]
DeepTheorem is a comprehensive informal theorem-proving framework exploiting natural language to enhance mathematical reasoning.<n>DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs.<n>We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference.
arXiv Detail & Related papers (2025-05-29T17:59:39Z) - Alchemy: Amplifying Theorem-Proving Capability through Symbolic Mutation [71.32761934724867]
This work proposes Alchemy, a framework for data synthesis that constructs formal theorems through symbolic mutation.<n>For each candidate theorem in Mathlib, we identify all invocable theorems that can be used to rewrite or apply to it.<n>As a result, our method increases the number of theorems in Mathlib by an order of magnitude, from 110k to 6M.
arXiv Detail & Related papers (2024-10-21T08:04:21Z) - LeanAgent: Lifelong Learning for Formal Theorem Proving [85.39415834798385]
We present LeanAgent, a novel lifelong learning framework for formal theorem proving.<n>LeanAgent continuously generalizes to and improves on ever-expanding mathematical knowledge.<n>It generates formal proofs for 155 theorems across 23 diverse Lean repositories.
arXiv Detail & Related papers (2024-10-08T17:11:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.