P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads
- URL: http://arxiv.org/abs/2602.09443v1
- Date: Tue, 10 Feb 2026 06:28:08 GMT
- Title: P1-VL: Bridging Visual Perception and Scientific Reasoning in Physics Olympiads
- Authors: Yun Luo, Futing Wang, Qianjia Cheng, Fangchen Yu, Haodi Lei, Jianhao Yan, Chenxi Li, Jiacheng Chen, Yufeng Zhao, Haiyuan Wan, Yuchen Zhang, Shenghe Zheng, Junchi Yao, Qingyang Zhang, Haonan He, Wenxuan Zeng, Li Sheng, Chengxing Xie, Yuxin Zuo, Yizhuo Li, Yulun Wu, Rui Huang, Dongzhan Zhou, Kai Chen, Yu Qiao, Lei Bai, Yu Cheng, Ning Ding, Bowen Zhou, Peng Ye, Ganqu Cui
- Abstract summary: We introduce P1-VL, a family of open-source vision-language models engineered for advanced scientific reasoning. Our flagship P1-VL-235B-A22B becomes the first open-source Vision-Language Model to secure 12 gold medals and achieves state-of-the-art performance among open-source models.
- Score: 91.05736019384489
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The transition from symbolic manipulation to science-grade reasoning represents a pivotal frontier for Large Language Models (LLMs), with physics serving as the critical test anchor for binding abstract logic to physical reality. Physics demands that a model maintain consistency with the laws governing the universe, a task that fundamentally requires multimodal perception to ground abstract logic in reality. At the Olympiad level, diagrams are often constitutive rather than illustrative, containing essential constraints, such as boundary conditions and spatial symmetries, that are absent from the text. To bridge this visual-logical gap, we introduce P1-VL, a family of open-source vision-language models engineered for advanced scientific reasoning. Our method harmonizes Curriculum Reinforcement Learning, which employs progressive difficulty expansion to stabilize post-training, with Agentic Augmentation, which enables iterative self-verification at inference. Evaluated on HiPhO, a rigorous benchmark of 13 exams from 2024-2025, our flagship P1-VL-235B-A22B becomes the first open-source Vision-Language Model (VLM) to secure 12 gold medals and achieves state-of-the-art performance among open-source models. Our agent-augmented system achieves the No. 2 overall rank globally, trailing only Gemini-3-Pro. Beyond physics, P1-VL demonstrates remarkable scientific reasoning capacity and generalizability, establishing significant leads over its base models on STEM benchmarks. By open-sourcing P1-VL, we provide a foundational step toward general-purpose physical intelligence that better aligns visual perception with abstract physical laws for machine scientific discovery.
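The abstract pairs two mechanisms: Curriculum Reinforcement Learning via progressive difficulty expansion during post-training, and Agentic Augmentation via iterative self-verification at inference. The Python sketch below is a minimal illustration of how those two ideas could be wired together; it is not the P1-VL implementation, and every name in it (`Problem`, `curriculum_batches`, `solve_with_self_verification`, the dummy `solve`/`verify` callables) is a hypothetical stand-in.

```python
# Minimal sketch (not the authors' code) of two ideas named in the abstract:
# (1) curriculum RL via progressive difficulty expansion, and
# (2) agentic augmentation via iterative self-verification at inference.
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Problem:
    prompt: str
    difficulty: int  # hypothetical label, e.g. 1 (introductory) .. 5 (Olympiad)


def curriculum_batches(pool: List[Problem], max_stage: int, batch_size: int = 4):
    """Yield (stage, batch) pairs whose difficulty ceiling grows stage by stage,
    so early RL updates see only easier problems and later ones the full pool."""
    for stage in range(1, max_stage + 1):
        eligible = [p for p in pool if p.difficulty <= stage]
        random.shuffle(eligible)
        for i in range(0, len(eligible), batch_size):
            yield stage, eligible[i:i + batch_size]


def solve_with_self_verification(problem: Problem,
                                 solve: Callable[[str], str],
                                 verify: Callable[[str, str], bool],
                                 max_rounds: int = 3) -> str:
    """Draft an answer, then iteratively re-check and revise it until the
    verifier accepts it or the round budget is exhausted."""
    answer = solve(problem.prompt)
    for _ in range(max_rounds - 1):
        if verify(problem.prompt, answer):
            break
        # Feed the rejected attempt back as context for the next draft.
        answer = solve(problem.prompt + "\n\nPrevious attempt (rejected):\n" + answer)
    return answer


if __name__ == "__main__":
    pool = [Problem(f"problem-{i}", difficulty=1 + i % 5) for i in range(20)]

    for stage, batch in curriculum_batches(pool, max_stage=5):
        # A verifiable-reward RL update (e.g., GRPO/PPO) on `batch` would go here.
        pass

    # Dummy stand-ins for what would be VLM calls in a real system.
    final = solve_with_self_verification(
        pool[0],
        solve=lambda prompt: "draft answer for " + prompt.splitlines()[0],
        verify=lambda prompt, ans: ans.startswith("draft"),
    )
    print(final)
```

In a real system the `solve`/`verify` callables would be VLM calls and the curriculum stage would gate which problems enter each RL batch; the sketch only shows the control flow.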
Related papers
- LUMINA: Foundation Models for Topology Transferable ACOPF [12.543812430874508]
Foundation models in general promise to accelerate scientific computation by learning reusable representations across problem instances, yet constrained scientific systems remain challenging. We derive design principles for constrained scientific foundation models through systematic investigation of AC optimal power flow (ACOPF). We characterize three design trade-offs: learning physics-invariant representations while respecting system-specific constraints, optimizing accuracy while ensuring constraint satisfaction, and ensuring reliability in high-impact operating regimes.
arXiv Detail & Related papers (2026-03-04T17:20:08Z)
- HOLOGRAPH: Active Causal Discovery via Sheaf-Theoretic Alignment of Large Language Model Priors [12.969042037563971]
HOLOGRAPH is a framework that formalizes Large Language Model-guided causal discovery. Our key insight is that coherent global causal structure corresponds to the existence of a global section. Experiments on synthetic and real-world benchmarks demonstrate that HOLOGRAPH provides rigorous mathematical foundations.
arXiv Detail & Related papers (2025-12-30T21:47:05Z)
- SpatialLadder: Progressive Training for Spatial Reasoning in Vision-Language Models [73.19077622773075]
We present a comprehensive methodology for building spatial intelligence progressively. We introduce SpatialLadder-26k, a multimodal dataset containing 26,610 samples spanning object localization, single-image, multi-view, and video spatial reasoning tasks. We design a three-stage progressive training framework that establishes spatial perception through object localization, develops spatial understanding through multi-dimensional spatial tasks, and strengthens complex reasoning via reinforcement learning with verifiable rewards.
arXiv Detail & Related papers (2025-10-09T17:50:54Z)
- Interpretable Physics Reasoning and Performance Taxonomy in Vision-Language Models [0.523693719989689]
We introduce a novel framework designed to rigorously evaluate Vision-Language Models (VLMs) on their understanding of 2D physics. Our framework features a pragmatic scenario generator that creates a diverse testbed of over 400 problems across four core domains: Projectile Motion, Collision Dynamics, Mechanics, and Fluid Dynamics. We demonstrate a strong correlation between model scale and reasoning ability, with our top-performing model, Qwen2.5-VL-7B, achieving an overall score of 0.815.
arXiv Detail & Related papers (2025-09-10T04:15:01Z)
- Mimicking the Physicist's Eye: A VLM-centric Approach for Physics Formula Discovery [98.58830663687911]
VIPER-R1 is a multimodal model that performs Visual Induction for Equation Reasoning. It integrates visual perception, trajectory data, and symbolic reasoning to emulate the scientific discovery process. It consistently outperforms state-of-the-art VLM baselines in accuracy and interpretability.
arXiv Detail & Related papers (2025-08-24T14:34:21Z)
- From Diagnosis to Improvement: Probing Spatio-Physical Reasoning in Vision Language Models [10.740632493925018]
Physical reasoning is a critical step towards building robust world models. Recent vision-language models (VLMs) have shown remarkable progress in specialized domains, but their capability for physical reasoning remains largely unexplored.
arXiv Detail & Related papers (2025-08-14T15:55:48Z)
- Do Vision-Language Models Have Internal World Models? Towards an Atomic Evaluation [54.3628937181904]
Internal world models (WMs) enable agents to understand the world's state and predict transitions. Recent large Vision-Language Models (VLMs), such as OpenAI o3, GPT-4o and Gemini, exhibit potential as general-purpose WMs.
arXiv Detail & Related papers (2025-06-27T03:24:29Z)
- SeePhys: Does Seeing Help Thinking? -- Benchmarking Vision-Based Physics Reasoning [95.2886065291234]
We present SeePhys, a large-scale multimodal benchmark for reasoning grounded in physics questions. The benchmark covers 7 fundamental domains spanning the physics discipline, incorporating 21 categories of highly heterogeneous diagrams. We observe that even the most advanced visual reasoning models (e.g., Gemini-2.5-pro and o4-mini) achieve sub-60% accuracy on our benchmark.
arXiv Detail & Related papers (2025-05-25T11:28:34Z)
- Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation [51.750634349748736]
Text-to-video (T2V) models have made significant strides in visualizing complex prompts.
However, the capacity of these models to accurately represent intuitive physics remains largely unexplored.
We introduce PhyGenBench to evaluate physical commonsense correctness in T2V generation.
arXiv Detail & Related papers (2024-10-07T17:56:04Z)