Fugu-MT Paper Translation (Abstract): HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?

Paper summary: HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?

arxiv url: http://arxiv.org/abs/2509.07894v1
Date: Tue, 09 Sep 2025 16:24:51 GMT
Status: Translation complete
System update date: 2025-09-10 14:38:27.398216
Title: HiPhO: How Far Are (M)LLMs from Humans in the Latest High School Physics Olympiad Benchmark?
Title (reference translation): HiPhO: How far are (M)LLMs from humans in the latest high school physics Olympiad benchmark?
Authors: Fangchen Yu, Haiyuan Wan, Qianjia Cheng, Yuchen Zhang, Jiacheng Chen, Fujun Han, Yulun Wu, Junchi Yao, Ruilizhen Hu, Ning Ding, Yu Cheng, Tao Chen, Lei Bai, Dongzhan Zhou, Yun Luo, Ganqu Cui, Peng Ye
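Abstract summary: HiPhO is the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. It compiles 13 Olympiad exams from 2024-2025, spanning both international and regional competitions. Models are assigned gold, silver, and bronze medals based on official medal thresholds, enabling direct comparison between (M)LLMs and human contestants.
Reference score (independently computed attention measure): 53.76627321546095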
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, the physical capabilities of (M)LLMs have garnered increasing attention. However, existing benchmarks for physics suffer from two major gaps: they neither provide systematic and up-to-date coverage of real-world physics competitions such as physics Olympiads, nor enable direct performance comparison with humans. To bridge these gaps, we present HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. Specifically, HiPhO highlights three key innovations. (1) Comprehensive Data: It compiles 13 latest Olympiad exams from 2024-2025, spanning both international and regional competitions, and covering mixed modalities that encompass problems spanning text-only to diagram-based. (2) Professional Evaluation: We adopt official marking schemes to perform fine-grained grading at both the answer and step level, fully aligned with human examiners to ensure high-quality and domain-specific evaluation. (3) Comparison with Human Contestants: We assign gold, silver, and bronze medals to models based on official medal thresholds, thereby enabling direct comparison between (M)LLMs and human contestants. Our large-scale evaluation of 30 state-of-the-art (M)LLMs shows that: across 13 exams, open-source MLLMs mostly remain at or below the bronze level; open-source LLMs show promising progress with occasional golds; closed-source reasoning MLLMs can achieve 6 to 12 gold medals; and most models still have a significant gap from full marks. These results highlight a substantial performance gap between open-source models and top students, the strong physical reasoning capabilities of closed-source reasoning models, and the fact that there is still significant room for improvement. HiPhO, as a rigorous, human-aligned, and Olympiad-focused benchmark for advancing multimodal physical reasoning, is open-source and available at https://github.com/SciYu/HiPhO.
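Abstract (reference translation): In recent years, the physical capabilities of (M)LLMs have attracted growing attention. However, existing physics benchmarks have two major gaps: they neither offer systematic, up-to-date coverage of real-world physics competitions such as physics Olympiads, nor allow direct performance comparison with humans. To close these gaps, the authors present HiPhO, the first benchmark dedicated to high school physics Olympiads with human-aligned evaluation. HiPhO highlights three key innovations. (1) Comprehensive data: it compiles the 13 latest Olympiad exams from 2024-2025, covering both international and regional competitions and mixed modalities ranging from text-only to diagram-based problems. (2) Professional evaluation: it adopts official marking schemes for fine-grained grading at both the answer level and the step level, aligned with human examiners to ensure high-quality, domain-specific evaluation. (3) Comparison with human contestants: it assigns gold, silver, and bronze medals to models based on official medal thresholds, allowing (M)LLMs and human contestants to be compared directly. A large-scale evaluation of 30 state-of-the-art (M)LLMs shows that, across the 13 exams, open-source MLLMs mostly remain at or below the bronze level; open-source LLMs show promising progress with occasional golds; closed-source reasoning MLLMs can earn 6 to 12 gold medals; and most models remain far from full marks. These results highlight a substantial performance gap between open-source models and top students, the strong physical reasoning capabilities of closed-source reasoning models, and the significant room for improvement that remains. HiPhO, a rigorous, human-aligned, Olympiad-focused benchmark for advancing multimodal physical reasoning, is open-source and available at https://github.com/SciYu/HiPhO.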
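The abstract's medal assignment can be illustrated with a minimal sketch. This is not the HiPhO implementation: the function name `assign_medal`, the cutoff values, and the choice to treat a score equal to a cutoff as earning the medal are all assumptions made for illustration.

```python
from typing import Dict


def assign_medal(score: float, thresholds: Dict[str, float]) -> str:
    """Return the highest medal whose cutoff the score meets, else "none".

    Assumption: a score equal to a cutoff earns that medal.
    """
    # Check medals from highest to lowest so the best match wins.
    for medal in ("gold", "silver", "bronze"):
        if score >= thresholds[medal]:
            return medal
    return "none"


# Hypothetical cutoffs for a single exam; these numbers are NOT from the paper.
example_thresholds = {"gold": 30.0, "silver": 22.0, "bronze": 15.0}

if __name__ == "__main__":
    for model_score in (31.5, 22.0, 9.0):
        print(model_score, assign_medal(model_score, example_thresholds))
```

Applying such a function per exam, with each exam's own official cutoffs, would yield per-model medal counts of the kind the abstract reports.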

Related papers