Fugu-MT Paper Translation (Abstract): DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

Paper overview: DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

arxiv url: http://arxiv.org/abs/2605.26087v1
Date: Mon, 25 May 2026 17:50:07 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:20.631369
Title: DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
Authors: Matt L. Wiemann, Lindsay M. Smith, Peter Melchior, Siddharth Mishra-Sharma, Andrew Gordon Wilson, Pavel Izmailov, Carolina Cuesta-Lázaro
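Abstract summary: We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning.
Reference score (independently computed attention measure): 36.38263429163835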
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Frontier LLMs now perform strongly across a wide range of physics evaluations, but it is hard to disentangle genuine reasoning from recall of established science. We introduce DiscoverPhysics, an interactive benchmark that asks an LLM agent to discover the laws of motion of a simulated world whose physics deliberately deviates from our own. We construct 22 worlds governed by, among others, screened and fractional-power gravity, multi-species couplings, hidden dark-matter-like particles, non-coordinate-free physics, and time-varying interactions. Each world is generated on demand by an N-body simulator, for which the agent proposes several rounds of experiments, observes raw trajectory data, and ultimately submits both a natural-language explanation of the world's physics and a Python implementation of the inferred law. Because solving a world requires the agent to design informative experiments and revise its hypotheses, the benchmark probes long-horizon reasoning over an experimental history. We evaluate submissions along two complementary axes: trajectory MSE on held-out particles and an LLM-judged explanation score following an expert-written rubric assessing conceptual understanding of each world. Across eleven frontier models, we find that the strongest agents pass only half of the worlds and consistently fail on those where latent structure must be uncovered. Open-source models lag substantially behind commercial models, both in their ability to design informative experiments and in extracting conclusions from the data. We further find that good predictive accuracy does not guarantee high explanation quality and that conceptual understanding depends on hypothesis refinement through well-chosen experiments.
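The trajectory-MSE axis described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' simulator or scoring code: the function names (`accelerations`, `simulate`, `trajectory_mse`), the 2D setup, the softened power-law attraction of magnitude G * m_j / r**power, the velocity Verlet integrator, and the treatment of "held-out particles" as a subset of particle indices. The sketch generates a world with a fractional-power force law and scores two candidate laws against it.

```python
import math


def accelerations(pos, masses, power, G=1.0, eps=1e-3):
    """Pairwise attraction of magnitude G * m_j / r**power (power=2 is Newtonian)."""
    n = len(pos)
    acc = [[0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = pos[j][0] - pos[i][0]
            dy = pos[j][1] - pos[i][1]
            r = math.sqrt(dx * dx + dy * dy + eps * eps)  # softened distance
            a = G * masses[j] / r ** power
            acc[i][0] += a * dx / r
            acc[i][1] += a * dy / r
    return acc


def simulate(pos0, vel0, masses, power, dt=0.01, steps=200):
    """Integrate with velocity Verlet; return a list of position snapshots."""
    pos = [p[:] for p in pos0]
    vel = [v[:] for v in vel0]
    acc = accelerations(pos, masses, power)
    traj = [[p[:] for p in pos]]
    for _ in range(steps):
        for i in range(len(pos)):
            for k in range(2):
                vel[i][k] += 0.5 * dt * acc[i][k]  # first half kick
                pos[i][k] += dt * vel[i][k]        # drift
        acc = accelerations(pos, masses, power)
        for i in range(len(pos)):
            for k in range(2):
                vel[i][k] += 0.5 * dt * acc[i][k]  # second half kick
        traj.append([p[:] for p in pos])
    return traj


def trajectory_mse(traj_a, traj_b, particles):
    """Mean squared position error over the chosen particles and all snapshots."""
    total, count = 0.0, 0
    for snap_a, snap_b in zip(traj_a, traj_b):
        for i in particles:
            for k in range(2):
                total += (snap_a[i][k] - snap_b[i][k]) ** 2
                count += 1
    return total / count


# Hypothetical world: one heavy body and three light ones, true exponent 1.5.
pos0 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.5], [-1.2, -0.4]]
vel0 = [[0.0, 0.0], [0.0, 1.0], [-0.8, 0.0], [0.3, -0.7]]
masses = [1.0, 0.01, 0.01, 0.01]
held_out = [2, 3]  # indices scored; a stand-in for the paper's held-out particles

truth = simulate(pos0, vel0, masses, power=1.5)
mse_newton = trajectory_mse(truth, simulate(pos0, vel0, masses, power=2.0), held_out)
mse_correct = trajectory_mse(truth, simulate(pos0, vel0, masses, power=1.5), held_out)
print("Newtonian guess MSE:", mse_newton)
print("Correct exponent MSE:", mse_correct)
```

In this toy setting a candidate law that assumes ordinary inverse-square gravity accumulates a nonzero trajectory error, while the law with the matching exponent reproduces the trajectories exactly. The abstract's further point, that low predictive error does not guarantee a good explanation, is outside what this sketch can show.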
