Fugu-MT Paper Translation (Abstract): SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

Paper overview: SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?

arxiv url: http://arxiv.org/abs/2604.10718v1
Date: Sun, 12 Apr 2026 16:28:51 GMT
Status: Translation complete
System update date: 2026-04-14 20:13:16.186901
Title: SciPredict: Can LLMs Predict the Outcomes of Scientific Experiments in Natural Sciences?
Title (reference translation): Science Prediction: Can LLMs Predict the Outcomes of Scientific Experiments in the Natural Sciences?
Authors: Udari Madhushani Sehwag, Elaine Lau, Haniyeh Ehsani Oskouie, Shayan Shabihi, Erich Liang, Andrea Toledo, Guillermo Mangialardi, Sergio Fonrouge, Ed-Yeremai Hernandez Cardona, Paula Vergara, Utkarsh Tyagi, Chen Bo Calvin Zhang, Pavi Bhatter, Nicholas Johnson, Furong Huang, Ernesto Gabriel Hernandez Montoya, Bing Liu
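Abstract summary: SciPredict is a benchmark of 405 tasks drawn from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. Model accuracies are 14-26%, and human expert performance is $\approx$20%. Human experts, in contrast, demonstrate strong calibration: their accuracy rises from $\approx$5% to $\approx$80% as they judge outcomes to be more predictable without conducting the experiment.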
Reference score (independently computed attention score): 36.56539892571017
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Accelerating scientific discovery requires the identification of which experiments would yield the best outcomes before committing resources to costly physical validation. While existing benchmarks evaluate LLMs on scientific knowledge and reasoning, their ability to predict experimental outcomes - a task where AI could significantly exceed human capabilities - remains largely underexplored. We introduce SciPredict, a benchmark comprising 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two critical questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy? and (b) can such predictions be reliably used in the scientific research process? Evaluations reveal fundamental limitations on both fronts. Model accuracies are 14-26% and human expert performance is $\approx$20%. Although some frontier models exceed human performance, model accuracy is still far below what would enable reliable experimental guidance. Even within the limited performance, models fail to distinguish reliable predictions from unreliable ones, achieving only $\approx$20% accuracy regardless of their confidence or whether they judge outcomes as predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy increases from $\approx$5% to $\approx$80% as they deem outcomes more predictable without conducting the experiment. SciPredict establishes a rigorous framework demonstrating that superhuman performance in experimental science requires not just better predictions, but better awareness of prediction reliability. For reproducibility, all our data and code are provided at https://github.com/scaleapi/scipredict
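Abstract (reference translation): To accelerate scientific discovery, it is necessary to identify which experiments will yield the best outcomes before resources are committed to costly physical validation. Existing benchmarks evaluate LLMs on scientific knowledge and reasoning, but their ability to predict experimental outcomes, a task where AI could significantly exceed human capabilities, remains largely underexplored. SciPredict is a benchmark of 405 tasks derived from recent empirical studies in 33 specialized sub-fields of physics, biology, and chemistry. SciPredict addresses two key questions: (a) can LLMs predict the outcome of scientific experiments with sufficient accuracy, and (b) can such predictions be used reliably in the scientific research process? The evaluation reveals fundamental limitations on both fronts. Model accuracies are 14-26%, and human expert performance is $\approx$20%. Some frontier models exceed human performance, but model accuracy is still far below the level needed for reliable experimental guidance. Even within this limited performance, models cannot distinguish reliable predictions from unreliable ones: they reach only $\approx$20% accuracy regardless of their confidence or of whether they judge the outcome to be predictable without physical experimentation. Human experts, in contrast, demonstrate strong calibration: their accuracy rises from $\approx$5% to $\approx$80% as they judge outcomes to be more predictable without conducting the experiment. SciPredict establishes a rigorous framework showing that superhuman performance in experimental science requires not only better predictions but also better awareness of prediction reliability. For reproducibility, the data and code are provided at https://github.com/scaleapi/scipredict.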
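
Illustrative note: the calibration comparison in the abstract amounts to measuring accuracy separately for each self-rated predictability level. The sketch below shows one simple way such a breakdown could be computed. It is not taken from the paper or its repository; the function name `accuracy_by_bucket`, the bucket labels, and the toy records are all assumptions made up for illustration, and the numbers are not results from SciPredict.

```python
from collections import defaultdict


def accuracy_by_bucket(records):
    """Return accuracy per bucket for (bucket_label, is_correct) pairs.

    Hypothetical helper: a bucket stands for a self-rated predictability
    level (for example "low" or "high") attached to each prediction.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for bucket, is_correct in records:
        totals[bucket] += 1
        hits[bucket] += int(is_correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}


# Made-up toy data, NOT results from the paper.
toy_records = [
    ("low", False), ("low", False), ("low", False), ("low", True),
    ("high", True), ("high", True), ("high", True), ("high", False),
]

per_bucket = accuracy_by_bucket(toy_records)
for bucket, acc in per_bucket.items():
    print(f"{bucket}: {acc:.2f}")
```

Under this reading, a well-calibrated predictor shows accuracy that climbs from the low bucket to the high bucket, while a poorly calibrated one shows roughly the same accuracy in every bucket.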


Related papers