Fugu-MT Paper Translation (Abstract): Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

Paper summary: Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models

arxiv url: http://arxiv.org/abs/2604.18786v1
Date: Mon, 20 Apr 2026 19:48:33 GMT
Status: Translation complete
Last updated in system: 2026-04-22 22:41:49.458335
Title: Experiments or Outcomes? Probing Scientific Feasibility in Large Language Models
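Authors: Seyedali Mohammadi, Manas Gaur, Francis Ferraro
Abstract summary: The paper evaluates large language models (LLMs) under controlled knowledge conditions. It probes robustness by removing portions of the experimental and/or outcome context. It clarifies when experimental evidence benefits LLM-based feasibility assessment and when it introduces fragility.
Reference score (independently calculated attention level): 17.31622097939325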
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scientific feasibility assessment asks whether a claim is consistent with established knowledge and whether experimental evidence could support or refute it. We frame feasibility assessment as a diagnostic reasoning task in which, given a hypothesis, a model predicts feasible or infeasible and justifies its decision. We evaluate large language models (LLMs) under controlled knowledge conditions (hypothesis-only, with experiments, with outcomes, or both) and probe robustness by progressively removing portions of the experimental and/or outcome context. Across multiple LLMs and two datasets, providing outcome evidence is generally more reliable than providing experiment descriptions. Outcomes tend to improve accuracy beyond what internal knowledge alone provides, whereas experimental text can be brittle and may degrade performance when the context is incomplete. These findings clarify when experimental evidence benefits LLM-based feasibility assessment and when it introduces fragility.
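The four knowledge conditions and the progressive removal of context described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the names `Example`, `keep_fraction`, and `build_prompt`, the prompt wording, and the choice to drop trailing sentences first are all hypothetical, since the abstract does not say how context is removed or how prompts are phrased.

```python
from dataclasses import dataclass

# The four knowledge conditions named in the abstract (identifiers are hypothetical).
CONDITIONS = ("hypothesis_only", "with_experiments", "with_outcomes", "both")


@dataclass
class Example:
    hypothesis: str
    experiments: list[str]  # experiment description sentences
    outcomes: list[str]     # outcome sentences


def keep_fraction(items: list[str], fraction: float) -> list[str]:
    """Keep a leading share of the items (assumption: trailing items are removed first)."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return items[: round(len(items) * fraction)]


def build_prompt(ex: Example, condition: str, keep: float = 1.0) -> str:
    """Assemble a feasibility prompt for one condition, keeping `keep` of each context."""
    if condition not in CONDITIONS:
        raise ValueError(f"unknown condition: {condition}")
    parts = [f"Hypothesis: {ex.hypothesis}"]
    if condition in ("with_experiments", "both"):
        kept = keep_fraction(ex.experiments, keep)
        if kept:
            parts.append("Experiments: " + " ".join(kept))
    if condition in ("with_outcomes", "both"):
        kept = keep_fraction(ex.outcomes, keep)
        if kept:
            parts.append("Outcomes: " + " ".join(kept))
    parts.append("Answer 'feasible' or 'infeasible' and justify the decision.")
    return "\n".join(parts)


if __name__ == "__main__":
    # Placeholder example, not taken from either dataset in the paper.
    demo = Example(
        hypothesis="Placeholder hypothesis.",
        experiments=["Setup sentence.", "Procedure sentence."],
        outcomes=["Result sentence."],
    )
    for condition in CONDITIONS:
        print(f"--- {condition} ---")
        print(build_prompt(demo, condition, keep=0.5))
```

Sweeping `keep` from 1.0 down to 0.0 for each condition would give the kind of robustness curve the abstract alludes to; the model call and the scoring of its answer are deliberately left out because the abstract does not specify them.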


Related papers