Fugu-MT paper translation (abstract): Contemporary AI lacks the imagination to diverge or negate in science

Paper abstract: Contemporary AI lacks the imagination to diverge or negate in science

arxiv url: http://arxiv.org/abs/2606.08251v2
Date: Tue, 09 Jun 2026 03:31:39 GMT
Status: translation complete
Last updated in system: 2026-06-10 13:21:50.716221
Title: Contemporary AI lacks the imagination to diverge or negate in science
Title (reference translation): Contemporary AI lacks the imagination to diverge or negate in science
Authors: Honglin Bao, Siyang Wu, Xiao Liu, Sida Li, Shiyun Cao, James A. Evans
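Abstract summary: Authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences were invited to judge ideas that large language models generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. The automated evaluators on which the community currently relies agree only weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains.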
Reference score (independently computed attention score): 4.024882945191284
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Bold projections that artificial intelligence will accelerate scientific discovery have raced ahead of evidence from working scientists, and the field still lacks large-scale, scientist-in-the-loop tests of these claims. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge ideas that large language models (LLMs) generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Three patterns emerge. First, non-reasoning LLMs collapse into a narrow "hivemind" of similar ideas; reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses -- a move humans make more freely. Second, scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. Senior social scientists are the harshest critics, and their skepticism is well-earned: LLMs falter most in pluralistic fields like the social sciences that demand context-aware interpretation and evolving theories. Third, automated evaluators on which the community currently relies -- LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models -- agree only weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains. A Qwen3-14B reward model we post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers. For all the hype, today's scientific AI still represents a collaborator whose imagination, outputs and judgment benefit from human grounding.
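Abstract (reference translation): Bold projections that artificial intelligence will accelerate scientific discovery have run ahead of the evidence from working scientists. Here we mount the largest such evaluation to date and map what AI cannot yet do for science. We invited authors of 121,640 recent preprints across biology, medicine, chemistry, and the social sciences to judge ideas that large language models (LLMs) generated from the context and puzzles of their own papers. 6,749 scientists returned 25,139 sets of ratings on novelty, empirical feasibility, probability of being true, and favorability of adoption. Three patterns emerge. First, non-reasoning LLMs collapse into a narrow "hivemind" of similar ideas; reasoning models roam a wider hypothesis space, yet no model class spontaneously proposes null hypotheses. Second, scientists reward ideas that resemble their own and prize probability over novelty, though social scientists tolerate risk more readily than life scientists. Senior social scientists are the harshest critics, and their skepticism is well-earned. Third, the automated evaluators on which the community currently relies (LLM-as-a-judge, artificial metrics, and even state-of-the-art (SOTA) models) agree only weakly with expert judgment, and retrieval augmentation and scientist persona prompting yield only marginal gains. A Qwen3-14B reward model post-trained on human ratings captures field taste nuances, beats SOTA models by up to 27%, and closes the gap to the inter-rater consistency of independent peer reviewers. Today's scientific AI still represents a collaborator whose imagination, outputs, and judgment benefit from human grounding.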

Related papers