Fugu-MT Paper Translation (Abstract): Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection

Paper overview: Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection

arxiv url: http://arxiv.org/abs/2603.12349v1
Date: Thu, 12 Mar 2026 18:09:53 GMT
Status: translation complete
Last updated in system: 2026-03-16 17:38:11.708858
Title: Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection
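Title (reference translation): Budget-Sensitive Discovery Scoring: A Formally Verified Framework for Evaluating AI-Guided Scientific Selection
Authors: Abhinaba Basu, Pavan Chakraborty
Abstract summary: The Budget-Sensitive Discovery Score (BSDS) penalizes false discoveries at each budget level. The Discovery Quality Score (DQS) provides a single summary statistic that no proposer can inflate by performing well at a cherry-picked budget. The framework applies to any setting where candidates are selected under budget constraints and asymmetric error costs.
Reference score (independently computed attention score): 0.0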
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scientific discovery increasingly relies on AI systems to select candidates for expensive experimental validation, yet no principled, budget-aware evaluation framework exists for comparing selection strategies -- a gap intensified by large language models (LLMs), which generate plausible scientific proposals without reliable downstream evaluation. We introduce the Budget-Sensitive Discovery Score (BSDS), a formally verified metric -- 20 theorems machine-checked by the Lean 4 proof assistant -- that jointly penalizes false discoveries (lambda-weighted FDR) and excessive abstention (gamma-weighted coverage gap) at each budget level. Its budget-averaged form, the Discovery Quality Score (DQS), provides a single summary statistic that no proposer can inflate by performing well at a cherry-picked budget. As a case study, we apply BSDS/DQS to: do LLMs add marginal value to an existing ML pipeline for drug discovery candidate selection? We evaluate 39 proposers -- 11 mechanistic variants, 14 zero-shot LLM configurations, and 14 few-shot LLM configurations -- using SMILES representations on MoleculeNet HIV (41,127 compounds, 3.5% active, 1,000 bootstrap replicates) under both random and scaffold splits. Three findings emerge. First, the simple RF-based Greedy-ML proposer achieves the best DQS (-0.046), outperforming all MLP variants and LLM configurations. Second, no LLM surpasses the Greedy-ML baseline under zero-shot or few-shot evaluation on HIV or Tox21, establishing that LLMs provide no marginal value over an existing trained classifier. Third, the proposer hierarchy generalizes across five MoleculeNet benchmarks spanning 0.18%-46.2% prevalence, a non-drug AV safety domain, and a 9x7 grid of penalty parameters (tau >= 0.636, mean tau = 0.863). The framework applies to any setting where candidates are selected under budget constraints and asymmetric error costs.
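Abstract (reference translation): Scientific discovery increasingly relies on AI systems to select candidates for expensive experimental validation, yet no principled, budget-aware evaluation framework exists for comparing selection strategies. The Budget-Sensitive Discovery Score (BSDS), backed by 20 theorems machine-checked with the Lean 4 proof assistant, jointly penalizes false discoveries (lambda-weighted FDR) and excessive abstention (gamma-weighted coverage gap) at each budget level. Its budget-averaged form, the Discovery Quality Score (DQS), provides a single summary statistic that no proposer can inflate by performing well at a cherry-picked budget. As a case study, BSDS/DQS is applied to the question: do LLMs add marginal value to an existing ML pipeline for drug discovery candidate selection? Using SMILES representations on MoleculeNet HIV (41,127 compounds, 3.5% active, 1,000 bootstrap replicates), 39 proposers are evaluated: 11 mechanistic variants, 14 zero-shot LLM configurations, and 14 few-shot LLM configurations. Three findings emerge. First, the simple RF-based Greedy-ML proposer achieves the best DQS (-0.046), outperforming all MLP variants and LLM configurations. Second, no LLM surpasses the Greedy-ML baseline under zero-shot or few-shot evaluation on HIV or Tox21. Third, the proposer hierarchy generalizes across five MoleculeNet benchmarks spanning 0.18%-46.2% prevalence, a non-drug AV safety domain, and a 9x7 grid of penalty parameters (tau >= 0.636, mean tau = 0.863). The framework applies to any setting where candidates are selected under budget constraints and asymmetric error costs.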
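
The abstract describes BSDS only in words: a per-budget score that penalizes false discoveries (lambda-weighted FDR) and abstention (gamma-weighted coverage gap), with DQS as its average over budgets. It does not give the formula. The Python sketch below is therefore an illustration under stated assumptions, not the paper's definition: the additive form `precision - lam * fdr - gamma * coverage_gap`, the function names `bsds_at_budget` and `dqs`, and the default parameter values are all hypothetical.

```python
from typing import Sequence


def bsds_at_budget(selected_labels: Sequence[int], budget: int,
                   lam: float = 1.0, gamma: float = 0.5) -> float:
    """Hypothetical BSDS-like score at a single budget level.

    selected_labels: 0/1 outcomes (1 = true discovery) for the candidates
    the proposer actually selected; it may select fewer than `budget`.
    ASSUMPTION: the additive form below is illustrative only.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    n_selected = len(selected_labels)
    if n_selected > budget:
        raise ValueError("proposer selected more candidates than the budget")

    hits = sum(selected_labels)
    # Precision and false discovery rate among the selected candidates.
    precision = hits / n_selected if n_selected else 0.0
    fdr = (n_selected - hits) / n_selected if n_selected else 0.0
    # Coverage gap: fraction of the budget left unused (abstention).
    coverage_gap = (budget - n_selected) / budget
    return precision - lam * fdr - gamma * coverage_gap


def dqs(ranked_labels: Sequence[int], budgets: Sequence[int],
        lam: float = 1.0, gamma: float = 0.5) -> float:
    """Budget-averaged score: mean of the per-budget scores.

    ranked_labels: 0/1 outcomes in the proposer's ranking order; a list
    shorter than a budget models abstention at that budget.
    """
    scores = [bsds_at_budget(ranked_labels[:b], b, lam, gamma) for b in budgets]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # A proposer that returns two correct candidates and then abstains.
    print(dqs([1, 1], budgets=[2, 4], lam=1.0, gamma=0.5))
```

Averaging over all budgets is what keeps a proposer from looking good at one favorable budget: in the usage example, the same two-candidate list scores perfectly at budget 2 but is penalized for unused budget at budget 4, and both values enter the mean.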

Related papers