Fugu-MT Paper Translation (Abstract): Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

Paper summary: Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

arxiv url: http://arxiv.org/abs/2604.26498v1
Date: Wed, 29 Apr 2026 10:01:16 GMT
Status: Translation complete
System update date: 2026-04-30 15:59:36.346864
Title: Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction
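Title (reference translation): Do Large Models Really Win in Drug Discovery? A Benchmark Evaluation of Model Scaling in AI-Driven Molecular Property and Activity Prediction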
Authors: Jinjiang Guo,
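Abstract summary: Compact, specialized models remain effective for molecular property and activity prediction. Performance differences among classical ML, GNN, and pretrained sequence models are often modest and endpoint-dependent. The results suggest that predictive performance depends on the alignment among molecular representation, inductive bias, data regime, endpoint biology, and validation protocol.
Reference score (independently computed attention score): 0.152292571922932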
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid growth of molecular foundation models and general-purpose large language models has encouraged a scale-centric view of artificial intelligence in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models and task-specific graph neural networks (GNNs). We test this assumption on 22 molecular property and activity endpoints, including public ADMET and Tox21 benchmarks and two internal anti-infective activity datasets. Across 167,056 held-out task-molecule evaluations under structure-similarity-separated five-fold cross-validation (37,756 ADMET, 77,946 Tox21, 49,266 anti-TB and 2,088 antimalaria), classical machine-learning (ML) models such as RF(ECFP4) and ExtraTrees(RDKit descriptors) win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning baselines, represented by GPT5.5-SAR and Opus4.7-SAR, do not win under the prespecified primary metrics, although train-fold-derived SAR knowledge provides measurable but uneven gains for SAR reasoning and interpretation. These results indicate that compact, specialized models remain highly effective for molecular property and activity prediction. The performance differences among classical ML, GNN and pretrained sequence models are often modest and endpoint-dependent, whereas larger or more general models do not provide a universal predictive advantage. Large models may still add value for zero-shot reasoning, SAR interpretation and hypothesis generation, but the results suggest that predictive performance depends on the alignment among molecular representation, inductive bias, data regime, endpoint biology and validation protocol.
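Abstract (reference translation): The rapid growth of molecular foundation models and general-purpose large language models has encouraged a scale-centric view of artificial intelligence in drug discovery, in which larger pretrained models are expected to replace compact cheminformatics models and task-specific graph neural networks (GNNs). We test this assumption on 22 molecular property and activity endpoints, including the ADMET and Tox21 benchmarks and two internal anti-infective activity datasets. Across 167,056 task-molecule evaluations under structure-similarity-separated five-fold cross-validation (37,756 ADMET, 77,946 Tox21, 49,266 anti-TB, 2,088 antimalaria), classical machine-learning (ML) models such as RF(ECFP4) and ExtraTrees(RDKit descriptors) win ten primary-metric tasks, GNNs such as GIN and Ligandformer win nine, and pretrained molecular sequence models such as MoLFormer and ChemBERTa2 win three. Rule-based SAR reasoning baselines, represented by GPT5.5-SAR and Opus4.7-SAR, do not win under the prespecified primary metrics. These results indicate that compact, specialized models remain highly effective for molecular property and activity prediction. Performance differences among classical ML, GNN, and pretrained sequence models are often modest and endpoint-dependent. Large models may add value for zero-shot reasoning, SAR interpretation, and hypothesis generation, but the results suggest that predictive performance depends on the alignment among molecular representation, inductive bias, data regime, endpoint biology, and validation protocol.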
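
The abstract mentions "structure-similarity-separated five-fold cross-validation" without describing the procedure. As a rough illustration only, the sketch below shows one common way such a split can be built: molecules are first grouped into structural clusters, and whole clusters are then assigned to folds so that no cluster appears in both training and test data. The cluster labels, the greedy balancing rule, and the function names `grouped_kfold` and `train_test_splits` are assumptions made for this example, not the paper's method.

```python
from collections import defaultdict


def grouped_kfold(cluster_ids, n_folds=5):
    """Assign each molecule to a fold so that no cluster is split across folds.

    cluster_ids[i] is a hypothetical, precomputed structural-cluster label for
    molecule i (for example from fingerprint-similarity clustering). How the
    paper defines its clusters is not stated in the abstract.
    """
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)

    folds = [[] for _ in range(n_folds)]
    # Place the largest clusters first, always into the currently smallest
    # fold, which keeps fold sizes roughly balanced.
    for cid in sorted(members, key=lambda c: (-len(members[c]), str(c))):
        smallest = min(range(n_folds), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[cid])
    return folds


def train_test_splits(folds):
    """Yield (train_indices, test_indices) with each fold held out once."""
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield sorted(train), sorted(test)


# Toy example: 12 molecules in 6 hypothetical clusters.
toy_clusters = ["a", "a", "a", "b", "b", "c", "c", "d", "e", "e", "f", "f"]
toy_folds = grouped_kfold(toy_clusters, n_folds=5)
```

Because whole clusters move together, every held-out molecule is evaluated by a model that never saw a close structural neighbour from the same cluster. That is the property a similarity-separated split is meant to enforce, and it usually gives a harder and more realistic estimate than a random split.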


Related papers