Fugu-MT Paper Translation (Abstract): Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery

Paper overview: Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery

arxiv url: http://arxiv.org/abs/2603.03322v1
Date: Tue, 10 Feb 2026 05:47:22 GMT
Status: Translation complete
Last updated in system: 2026-03-09 01:20:08.150462
Title: Can Large Language Models Derive New Knowledge? A Dynamic Benchmark for Biological Knowledge Discovery
Authors: Chaoqun Yang, Xinyu Lin, Shulin Li, Wenjie Wang, Ruihan Guo, Fuli Feng, Tat-Seng Chua
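Abstract summary: DBench-Bio is a dynamic, fully automated benchmark for evaluating AI's biological knowledge discovery ability. The authors instantiate the DBench-Bio pipeline to construct a monthly-updated benchmark covering 12 biomedical sub-domains. The work provides the first dynamic, automatic framework for assessing the new knowledge discovery capabilities of AI systems.
Reference score (attention score computed independently by the site): 81.03797680309154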
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advancements in Large Language Model (LLM) agents have demonstrated remarkable potential in automatic knowledge discovery. However, rigorously evaluating an AI's capacity for knowledge discovery remains a critical challenge. Existing benchmarks predominantly rely on static datasets, leading to inevitable data contamination where models have likely seen the evaluation knowledge during training. Furthermore, the rapid release cycles of modern LLMs render static benchmarks quickly outdated, failing to assess the ability to discover truly new knowledge. To address these limitations, we propose DBench-Bio, a dynamic and fully automated benchmark designed to evaluate AI's biological knowledge discovery ability. DBench-Bio employs a three-stage pipeline: (1) data acquisition of rigorous, authoritative paper abstracts; (2) QA extraction utilizing LLMs to synthesize scientific hypothesis questions and corresponding discovery answers; and (3) QA filtering to ensure quality based on relevance, clarity, and centrality. We instantiate this pipeline to construct a monthly-updated benchmark covering 12 biomedical sub-domains. Extensive evaluations of SOTA models reveal current limitations in discovering new knowledge. Our work provides the first dynamic, automatic framework for assessing the new knowledge discovery capabilities of AI systems, establishing a living, evolving resource for the AI research community to catalyze the development of knowledge discovery.
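The three-stage pipeline described in the abstract (acquisition, QA extraction, QA filtering) can be pictured roughly as follows. This is only a minimal sketch, not the authors' implementation: the names `QAPair`, `QualityScores`, `run_pipeline`, `toy_extract`, and `toy_score` are hypothetical, the two toy functions stand in for the LLM calls the paper uses, the sample abstracts are made-up placeholders, and requiring every score to meet one threshold is an assumed filtering rule that the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class QAPair:
    question: str   # scientific hypothesis question
    answer: str     # corresponding discovery answer
    source_id: str  # identifier of the abstract the pair came from


@dataclass(frozen=True)
class QualityScores:
    relevance: float
    clarity: float
    centrality: float


def run_pipeline(
    abstracts: Dict[str, str],
    extract_qa: Callable[[str, str], List[QAPair]],
    score_qa: Callable[[QAPair, str], QualityScores],
    threshold: float = 0.5,
) -> List[QAPair]:
    """Stage 1 is the `abstracts` input; stages 2 and 3 run here."""
    kept: List[QAPair] = []
    for source_id, text in abstracts.items():
        # Stage 2: QA extraction (an LLM call in the paper, a callable here).
        for qa in extract_qa(source_id, text):
            # Stage 3: QA filter on relevance, clarity, and centrality.
            # Assumption: a pair survives only if every score meets the threshold.
            s = score_qa(qa, text)
            if min(s.relevance, s.clarity, s.centrality) >= threshold:
                kept.append(qa)
    return kept


def toy_extract(source_id: str, text: str) -> List[QAPair]:
    # Hypothetical stand-in for an LLM: uses the first sentence as the answer.
    first_sentence = text.split(".")[0].strip()
    question = f"What does study {source_id} report?"
    return [QAPair(question=question, answer=first_sentence, source_id=source_id)]


def toy_score(qa: QAPair, text: str) -> QualityScores:
    # Hypothetical stand-in for an LLM judge: very short answers count as unclear.
    clarity = 1.0 if len(qa.answer.split()) >= 4 else 0.0
    return QualityScores(relevance=1.0, clarity=clarity, centrality=1.0)


# Placeholder abstracts, invented purely for illustration.
abstracts = {
    "a1": "Gene X regulates pathway Y in mice. Further work is needed.",
    "a2": "Unclear.",
}
kept = run_pipeline(abstracts, toy_extract, toy_score)
print([qa.source_id for qa in kept])
```

Passing the extractor and scorer in as callables keeps the sketch independent of any particular LLM API, so a real model client could be swapped in without changing the pipeline skeleton.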


Related papers