Fugu-MT Paper Translation (Abstract): BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

Paper overview: BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature

arxiv url: http://arxiv.org/abs/2604.21508v1
Date: Thu, 23 Apr 2026 10:11:56 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.437376
Title: BioMiner: A Multi-modal System for Automated Mining of Protein-Ligand Bioactivity Data from Literature
Authors: Jiaxian Yan, Jintao Zhu, Yuhang Yang, Qi Liu, Kai Zhang, Zaixi Zhang, Xukai Liu, Boyan Zhang, Kaiyuan Gao, Jinchuan Xiao, Enhong Chen
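Abstract summary: We introduce BioMiner, a multi-modal extraction framework for bioactivity data. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical-structure-grounded visual semantic reasoning paradigm. For rigorous evaluation and method development, we construct a benchmark comprising 16,457 bioactivity entries curated from 500 publications.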
Reference score (independently computed attention score): 53.894504720119805
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Protein-ligand bioactivity data published in the literature are essential for drug discovery, yet manual curation struggles to keep pace with the rapidly growing literature. Automated bioactivity extraction remains challenging because it requires not only interpreting biochemical semantics distributed across text, tables, and figures, but also reconstructing chemically exact ligand structures (e.g., Markush structures). To address this bottleneck, we introduce BioMiner, a multi-modal extraction framework that explicitly separates bioactivity semantic interpretation from ligand structure construction. Within BioMiner, bioactivity semantics are inferred through direct reasoning, while chemical structures are resolved via a chemical-structure-grounded visual semantic reasoning paradigm, in which multi-modal large language models operate on chemically grounded visual representations to infer inter-structure relationships, and exact molecular construction is delegated to domain chemistry tools. For rigorous evaluation and method development, we further establish BioVista, a comprehensive benchmark comprising 16,457 bioactivity entries curated from 500 publications. BioMiner validates its extraction ability and provides a quantitative baseline, achieving an F1 score of 0.32 for bioactivity triplets. BioMiner's practical utility is demonstrated via three applications: (1) extracting 82,262 data points from 11,683 papers to build a pre-training database that improves downstream model performance by 3.9%; (2) enabling a human-in-the-loop workflow that doubles the amount of high-quality NLRP3 bioactivity data, contributing to a 38.6% improvement over 28 QSAR models and the identification of 16 hit candidates with novel scaffolds; and (3) accelerating protein-ligand complex bioactivity annotation, achieving a 5.59-fold speed increase and a 5.75% accuracy improvement over manual workflows on the PoseBusters dataset.
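To make the reported triplet-level F1 score concrete, here is a minimal Python sketch of how such a score could be computed. It rests on assumptions not stated in the abstract: that a bioactivity triplet is a (protein, ligand, activity) tuple, and that predictions are compared to the gold annotations by exact match. The function name `triplet_f1` and the toy records are hypothetical and are not taken from BioMiner or BioVista.

```python
from typing import Iterable, Tuple

# Assumed triplet layout: (protein, ligand, activity). The abstract does not
# define the triplet fields, so this layout is illustrative only.
Triplet = Tuple[str, str, str]


def triplet_f1(predicted: Iterable[Triplet], gold: Iterable[Triplet]) -> float:
    """Exact-match F1 between predicted and gold triplets (assumed criterion)."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    if true_pos == 0:
        # No overlap (or an empty input): define F1 as 0 to avoid dividing by zero.
        return 0.0
    precision = true_pos / len(pred_set)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Toy data: placeholder ligand strings and activity values, not real measurements.
gold = [
    ("NLRP3", "CCO", "IC50=10nM"),
    ("NLRP3", "CCN", "IC50=25nM"),
    ("NLRP3", "CCC", "IC50=40nM"),
]
predicted = [
    ("NLRP3", "CCO", "IC50=10nM"),  # matches a gold triplet
    ("NLRP3", "CCN", "IC50=99nM"),  # wrong activity value, so no match
]

score = triplet_f1(predicted, gold)
print(round(score, 2))  # precision 1/2, recall 1/3 -> 0.4
```

Under exact matching, a triplet counts as correct only if all three fields agree, so an error in either the semantic fields or the reconstructed ligand structure makes the whole triplet a miss.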


Related papers