Fugu-MT Paper Translation (Summary): Beyond Manual Curation: Augmenting Targeted Protein Degradation Databases via Agentic Literature Extraction Workflows

Paper summary: Beyond Manual Curation: Augmenting Targeted Protein Degradation Databases via Agentic Literature Extraction Workflows

arxiv url: http://arxiv.org/abs/2605.11221v1
Date: Mon, 11 May 2026 20:33:04 GMT
Status: Translation complete
Last updated in system: 2026-05-13 21:48:56.415359
Title: Beyond Manual Curation: Augmenting Targeted Protein Degradation Databases via Agentic Literature Extraction Workflows
Authors: Yaochen Rao, Farzaneh Jalalypour, N. M. Anoop Krishnan, Rocío Mercado
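Abstract summary: Predictive models in biomedicine depend on structured assay data locked in the text, tables, and supplements of primary publications. This bottleneck is especially acute in targeted protein degradation (TPD), where each assay record must combine compound identity, degradation target, recruiter, assay context, and endpoint values reported across sections, tables, and supplementary files. We formulate TPD database extraction as a domain-specific curation task and present an expert-in-the-loop LLM workflow. We release the workflow, prompts, evaluation code, and extracted datasets as resources for TPD data curation and AI-assisted scientific curation more broadly.
Reference score (independently computed attention level): 4.363171757159274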
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Predictive models in biomedicine depend on structured assay data locked in the text, tables, and supplements of primary publications. This bottleneck is especially acute in targeted protein degradation (TPD), where each assay record must combine compound identity, degradation target, recruiter, assay context, and endpoint values reported across sections, tables, and supplementary files. Inconsistent compound identifiers and incomplete or implicit assay context further demand domain-specific logic that generic LLM pipelines do not provide. Existing molecular glue and PROTAC databases are manually curated and often lack the experimental context required for downstream modeling. We formulate TPD database extraction as a domain-specific curation task and present an expert-in-the-loop LLM workflow, evaluated through a triangular comparison among LLM predictions, standardized baseline records, and expert-annotated ground truth. A lightweight cross-validated prompt-refinement module adapts extraction instructions from scarce expert annotations. With only seven annotated molecular glue publications, the workflow achieved record-level $F_1 = 0.98$ and transferred to PROTACs by terminology substitution alone, maintaining record-level $F_1 > 0.93$. Applied at scale, it expanded molecular glue and PROTAC databases by 81% and 92% records, respectively, with 92% and 82.5% of newly recovered records validated as correct upon expert review. The workflow also recovered kinetic and assay-context information essential for cross-study potency comparison and condition-aware degradation modeling. We release the workflow, prompts, evaluation code, and extracted datasets as resources for TPD data curation and AI-assisted scientific curation more broadly.
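The abstract reports record-level $F_1$ scores but does not spell out how records are matched. As a rough illustration only, the sketch below computes $F_1$ under the assumption that a predicted record counts as correct when it exactly matches an expert-annotated record on every field. The `AssayRecord` fields, the `record_level_f1` helper, and the toy data are all hypothetical and are not taken from the paper's released code.

```python
from typing import Iterable, NamedTuple


class AssayRecord(NamedTuple):
    # Hypothetical fields, loosely following the components named in the abstract.
    compound: str
    target: str
    recruiter: str
    endpoint: str
    value: float


def record_level_f1(predicted: Iterable[AssayRecord],
                    truth: Iterable[AssayRecord]) -> float:
    """F1 over exact-match records (assumed matching rule, not the paper's)."""
    # Sets collapse duplicate records, so each distinct record counts once.
    pred_set, true_set = set(predicted), set(truth)
    true_positives = len(pred_set & true_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


# Toy data with placeholder names; not real compounds or measurements.
truth = [
    AssayRecord("cmpd-1", "TARGET_A", "RECRUITER_X", "DC50_nM", 10.0),
    AssayRecord("cmpd-2", "TARGET_A", "RECRUITER_X", "DC50_nM", 25.0),
    AssayRecord("cmpd-3", "TARGET_B", "RECRUITER_Y", "Dmax_pct", 90.0),
]
predicted = [
    truth[0],
    truth[1],
    # Wrong endpoint value, so this record does not match the third truth record.
    AssayRecord("cmpd-3", "TARGET_B", "RECRUITER_Y", "Dmax_pct", 80.0),
]

score = record_level_f1(predicted, truth)
```

With two of three predicted records correct and two of three true records recovered, precision and recall are both 2/3, so `score` is 2/3. A practical evaluation would likely need looser matching (for example unit normalization or numeric tolerance), which this sketch deliberately leaves out.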

List of related papers