Fugu-MT Paper Translation (Abstract): Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

Paper summary: Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

arxiv url: http://arxiv.org/abs/2605.07022v1
Date: Thu, 07 May 2026 23:08:18 GMT
Status: Translation complete
Last updated in system: 2026-05-11 19:43:38.66519
Title: Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
Title (reference translation): Self-Driving Datasets: From 20 Million Papers to Large-Scale Biomedical Knowledge
Authors: Haydn Jones, Yimeng Zeng, Alden Rose, Li S. Yifei, Yining Huang, Kaiwen Wu, Jiaming Liang, Maggie Ziyu Huan, Yoseph Barash, Cesar de la Fuente-Nunez, Osbert Bastani, Zachary Ives, Mark Yatskar, Jacob R. Gardner
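Abstract summary: We show that PubMed can be autonomously and cost-effectively converted into structured datasets that are larger, more nuanced, and more accurate. The paper presents three contributions: (1) an entity-tagging pipeline grounded in biomedical ontologies, (2) hybrid retrieval supporting entity-filtered queries over the tagged corpus, and (3) Starling, a deep research system that is given only a natural-language task description.
Reference score (independently computed attention score): 34.468123235616524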
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and emits structured records with nuance-rich fields and supporting passages. Across six tasks -- blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions -- Starling produces ~6.3M records (91K-3M per task); several are, to our knowledge, the largest public datasets for their property. Frontier-model rejection of our extractions is 0.6-7.7% across tasks, far below error rates we measure on widely used curated counterparts (e.g., 16.5% on BBB_Martins, 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance tabular databases discard -- e.g., oral bioavailability may depend on fed vs. fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.
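Abstract (reference translation): Manually curated biomedical repositories (bioactivity, genomics, chemistry) are expensive to maintain, lag behind the primary literature, and discard experimental context, obscuring the nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively converted into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. The contributions are (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered queries over the tagged corpus; and (3) a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted retrieval filters, induces an extraction schema, and produces structured data. Across six tasks (blood-brain barrier permeability, oral bioavailability, acute toxicity (LD50), gene-disease associations, protein subcellular localization, and chemical reactions), Starling produces roughly 6.3M records (91K-3M per task). Frontier-model rejection of the extractions is 0.6-7.7% across tasks, far below the error rates of widely used curated counterparts (e.g., 16.5% on BBB_Martins and 7.3% on Bioavailability_Ma). Beyond scale and accuracy, the supporting passages carry nuance that tabular databases discard; for example, oral bioavailability may depend on fed versus fasted state. Together, the corpus, retrieval, and agent establish a foundation for AI-driven therapeutic design. Code and datasets: https://github.com/starling-labs/starling.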
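
The abstract's second contribution, hybrid sparse-dense retrieval with entity filtering, can be illustrated with a toy sketch. This is not the authors' implementation: the corpus, the entity tag strings, the scoring functions, and the use of reciprocal rank fusion are all assumptions made for illustration. The abstract does not say how the sparse and dense signals are combined, and a real system would use a lexical index and learned embeddings in place of the stand-ins below.

```python
import math
from collections import Counter

# Toy corpus. The passage texts and entity tags are hypothetical examples,
# not records from the paper's tagged PubMed corpus.
CORPUS = [
    {"id": "p1", "text": "caffeine crosses the blood brain barrier rapidly in mice",
     "entities": {"chemical:caffeine", "anatomy:blood-brain barrier"}},
    {"id": "p2", "text": "oral bioavailability of caffeine in fasted volunteers",
     "entities": {"chemical:caffeine"}},
    {"id": "p3", "text": "dopamine does not cross the blood brain barrier",
     "entities": {"chemical:dopamine", "anatomy:blood-brain barrier"}},
]

def tokenize(text):
    return text.lower().split()

def sparse_scores(query, docs):
    """TF-IDF-style term overlap, standing in for a sparse (lexical) retriever."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(tokenize(d["text"])))
    scores = {}
    for d in docs:
        tf = Counter(tokenize(d["text"]))
        scores[d["id"]] = sum(
            tf[t] * math.log(1 + n / df[t]) for t in tokenize(query) if t in tf
        )
    return scores

def embed(text):
    """Character-trigram counts, standing in for a learned dense embedding."""
    s = f" {text.lower()} "
    return Counter(s[i:i + 3] for i in range(len(s) - 2))

def cosine(a, b):
    dot = sum(v * b[k] for k, v in a.items() if k in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) over the input rankings."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return fused

def hybrid_search(query, required_entities, corpus=CORPUS, top_k=5):
    # 1. Entity filter: keep only passages tagged with every required entity.
    docs = [d for d in corpus if required_entities <= d["entities"]]
    # 2. Score the surviving passages with both retrievers.
    sparse = sparse_scores(query, docs)
    q_vec = embed(query)
    dense = {d["id"]: cosine(q_vec, embed(d["text"])) for d in docs}
    # 3. Fuse the two rankings; ties are broken by passage id for determinism.
    by_sparse = sorted(sparse, key=sparse.get, reverse=True)
    by_dense = sorted(dense, key=dense.get, reverse=True)
    fused = rrf([by_sparse, by_dense])
    return sorted(fused, key=lambda i: (-fused[i], i))[:top_k]

if __name__ == "__main__":
    print(hybrid_search("caffeine blood brain barrier permeability",
                        {"anatomy:blood-brain barrier"}))
```

Filtering on entity tags before scoring means a passage that only mentions a query term in passing (here `p2`, which lacks the blood-brain barrier tag) never reaches the ranking stage, which is one plausible reading of "entity-filtered semantic queries".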
