Fugu-MT Paper Translation (Abstract): Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases

Paper overview: Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases

arxiv url: http://arxiv.org/abs/2603.22767v1
Date: Tue, 24 Mar 2026 03:50:34 GMT
Status: Translation complete
Last updated in system: 2026-03-25 19:53:37.290484
Title: Can LLM Agents Generate Real-World Evidence? Evaluating Observational Studies in Medical Databases
Title (reference translation): Can LLM agents generate real-world evidence? An evaluation of observational studies in medical databases
Authors: Dubai Li, Yuxiang He, Yan Hu, Yu Tian, Jingsong Li
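Abstract summary: We introduce RWE-bench, a benchmark grounded in MIMIC-IV and derived from peer-reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, and agents must execute experiments in a real database. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open-source model reaches 30.4%.
Reference score (independently computed attention score): 17.35673829214932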
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Observational studies can yield clinically actionable evidence at scale, but executing them on real-world databases is open-ended and requires coherent decisions across cohort construction, analysis, and reporting. Prior evaluations of LLM agents emphasize isolated steps or single answers, missing the integrity and internal structure of the resulting evidence bundle. To address this gap, we introduce RWE-bench, a benchmark grounded in MIMIC-IV and derived from peer-reviewed observational studies. Each task provides the corresponding study protocol as the reference standard, requiring agents to execute experiments in a real database and iteratively generate tree-structured evidence bundles. We evaluate six LLMs (three open-source, three closed-source) under three agent scaffolds using both question-level correctness and end-to-end task metrics. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open-source model reaches 30.4%. Agent scaffolds also matter substantially, causing over 30% variation in performance metrics. Furthermore, we implement an automated cohort evaluation method to rapidly localize errors and identify agent failure modes. Overall, the results highlight persistent limitations in agents' ability to produce end-to-end evidence bundles, and efficient validation remains an important direction for future work. Code and data are available at https://github.com/somewordstoolate/RWE-bench.
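Abstract (reference translation): Observational studies can produce clinically actionable evidence at scale, but running them on real-world databases is open-ended and demands coherent decisions spanning cohort construction, analysis, and reporting. Earlier evaluations of LLM agents focus on isolated steps or single answers and overlook the integrity and internal structure of the resulting evidence bundle. To close this gap, we introduce RWE-bench, a benchmark grounded in MIMIC-IV and derived from peer-reviewed observational studies. Each task supplies the corresponding study protocol as the reference standard, and agents must run experiments in a real database and iteratively generate tree-structured evidence bundles. We evaluate six LLMs (three open-source, three closed-source) under three agent scaffolds, using both question-level correctness and end-to-end task metrics. Across 162 tasks, task success is low: the best agent reaches 39.9%, and the best open-source model reaches 30.4%. Agent scaffolds also matter substantially, causing more than 30% variation in performance metrics. In addition, we implement an automated cohort evaluation method that quickly localizes errors and identifies agent failure modes. Overall, the results highlight persistent limitations in agents' ability to produce end-to-end evidence bundles, and efficient validation remains an important direction for future work. Code and data are available at https://github.com/somewordstoolate/RWE-bench.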

Related papers