Fugu-MT Paper Translation (Abstract): RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

Paper overview: RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation

arxiv url: http://arxiv.org/abs/2605.13542v1
Date: Wed, 13 May 2026 13:52:42 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:28.085541
Title: RealICU: Do LLM Agents Understand Long-Context ICU Data? A Benchmark Beyond Behavior Imitation
Title (reference translation): RealICU: Do LLM Agents Understand Long-Term ICU Data?
Authors: Chengzhi Shen, Weixiang Shen, Tobias Susetzky, Chen, Chen, Jun Li, Yuyuan Liu, Xuepeng Zhang, Zhenyu Gong, Daniel Rueckert, Jiazhen Pan
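Abstract summary: RealICU is a hindsight-annotated benchmark for evaluating large language models under realistic ICU conditions. It releases two datasets: RealICU-Gold, with 930 window annotations from 94 MIMIC-IV patients, and RealICU-Scale, with 11,862 windows extended by Oracle.
Reference score (independently computed attention score): 46.82418087865201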
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Intensive care units (ICU) generate long, dense and evolving streams of clinical information, where physicians must repeatedly reassess patient states under time pressure, underscoring a clear need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are made under incomplete information and limited temporal context of the underlying patient state, and may therefore be suboptimal, making it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions, where labels are created after senior physicians review the full patient trajectory. We formulate four physician-motivated tasks: assess Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory with 30-min windows and release two datasets: RealICU-Gold with 930-window annotations from 94 MIMIC-IV patients, and RealICU-Scale with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs including memory-augmented ones performed poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations, and an anchoring bias to early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents that improve long-horizon reasoning but do not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision-support in high-stakes care. Project page: https://chengzhi-leo.github.io/RealICU-Bench/
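Abstract (reference translation): Intensive care units (ICUs) generate long, dense, and evolving streams of clinical information, and physicians must repeatedly reassess patient states under time pressure, which clearly shows the need for reliable AI decision support. Existing ICU benchmarks typically treat historical clinician actions as ground truth. However, these actions are taken under incomplete information and limited temporal context about the underlying patient state and may therefore be suboptimal, which makes it difficult to assess the true reasoning capabilities of AI systems. We introduce RealICU, a hindsight-annotated benchmark for evaluating large language models (LLMs) under realistic ICU conditions. We formulate four physician-motivated tasks: Patient Status, Acute Problems, Recommended Actions, and Red Flag actions that risk unsafe outcomes. We partition each trajectory into 30-minute windows and release two datasets: RealICU-Gold, with 930 window annotations from 94 MIMIC-IV patients, and RealICU-Scale, with 11,862 windows extended by Oracle, a physician-validated LLM hindsight labeler. Existing LLMs performed poorly on RealICU, exposing two failure modes: a recall-safety tradeoff for clinical recommendations and an anchoring bias toward early interpretations of the patient. We further introduce ICU-Evo to study structured-memory agents, which improve long-horizon reasoning but do not fully eliminate safety failures. Together, RealICU provides a clinically grounded testbed for measuring and improving AI sequential decision support in high-stakes care. Project page: https://chengzhi-leo.github.io/RealICU-Bench/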

Related papers