Fugu-MT Paper Translation (Abstract): Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

Paper summary: Generalist Foundation Models Are Not Clinical Enough for Hospital Operations

arxiv url: http://arxiv.org/abs/2511.13703v1
Date: Mon, 17 Nov 2025 18:52:22 GMT
Status: translation complete
Last updated in system: 2025-11-18 18:52:09.689654
Title: Generalist Foundation Models Are Not Clinical Enough for Hospital Operations
Authors: Lavender Y. Jiang, Angelica Chen, Xu Han, Xujin Chris Liu, Radhika Dua, Kevin Eaton, Frederick Wolff, Robert Steele, Jeff Zhang, Anton Alyakin, Qingkai Pan, Yanbing Chen, Karl L. Sangwon, Daniel A. Alber, Jaden Stryker, Jin Vivian Lee, Yindalon Aphinyanaphongs, Kyunghyun Cho, Eric Karl Oermann,
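Abstract summary: We introduce Lang1, a family of models pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively.
Reference score (independently computed attention score): 29.539795338917983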
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Hospitals and healthcare systems rely on operational decisions that determine patient flow, cost, and quality of care. Despite strong performance on medical knowledge and conversational benchmarks, foundation models trained on general text may lack the specialized knowledge required for these operational decisions. We introduce Lang1, a family of models (100M-7B parameters) pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet. To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes that evaluates five critical tasks: 30-day readmission prediction, 30-day mortality prediction, length of stay, comorbidity coding, and predicting insurance claims denial. In zero-shot settings, both general-purpose and specialized models underperform on four of five tasks (36.6%-71.7% AUROC), with mortality prediction being an exception. After finetuning, Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively. We also observed cross-task scaling with joint finetuning on multiple tasks leading to improvement on other tasks. Lang1-1B effectively transfers to out-of-distribution settings, including other clinical tasks and an external health system. Our findings suggest that predictive capabilities for hospital operations require explicit supervised finetuning, and that this finetuning process is made more efficient by in-domain pretraining on EHR. Our findings support the emerging view that specialized LLMs can compete with generalist models in specialized tasks, and show that effective healthcare systems AI requires the combination of in-domain pretraining, supervised finetuning, and real-world evaluation beyond proxy benchmarks.
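The abstract reports its results as AUROC. As a reminder of what that metric measures, here is a minimal sketch in Python that computes AUROC as the probability that a randomly chosen positive example is scored above a randomly chosen negative one, with ties counted as half. The function name `auroc` and the toy labels and scores are illustrative assumptions; they are not taken from the paper or from ReMedE, and this is not the authors' evaluation code.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of AUROC.

    labels: iterable of 0/1 ground-truth outcomes (1 = positive class).
    scores: iterable of model scores, higher meaning "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Made-up toy data for illustration only (hypothetical, not from the paper).
toy_labels = [0, 0, 1, 1, 0, 1]
toy_scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.70]
print(auroc(toy_labels, toy_scores))
```

An AUROC of 0.5 corresponds to random ranking, which is why zero-shot values in the 36.6%-71.7% range indicate weak discrimination on those tasks.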


Related papers