Fugu-MT paper translation (abstract): Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries

Paper overview: Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries

arxiv url: http://arxiv.org/abs/2606.05970v1
Date: Thu, 04 Jun 2026 10:14:12 GMT
Status: translation complete
Last updated in system: 2026-06-05 22:39:44.716819
Title: Measuring the sensitivity of LLM-based structured extraction to prompt, model, and schema choices in clinical discharge summaries
Authors: Martin Murin
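Abstract summary: Large language models are increasingly used for structured extraction from clinical free-text notes. This work measures the sensitivity of their output to upstream configuration choices without human-annotated ground truth, by holding the extraction task fixed and varying one choice at a time. Cross-prompt agreement was measured by Cohen's kappa on ICD-stratified subsets.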
Reference score (attention score computed independently by this site): 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models are increasingly used for structured extraction from clinical free-text notes, but the sensitivity of their output to upstream configuration choices is less understood than their accuracy on fixed benchmarks. This work measures that sensitivity without human-annotated ground truth, by holding the extraction task fixed and varying one choice at a time. The fixed schema comprises 17 clinical documentation flags on a three-way yes/no/not_documented value set and a 47-tag vocabulary for the primary admission reason. Three prompt variants expressing this schema were each run at two model sizes on MIMIC-IV v3.1 discharge summaries. Cross-prompt agreement was measured by Cohen's kappa on ICD-stratified subsets. A paired same-note comparison isolated the effect of model choice, and a post-hoc collapse of the three-way flags to binary tested the schema's contribution to disagreement. On the three-way flags, the two models reach the same pooled cross-prompt agreement (median kappa 0.69 and 0.68); the larger model raises agreement on some fields and lowers it on others, a redistribution rather than the absence of an effect. Collapsing the schema to binary dissolves most of the cross-prompt disagreement, locating it on the absence-versus-silence distinction rather than on whether the finding is present. On the multi-class admission categorization, changing the model reassigns the dominant tag on close to half of all notes while changing the prompt phrasing reassigns it on roughly one in eight, and the larger model places far less mass on residual catch-all categories (44% to 26%). These patterns indicate a schema-imposed source of disagreement concentrated on the absence-versus-silence axis and a dominance of model over prompt phrasing on multi-class categorization, identified by a reusable methodology for auditing extraction reproducibility on a population-scale deployment.
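
As a rough illustration of the agreement measurement described in the abstract, the sketch below computes Cohen's kappa between two prompt variants on a single three-way flag, then again after collapsing the flag to binary. It is not the author's code. The label lists are invented toy data, the function names are hypothetical, and the collapse rule (merging "no" and "not_documented" into one class) is an assumption inferred from the abstract's absence-versus-silence wording.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of notes where both runs give the same label.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each run's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs used the same single label on every note
    return (observed - expected) / (1.0 - expected)


def collapse_to_binary(labels):
    """Assumed collapse: keep 'yes', merge 'no' and 'not_documented'."""
    return ["yes" if x == "yes" else "not_yes" for x in labels]


# Toy data (invented): one flag, eight notes, two prompt variants.
prompt_a = ["yes", "no", "not_documented", "no",
            "yes", "not_documented", "no", "yes"]
prompt_b = ["yes", "not_documented", "not_documented", "no",
            "yes", "no", "not_documented", "yes"]

kappa_three_way = cohen_kappa(prompt_a, prompt_b)
kappa_binary = cohen_kappa(collapse_to_binary(prompt_a),
                           collapse_to_binary(prompt_b))

print(f"three-way kappa: {kappa_three_way:.3f}")
print(f"binary kappa:    {kappa_binary:.3f}")
```

In this toy data every disagreement is between "no" and "not_documented", so the binary kappa is higher than the three-way kappa. That is the kind of pattern the abstract attributes to the schema rather than to disagreement about whether a finding is present.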
