Fugu-MT paper translation (abstract): How to Train Private Clinical Language Models: A Comparative Study of Privacy-Preserving Pipelines for ICD-9 Coding

Paper overview: How to Train Private Clinical Language Models: A Comparative Study of Privacy-Preserving Pipelines for ICD-9 Coding

arxiv url: http://arxiv.org/abs/2511.14936v1
Date: Tue, 18 Nov 2025 21:51:04 GMT
Status: translation complete
Last updated in system: 2025-11-20 15:51:28.542291
Title: How to Train Private Clinical Language Models: A Comparative Study of Privacy-Preserving Pipelines for ICD-9 Coding
Authors: Mathieu Dufour, Andrew Duncan
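Abstract summary: Large language models trained on clinical text risk exposing sensitive patient information. Despite rapid progress in DP optimisation, it remains unclear which privacy-preserving strategy is effective. Knowledge distillation from DP-trained teachers outperforms both DP-SGD and DP-synthetic data training.
Reference score (independently computed attention score): 0.33148826359547523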
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models trained on clinical text risk exposing sensitive patient information, yet differential privacy (DP) methods often severely degrade the diagnostic accuracy needed for deployment. Despite rapid progress in DP optimisation and text generation, it remains unclear which privacy-preserving strategy actually works best for clinical language tasks. We present the first systematic head-to-head comparison of four training pipelines for automated diagnostic coding from hospital discharge summaries. All pipelines use identical 1B-parameter models and matched privacy budgets to predict ICD-9 codes. At moderate and relaxed privacy budgets ($\varepsilon \in \{4, 6\}$), knowledge distillation from DP-trained teachers outperforms both direct DP-SGD and DP-synthetic data training, recovering up to 63\% of the non-private performance whilst maintaining strong empirical privacy (membership-inference AUC $\approx$ 0.5). These findings expose large differences in the privacy-utility trade-off across architectures and identify knowledge distillation as the most practical route to privacy-preserving clinical NLP.
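The abstract reports a membership-inference AUC of about 0.5 as evidence of strong empirical privacy. As a rough illustration of why 0.5 is the target, the sketch below computes AUC as the probability that a randomly chosen training member receives a higher attack score than a randomly chosen non-member. The function name and all scores are hypothetical, made up for illustration; they are not the paper's attack, data, or results.

```python
def membership_inference_auc(member_scores, non_member_scores):
    """Pairwise (Mann-Whitney) estimate of the attack's ROC AUC.

    Counts how often a training member gets a higher attack score than a
    non-member; ties count as half a win.
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


# Hypothetical attack scores (e.g. negative loss on each record).
# A leaky model: members are always scored higher than non-members.
leaky_members = [0.9, 0.8, 0.7]
leaky_non_members = [0.3, 0.2, 0.1]

# A well-protected model: member and non-member scores are indistinguishable.
private_members = [0.2, 0.8, 0.5]
private_non_members = [0.8, 0.2, 0.5]

print(membership_inference_auc(leaky_members, leaky_non_members))      # 1.0
print(membership_inference_auc(private_members, private_non_members))  # 0.5
```

An AUC of 1.0 means the attacker separates members from non-members perfectly, while 0.5 means the attack does no better than random guessing.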

