Fugu-MT Paper Translation (Summary): Why Chain of Thought Fails in Clinical Text Understanding

Paper summary: Why Chain of Thought Fails in Clinical Text Understanding

arxiv url: http://arxiv.org/abs/2509.21933v1
Date: Fri, 26 Sep 2025 06:18:15 GMT
Status: Translation complete
Last updated in system: 2025-09-29 20:57:54.229002
Title: Why Chain of Thought Fails in Clinical Text Understanding
Authors: Jiageng Wu, Kevin Xie, Bowen Gu, Nils Krüger, Kueiyu Joshua Lin, Jie Yang
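Abstract summary: Chain-of-thought (CoT) prompting elicits step-by-step reasoning. Large language models (LLMs) are increasingly being applied to clinical care. This work presents the first large-scale systematic study of CoT for clinical text understanding.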
Reference score (independently computed attention score): 11.895158827781017
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly being applied to clinical care, a domain where both accuracy and transparent reasoning are critical for safe and trustworthy deployment. Chain-of-thought (CoT) prompting, which elicits step-by-step reasoning, has demonstrated improvements in performance and interpretability across a wide range of tasks. However, its effectiveness in clinical contexts remains largely unexplored, particularly in the context of electronic health records (EHRs), the primary source of clinical documentation, which are often lengthy, fragmented, and noisy. In this work, we present the first large-scale systematic study of CoT for clinical text understanding. We assess 95 advanced LLMs on 87 real-world clinical text tasks, covering 9 languages and 8 task types. Contrary to prior findings in other domains, we observe that 86.3% of models suffer consistent performance degradation in the CoT setting. More capable models remain relatively robust, while weaker ones suffer substantial declines. To better characterize these effects, we perform fine-grained analyses of reasoning length, medical concept alignment, and error profiles, leveraging both LLM-as-a-judge evaluation and clinical expert evaluation. Our results uncover systematic patterns in when and why CoT fails in clinical contexts, which highlight a critical paradox: CoT enhances interpretability but may undermine reliability in clinical text tasks. This work provides an empirical basis for clinical reasoning strategies of LLMs, highlighting the need for transparent and trustworthy approaches.

Related papers