Fugu-MT paper translation (abstract): Clinician input steers frontier AI models toward both accurate and harmful decisions

Paper abstract: Clinician input steers frontier AI models toward both accurate and harmful decisions

arxiv url: http://arxiv.org/abs/2603.14158v1
Date: Sat, 14 Mar 2026 23:47:53 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.641763
Title: Clinician input steers frontier AI models toward both accurate and harmful decisions
Title (reference translation): Clinician input steers frontier AI models toward both accurate and harmful decisions
Authors: Ivan Lopez, Selin S. Everett, Bryan J. Bunning, April S. Liang, Dong Han Yao, Shivam C. Vedak, Kameron C. Black, Sophie Ostmeier, Stephen P. Ma, Emily Alsentzer, Jonathan H. Chen, Akshay S. Chaudhari, Eric Horvitz
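Abstract summary: The authors evaluated 21 reasoning LLM variants across 8 frontier models on differential diagnosis generation and next step recommendations. Expert clinician context significantly improved correct final diagnosis inclusion across all 21 models. In GPT-4o experiments, explicit clinician uncertainty signals improved diagnostic performance after adversarial context.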
Reference score (independently computed attention score): 10.599240857217811
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are entering clinician workflows, yet evaluations rarely measure how clinician reasoning shapes model behavior during clinical interactions. We combined 61 New England Journal of Medicine Case Records with 92 real-world clinician-AI interactions to evaluate 21 reasoning LLM variants across 8 frontier models on differential diagnosis generation and next step recommendations under three conditions: reasoning alone, after expert clinician context, and after adversarial clinician context. LLM-clinician concordance increased substantially after clinician exposure, with simulations sharing >=3 differential diagnosis items rising from 65.8% to 93.5% and >=3 next step recommendations from 20.3% to 53.8%. Expert context significantly improved correct final diagnosis inclusion across all 21 models (mean +20.4 percentage points), reflecting both reasoning improvement and passive content echoing, while adversarial context caused significant diagnostic degradation in 14 models (mean -5.4 percentage points). Multi-turn disagreement probes revealed distinct model phenotypes ranging from highly conformist to dogmatic, with adversarial arguments remaining a persistent vulnerability even for otherwise resilient models. Inference-time scaling reduced harmful echoing of clinician-introduced recommendations across WHO-defined harm severity tiers (relative reductions: 62.7% mild, 57.9% moderate, 76.3% severe, 83.5% death-tier). In GPT-4o experiments, explicit clinician uncertainty signals improved diagnostic performance after adversarial context (final diagnosis inclusion 27% to 42%) and reduced alignment with incorrect arguments by 21%. These findings establish a foundation for evaluating clinician-AI collaboration, introducing interactive metrics and mitigation strategies essential for safety and robustness.
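Abstract (reference translation): Large language models (LLMs) are being introduced into clinician workflows, yet evaluations rarely measure how clinician reasoning shapes model behavior during clinical interactions. Combining 61 New England Journal of Medicine Case Records with 92 real-world clinician-AI interactions, the authors evaluated 21 reasoning LLM variants across 8 frontier models on differential diagnosis generation and next step recommendations under three conditions: reasoning alone, after expert clinician context, and after adversarial clinician context. LLM-clinician concordance rose substantially after clinician exposure: simulations sharing >=3 differential diagnosis items increased from 65.8% to 93.5%, and those sharing >=3 next step recommendations increased from 20.3% to 53.8%. Expert context significantly improved correct final diagnosis inclusion in all 21 models (mean +20.4 percentage points), reflecting both improved reasoning and passive content echoing, whereas adversarial context caused significant diagnostic degradation in 14 models (mean -5.4 percentage points). Multi-turn disagreement probes revealed distinct model phenotypes ranging from highly conformist to dogmatic, and adversarial arguments remained a persistent vulnerability even for otherwise resilient models. Inference-time scaling reduced harmful echoing of clinician-introduced recommendations across WHO-defined harm severity tiers (relative reductions: 62.7% mild, 57.9% moderate, 76.3% severe, 83.5% death-tier). In GPT-4o experiments, explicit clinician uncertainty signals improved diagnostic performance after adversarial context (final diagnosis inclusion 27% to 42%) and reduced alignment with incorrect arguments by 21%. These findings establish a foundation for evaluating clinician-AI collaboration and introduce interactive metrics and mitigation strategies essential for safety and robustness.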
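Note on the figures: the abstract reports some effects in percentage points (absolute differences between two percentages) and others as relative reductions. As a minimal sketch, the two measures can be computed from the before/after pairs quoted in the abstract. The helper names below are hypothetical and are not taken from the paper. The harm-tier relative reductions cannot be recomputed this way, because the abstract gives no before/after values for them.

```python
def percentage_point_change(before_pct: float, after_pct: float) -> float:
    """Absolute change between two percentages, in percentage points."""
    return after_pct - before_pct


def relative_change(before_pct: float, after_pct: float) -> float:
    """Change relative to the starting value, expressed as a percentage of it."""
    return (after_pct - before_pct) / before_pct * 100.0


# Before/after pairs quoted in the abstract (values are percentages).
# The labels are shorthand chosen for this sketch.
reported = {
    ">=3 shared differential diagnosis items": (65.8, 93.5),
    ">=3 shared next step recommendations": (20.3, 53.8),
    "GPT-4o final diagnosis inclusion, adversarial context": (27.0, 42.0),
}

for label, (before, after) in reported.items():
    pp = percentage_point_change(before, after)
    rel = relative_change(before, after)
    print(f"{label}: {pp:+.1f} pp, {rel:+.1f}% relative")
```

The same pair of numbers can look very different depending on the measure, so it helps to check which one a given figure in the abstract uses before comparing effects.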

Related papers