Fugu-MT Paper Translation (Abstract): Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

Paper overview: Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models

arxiv url: http://arxiv.org/abs/2510.10278v1
Date: Sat, 11 Oct 2025 16:24:35 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:29.872335
Title: Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models
Title (reference translation): Simulating Viva Voce Examinations for Clinical Reasoning in Large Language Models
Authors: Christopher Chiu, Silviu Pitis, Mihaela van der Schaar
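Abstract summary: We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs). The dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training. Our analysis identified several failure modes that mirror cognitive errors in clinical practice.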
Reference score (independently computed attention score): 51.91760712805404
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Clinical reasoning in medicine is a hypothesis-driven process where physicians refine diagnoses from limited information through targeted history, physical examination, and diagnostic investigations. In contrast, current medical benchmarks for large language models (LLMs) primarily assess knowledge recall through single-turn questions, where complete clinical information is provided upfront. To address this gap, we introduce VivaBench, a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents. Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a viva voce (oral) examination in medical training, requiring agents to actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. While current LLMs demonstrate competence in diagnosing conditions from well-described clinical presentations, their performance degrades significantly when required to navigate iterative diagnostic reasoning under uncertainty in our evaluation. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice, including: (1) fixation on initial hypotheses, (2) inappropriate investigation ordering, (3) premature diagnostic closure, and (4) failing to screen for critical conditions. These patterns reveal fundamental limitations in how current LLMs reason and make decisions under uncertainty. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems for real-world clinical decision support. Beyond medical applications, we contribute to the larger corpus of research on agentic AI by demonstrating how sequential reasoning trajectories can diverge in complex decision-making environments.
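Abstract (reference translation): Clinical reasoning in medicine is a hypothesis-driven process in which physicians refine a diagnosis from limited information through targeted history-taking, physical examination, and diagnostic investigations. By contrast, current medical benchmarks for large language models (LLMs) mainly assess knowledge recall through single-turn questions in which the complete clinical information is given upfront. To address this gap, we introduce VivaBench, a multi-turn benchmark that evaluates sequential clinical reasoning in LLM agents. The dataset consists of 1762 physician-curated clinical vignettes, structured as interactive scenarios that simulate a viva voce (oral) examination in medical training; agents must actively probe for relevant findings, select appropriate investigations, and synthesize information across multiple steps to reach a diagnosis. Current LLMs are competent at diagnosing conditions from well-described clinical presentations, but in our evaluation their performance degrades significantly when they must carry out iterative diagnostic reasoning under uncertainty. Our analysis identified several failure modes that mirror common cognitive errors in clinical practice: (1) fixation on initial hypotheses, (2) inappropriate investigation ordering, (3) premature diagnostic closure, and (4) failure to screen for critical conditions. These patterns reveal fundamental limitations in how current LLMs reason and make decisions under uncertainty. Through VivaBench, we provide a standardized benchmark for evaluating conversational medical AI systems intended for real-world clinical decision support. Beyond medical applications, we contribute to the broader body of research on agentic AI by showing how sequential reasoning trajectories can diverge in complex decision-making environments.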

Related papers