Fugu-MT Paper Translation (Abstract): Radiology's Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology

Paper overview: Radiology's Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology

arxiv url: http://arxiv.org/abs/2509.25559v1
Date: Mon, 29 Sep 2025 22:31:20 GMT
Status: Translation complete
Last updated in system: 2025-10-01 17:09:04.352604
Title: Radiology's Last Exam (RadLE): Benchmarking Frontier Multimodal AI Against Human Experts and a Taxonomy of Visual Reasoning Errors in Radiology
Authors: Suvrankar Datta, Divya Buchireddygari, Lakshmi Vennela Chowdary Kaza, Mrudula Bhalke, Kautik Singh, Ayush Pandey, Sonit Sai Vasipalli, Upasana Karnwal, Hakikat Bir Singh Bhatti, Bhavya Ratan Maroo, Sanjana Hebbar, Rahul Joseph, Gurkawal Kaur, Devyani Singh, Akhil V, Dheeksha Devasya Shama Prasad, Nishtha Mahajan, Ayinaparthi Arisha, Rajesh Vanagundi, Reet Nandy, Kartik Vuthoo, Snigdhaa Rajvanshi, Nikhileswar Kondaveeti, Suyash Gunjal, Rishabh Jain, Rajat Jain, Anurag Agrawal
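Abstract summary: Generalist multimodal AI systems such as large language models (LLMs) and vision language models (VLMs) are increasingly accessed by clinicians and patients alike. We developed a benchmark of 50 expert-level "spot diagnosis" cases across multiple imaging modalities. We evaluated the performance of frontier AI models against board-certified radiologists and radiology trainees.
Reference score (independently computed attention score): 2.626353375402704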
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Generalist multimodal AI systems such as large language models (LLMs) and vision language models (VLMs) are increasingly accessed by clinicians and patients alike for medical image interpretation through widely available consumer-facing chatbots. Most evaluations claiming expert level performance are on public datasets containing common pathologies. Rigorous evaluation of frontier models on difficult diagnostic cases remains limited. We developed a pilot benchmark of 50 expert-level "spot diagnosis" cases across multiple imaging modalities to evaluate the performance of frontier AI models against board-certified radiologists and radiology trainees. To mirror real-world usage, the reasoning modes of five popular frontier AI models were tested through their native web interfaces, viz. OpenAI o3, OpenAI GPT-5, Gemini 2.5 Pro, Grok-4, and Claude Opus 4.1. Accuracy was scored by blinded experts, and reproducibility was assessed across three independent runs. GPT-5 was additionally evaluated across various reasoning modes. Reasoning quality errors were assessed and a taxonomy of visual reasoning errors was defined. Board-certified radiologists achieved the highest diagnostic accuracy (83%), outperforming trainees (45%) and all AI models (best performance shown by GPT-5: 30%). Reliability was substantial for GPT-5 and o3, moderate for Gemini 2.5 Pro and Grok-4, and poor for Claude Opus 4.1. These findings demonstrate that advanced frontier models fall far short of radiologists in challenging diagnostic cases. Our benchmark highlights the present limitations of generalist AI in medical imaging and cautions against unsupervised clinical use. We also provide a qualitative analysis of reasoning traces and propose a practical taxonomy of visual reasoning errors by AI models for better understanding their failure modes, informing evaluation standards and guiding more robust model development.


Related papers