Fugu-MT Paper Translation (Abstract): Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons

Paper summary: Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons

arxiv url: http://arxiv.org/abs/2505.23477v1
Date: Thu, 29 May 2025 14:27:14 GMT
Status: Translation complete
Last updated in system: 2025-05-30 18:14:07.900048
Title: Evaluating the performance and fragility of large language models on the self-assessment for neurological surgeons
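Title (reference translation): Evaluating the performance and fragility of large language models in the self-assessment for neurological surgeons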
Authors: Krithik Vishwanath, Anton Alyakin, Mrigayu Ghosh, Jin Vivian Lee, Daniel Alexander Alber, Karl L. Sangwon, Douglas Kondziolka, Eric Karl Oermann
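Abstract summary: The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS.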
Reference score (independently computed attention score): 0.7587293779231332
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Congress of Neurological Surgeons Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgical residents to prepare for written board examinations. Recently, these questions have also served as benchmarks for evaluating large language models' (LLMs) neurosurgical knowledge. This study aims to assess the performance of state-of-the-art LLMs on neurosurgery board-like questions and to evaluate their robustness to the inclusion of distractor statements. A comprehensive evaluation was conducted using 28 large language models. These models were tested on 2,904 neurosurgery board examination questions derived from the CNS-SANS. Additionally, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words with clinical meanings used in non-clinical contexts to determine the extent to which such distractions degrade model performance on standard medical benchmarks. Six of the 28 tested LLMs achieved board-passing outcomes, with the top-performing models scoring over 15.7% above the passing threshold. When exposed to distractions, accuracy across various model architectures was significantly reduced, by as much as 20.4%, with one model that had previously passed now failing. Both general-purpose and medical open-source models experienced greater performance declines compared to proprietary variants when subjected to the added distractors. While current LLMs demonstrate an impressive ability to answer neurosurgery board-like exam questions, their performance is markedly vulnerable to extraneous, distracting information. These findings underscore the critical need for developing novel mitigation strategies aimed at bolstering LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.
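Abstract (reference translation): The Self-Assessment for Neurological Surgeons (CNS-SANS) questions are widely used by neurosurgery residents to prepare for written board examinations. In recent years, these questions have also served as benchmarks for evaluating the neurosurgical knowledge of large language models (LLMs). The aim of this study is to evaluate the performance of state-of-the-art LLMs on neurosurgery board-style questions and their robustness to the inclusion of distractor statements. A comprehensive evaluation was carried out using 28 large language models. The models were tested on 2,904 neurosurgery examination questions derived from the CNS-SANS. In addition, the study introduced a distraction framework to assess the fragility of these models. The framework incorporated simple, irrelevant distractor statements containing polysemous words that have clinical meanings but are used in non-clinical contexts, in order to determine how far such distractions degrade model performance on standard medical benchmarks. Six of the 28 LLMs achieved board-passing results, with the top-performing models scoring more than 15.7% above the passing threshold. With distractors added, accuracy across various model architectures dropped significantly, by as much as 20.4%, and one model that had previously passed failed. Both general-purpose and medical open-source models showed larger performance declines than proprietary models when the distractors were added. Current LLMs show an impressive ability to answer neurosurgery board-style examination questions, but their performance is markedly vulnerable to extraneous information. These findings highlight the critical need to develop new mitigation strategies that strengthen LLM resilience against in-text distractions, particularly for safe and effective clinical deployment.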
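The distraction framework described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not say where the distractor is placed or how answers are scored, so appending the sentence to the question stem, the exact-match scoring, and all names below (`add_distractor`, `accuracy`, `accuracy_drop`, `EXAMPLE_DISTRACTOR`) are assumptions made for illustration only.

```python
from typing import Callable, Sequence

# Hypothetical distractor: "culture" has a clinical meaning (e.g. a blood
# culture) but is used here in a non-clinical sense. Not taken from the paper.
EXAMPLE_DISTRACTOR = "The patient's neighbor enjoys the local food culture."


def add_distractor(stem: str, distractor: str) -> str:
    """Append an irrelevant sentence to a question stem (placement is an assumption)."""
    return f"{stem.rstrip()} {distractor}"


def accuracy(model: Callable[[str], str],
             stems: Sequence[str],
             answers: Sequence[str]) -> float:
    """Fraction of questions for which the model's answer matches the key exactly."""
    if not stems or len(stems) != len(answers):
        raise ValueError("stems and answers must be non-empty and equally long")
    correct = sum(model(stem) == key for stem, key in zip(stems, answers))
    return correct / len(stems)


def accuracy_drop(model: Callable[[str], str],
                  stems: Sequence[str],
                  answers: Sequence[str],
                  distractor: str) -> float:
    """Baseline accuracy minus accuracy on the same questions with a distractor added."""
    baseline = accuracy(model, stems, answers)
    distracted_stems = [add_distractor(stem, distractor) for stem in stems]
    return baseline - accuracy(model, distracted_stems, answers)
```

Here `model` stands in for any callable that maps a prompt to an answer letter; in practice it would wrap a call to an LLM, and the returned drop is a fraction of questions rather than a percentage.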

Related papers

Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
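Large language models (LLMs) in medicine have achieved impressive capabilities, but a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. It proposes a taxonomy of reasoning-enhancement techniques, categorized into training-time strategies and test-time mechanisms.
Paper reference translation (metadata) (2025-08-01T14:41:31Z)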
Naturalistic Language-related Movie-Watching fMRI Task for Detecting Neurocognitive Decline and Disorder [60.84344168388442]
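Language-related functional magnetic resonance imaging (fMRI) is a promising approach for detecting cognitive decline and early NCD. We examined the effectiveness of this task in 97 older adults living in Hong Kong. The study demonstrates the potential of a naturalistic language-related fMRI task for the early detection of age-related cognitive decline and NCD.
Paper reference translation (metadata) (2025-06-10T16:58:47Z)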
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
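MedR-Bench is a benchmark dataset of 1,453 structured patient cases annotated with reasoning references. We propose a framework comprising three critical components (examination recommendation, diagnostic decision-making, and treatment planning) that simulates the entire patient care journey. Using this benchmark, we evaluated five state-of-the-art LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
Paper reference translation (metadata) (2025-03-06T18:35:39Z)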
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
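Large language models (LLMs) often struggle with open-ended medical questions. This paper proposes a new approach that leverages structured medical reasoning. Our method achieves a factuality score of 85.8, surpassing fine-tuned models.
Paper reference translation (metadata) (2025-03-05T05:24:55Z)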
Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning [3.3482359447109866]
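Large language models (LLMs) have achieved human-level accuracy on medical question-answering (QA) benchmarks. Their limitations in navigating open-ended clinical scenarios have recently been shown. We introduce the Medical Abstraction and Reasoning Corpus (M-ARC). We find that LLMs, including the current state-of-the-art o1 and Gemini models, perform poorly compared with physicians on M-ARC.
Paper reference translation (metadata) (2025-02-05T18:14:27Z)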
LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
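LlaMADRS is a new framework that uses open-source large language models (LLMs) to automate depression severity assessment. The study uses a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring clinical interviews. Tested on 236 real-world interviews, it showed strong correlation with clinical assessments.
Paper reference translation (metadata) (2025-01-07T08:49:04Z)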
SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
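Large language models (LLMs) have been shown to encode clinical knowledge. We propose SemioLLM, an evaluation framework that benchmarks six state-of-the-art models. We show that most LLMs can generate accurate and reliable probabilistic predictions of seizure onset zones in the brain.
Paper reference translation (metadata) (2024-07-03T11:02:12Z)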
Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine [3.471944921180245]
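Large language models (LLMs) show great potential in the medical domain. They are often evaluated using multiple-choice questions (MCQs) modeled on exams such as the USMLE. We created a fictional medical benchmark centered on an imaginary organ, the Glianorex, making it possible to separate memorized knowledge from reasoning ability.
Paper reference translation (metadata) (2024-06-04T15:08:56Z)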
Almanac: Retrieval-Augmented Language Models for Clinical Medicine [1.5505279143287174]
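We developed Almanac, a large language model framework augmented with retrieval capabilities for medical guidelines and treatment recommendations. Its performance on a dataset of novel clinical scenarios, evaluated by a panel of five physicians, shows a significant increase in factuality.
Paper reference translation (metadata) (2023-03-01T02:30:11Z)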
What Do You See in this Patient? Behavioral Testing of Clinical NLP Models [69.09570726777817]
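This paper proposes an extensible testing framework that evaluates the behavior of clinical outcome models with respect to changes in the input. We show that model behavior varies drastically even when fine-tuned on the same data, and that the best-performing models have not always learned the most medically plausible patterns.
Paper reference translation (metadata) (2021-11-30T15:52:04Z)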

The related-papers list is generated automatically from the titles and abstracts of papers on this site.

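This is the information for the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).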