Fugu-MT Paper Translation (Summary): Comparative Analysis of Large Language Models in Healthcare

Paper summary: Comparative Analysis of Large Language Models in Healthcare

arxiv url: http://arxiv.org/abs/2604.10316v1
Date: Sat, 11 Apr 2026 18:47:54 GMT
Status: Translation complete
Last updated in system: 2026-04-14 20:13:15.952528
Title: Comparative Analysis of Large Language Models in Healthcare
Title (reference translation): Comparative Analysis of Large Language Models in Healthcare
Authors: Subin Santhosh, Farwa Abbas, Hussain Ahmad, Claudia Szabo
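Abstract summary: Large language models (LLMs) are transforming artificial intelligence applications in healthcare. Their deployment in high-stakes clinical environments raises critical concerns regarding accuracy, reliability, and patient safety. This study addresses the need for a standardized comparative evaluation of LLMs in medical settings.
Reference score (attention score computed independently by the site): 1.9704270315085601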
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Background: Large Language Models (LLMs) are transforming artificial intelligence applications in healthcare due to their ability to understand, generate, and summarize complex medical text. They offer valuable support to clinicians, researchers, and patients, yet their deployment in high-stakes clinical environments raises critical concerns regarding accuracy, reliability, and patient safety. Despite substantial attention in recent years, standardized benchmarking of LLMs for medical applications has been limited. Objective: This study addresses the need for a standardized comparative evaluation of LLMs in medical settings. Method: We evaluate multiple models, including ChatGPT, LLaMA, Grok, Gemini, and ChatDoctor, on core medical tasks such as patient note summarization and medical question answering, using the open-access datasets, MedMCQA, PubMedQA, and Asclepius, and assess performance through a combination of linguistic and task-specific metrics. Results: The results indicate that domain-specific models, such as ChatDoctor, excel in contextual reliability, producing medically accurate and semantically aligned text, whereas general-purpose models like Grok and LLaMA perform better in structured question-answering tasks, demonstrating higher quantitative accuracy. This highlights the complementary strengths of domain-specific and general-purpose LLMs depending on the medical task. Conclusion: Our findings suggest that LLMs can meaningfully support medical professionals and enhance clinical decision-making; however, their safe and effective deployment requires adherence to ethical standards, contextual accuracy, and human oversight in relevant cases. These results underscore the importance of task-specific evaluation and cautious integration of LLMs into healthcare workflows.
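Abstract (reference translation): Background: Large language models (LLMs) are transforming artificial intelligence applications in healthcare through their ability to understand, generate, and summarize complex medical text. They offer valuable support to clinicians, researchers, and patients, but their deployment in high-stakes clinical environments raises critical concerns regarding accuracy, reliability, and patient safety. Although attention has grown in recent years, standardized benchmarking of LLMs for medical applications remains limited. Objective: This study addresses the need for a standardized comparative evaluation of LLMs in medical settings. Method: Multiple models, including ChatGPT, LLaMA, Grok, Gemini, and ChatDoctor, are evaluated on core medical tasks such as patient note summarization and medical question answering, using the open-access datasets MedMCQA, PubMedQA, and Asclepius, with performance assessed through a combination of linguistic and task-specific metrics. Results: Domain-specific models such as ChatDoctor excel in contextual reliability, producing medically accurate and semantically aligned text, whereas general-purpose models such as Grok and LLaMA perform better on structured question-answering tasks, showing higher quantitative accuracy. This highlights the complementary strengths of domain-specific and general-purpose LLMs depending on the medical task. Conclusion: LLMs can meaningfully support medical professionals and enhance clinical decision-making, but their safe and effective deployment requires adherence to ethical standards, contextual accuracy, and human oversight in relevant cases. These results underscore the importance of task-specific evaluation and cautious integration of LLMs into healthcare workflows.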

Related papers