Fugu-MT paper translation (abstract): Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis

arxiv url: http://arxiv.org/abs/2506.02987v1
Date: Tue, 03 Jun 2025 15:25:38 GMT
Status: translation complete
System update date: 2025-06-05 01:42:09.43361
Title: Performance of leading large language models in May 2025 in Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis
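Title (reference translation): Performance of leading large language models in May 2025 on Membership of the Royal College of General Practitioners-style examination questions: a cross-sectional analysis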
Authors: Richard Armitage
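Abstract summary: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked with answering 100 randomly chosen questions from the Royal College of General Practitioners GP SelfTest. The total scores of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were 99.0%, 95.0%, 95.0%, and 95.0%, respectively.
Reference score (attention score computed by this site): 0.0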
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Background: Large language models (LLMs) have demonstrated substantial potential to support clinical practice. Other than Chat GPT4 and its predecessors, few LLMs, especially those of the leading and more powerful reasoning model class, have been subjected to medical specialty examination questions, including in the domain of primary care. This paper aimed to test the capabilities of leading LLMs as of May 2025 (o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro) in primary care education, specifically in answering Member of the Royal College of General Practitioners (MRCGP) style examination questions. Methods: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked to answer 100 randomly chosen multiple choice questions from the Royal College of General Practitioners GP SelfTest on 25 May 2025. Questions included textual information, laboratory results, and clinical images. Each model was prompted to answer as a GP in the UK and was provided with full question information. Each question was attempted once by each model. Responses were scored against correct answers provided by GP SelfTest. Results: The total score of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro was 99.0%, 95.0%, 95.0%, and 95.0%, respectively. The average peer score for the same questions was 73.0%. Discussion: All models performed remarkably well, and all substantially exceeded the average performance of GPs and GP registrars who had answered the same questions. o3 demonstrated the best performance, while the performances of the other leading models were comparable with each other and were not substantially lower than that of o3. These findings strengthen the case for LLMs, particularly reasoning models, to support the delivery of primary care, especially those that have been specifically trained on primary care clinical data.
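Abstract (reference translation): Background: Large language models (LLMs) have demonstrated substantial potential to support clinical practice. Other than Chat GPT4 and its predecessors, few LLMs, especially those of the leading and more powerful reasoning model class, have been tested on medical specialty examination questions, including in the domain of primary care. This paper aimed to test the capabilities of the leading LLMs as of May 2025 (o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro) in primary care education, specifically in answering Membership of the Royal College of General Practitioners (MRCGP)-style examination questions. Methods: o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were tasked with answering 100 randomly chosen multiple-choice questions from the Royal College of General Practitioners GP SelfTest on 25 May 2025. Questions included textual information, laboratory results, and clinical images. Each model was prompted to answer as a GP in the UK and was provided with the full question information. Each question was attempted once by each model. Responses were scored against the correct answers provided by GP SelfTest. Results: The total scores of o3, Claude Opus 4, Grok3, and Gemini 2.5 Pro were 99.0%, 95.0%, 95.0%, and 95.0%, respectively. The average peer score for the same questions was 73.0%. Discussion: All models performed remarkably well, and all substantially exceeded the average performance of the GPs and GP registrars who had answered the same questions. o3 demonstrated the best performance, while the performances of the other leading models were comparable with each other and were not substantially lower than that of o3. These findings strengthen the case for LLMs, particularly reasoning models, to support the delivery of primary care, especially those that have been specifically trained on primary care clinical data.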
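The scoring step described in the Methods (each response compared once against the answer key, then reported as a percentage and set against the peer average) can be sketched as follows. This is a minimal illustration rather than the author's actual procedure: the function names `percent_correct` and `margin_over_peers` are hypothetical, and the exact-match comparison on option letters is an assumption. The only data used are the totals and the peer average reported in the abstract.

```python
from typing import Dict, List


def percent_correct(responses: List[str], answer_key: List[str]) -> float:
    """Return the percentage of responses that match the answer key.

    Assumption: each answer is a single option label (e.g. "A"), compared
    case-insensitively after trimming whitespace.
    """
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    if not answer_key:
        raise ValueError("answer_key must not be empty")
    correct = sum(
        r.strip().upper() == k.strip().upper()
        for r, k in zip(responses, answer_key)
    )
    return 100.0 * correct / len(answer_key)


def margin_over_peers(model_scores: Dict[str, float], peer_average: float) -> Dict[str, float]:
    """Return each model's score minus the peer average, in percentage points."""
    return {name: round(score - peer_average, 1) for name, score in model_scores.items()}


# Totals and peer average as reported in the abstract.
REPORTED_SCORES = {
    "o3": 99.0,
    "Claude Opus 4": 95.0,
    "Grok3": 95.0,
    "Gemini 2.5 Pro": 95.0,
}
PEER_AVERAGE = 73.0

margins = margin_over_peers(REPORTED_SCORES, PEER_AVERAGE)

if __name__ == "__main__":
    for name, margin in margins.items():
        print(f"{name}: {margin:+.1f} percentage points vs. peer average")
```

With 100 questions, each question is worth one percentage point, so the gap between o3 and the other three models corresponds to four questions.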

Related papers

MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks [47.486705282473984]
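Large language models (LLMs) achieve near-perfect scores on medical exams. These evaluations inadequately reflect the complexity and diversity of real-world clinical practice. MedHELM is an evaluation framework for assessing LLM performance on medical tasks.
Paper reference translation (metadata) (2025-05-26T22:55:49Z)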
It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education [0.7771252627207672]
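The performance of large language models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and LLama-3-70B-instruct) and found an average performance drop of 39.43% on free-response questions.
Paper reference translation (metadata) (2025-03-13T19:42:04Z)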
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
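MedR-Bench is a benchmark dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework covering three critical components (examination recommendation, diagnostic decision-making, and treatment planning) that simulates the entire patient care journey. Using this benchmark, we evaluated five state-of-the-art LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
Paper reference translation (metadata) (2025-03-06T18:35:39Z)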
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
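Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach that uses structured medical reasoning. Our method achieves a factuality score of 85.8, surpassing fine-tuned models.
Paper reference translation (metadata) (2025-03-05T05:24:55Z)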
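Retrieval Augmented Generation for 10 Large Language Models and its Generalizability in Assessing Medical Fitness [4.118721833273984]
Large language models (LLMs) show potential for medical applications but often lack specialized clinical knowledge. Retrieval Augmented Generation (RAG) allows customization with domain-specific information, making it suitable for healthcare. This study evaluated the accuracy, consistency, and safety of RAG models in determining fitness for surgery and providing preoperative instructions.
Paper reference translation (metadata) (2024-10-11T00:34:20Z)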
Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
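MedS-Bench is a benchmark for evaluating the performance of large language models (LLMs) in clinical contexts. MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendation, diagnosis, named entity recognition, and medical concept explanation. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 112 tasks.
Paper reference translation (metadata) (2024-08-22T17:01:34Z)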
Pattern Recognition or Medical Knowledge? The Problem with Multiple-Choice Questions in Medicine [3.471944921180245]
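Large language models (LLMs) show significant potential in the medical domain. They are often evaluated using multiple-choice questions (MCQs) modeled on exams such as the USMLE. We created a fictional medical benchmark centered on an imaginary organ, the Glianorex, which allows memorized knowledge to be separated from reasoning ability.
Paper reference translation (metadata) (2024-06-04T15:08:56Z)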
GPT-4 passes most of the 297 written Polish Board Certification Examinations [0.5461938536945723]
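This study evaluated the performance of three Generative Pretrained Transformer (GPT) models on the Polish Board Certification Exam (Państwowy Egzamin Specjalizacyjny, PES). The GPT models varied significantly, displaying excellence in exams for certain specialties while completely failing others.
Paper reference translation (metadata) (2024-04-29T09:08:22Z)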
BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text [82.7001841679981]
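BioMedLM is a 2.7 billion parameter GPT-style autoregressive model trained exclusively on PubMed abstracts and full articles. When fine-tuned, BioMedLM can produce strong multiple-choice biomedical question-answering results that are competitive with larger models. BioMedLM can also be fine-tuned to produce useful answers to patient questions on medical topics.
Paper reference translation (metadata) (2024-03-27T10:18:21Z)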
Can Generalist Foundation Models Outcompete Special-Purpose Tuning? Case Study in Medicine [89.46836590149883]
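This study builds on prior work examining GPT-4's capabilities on medical challenge benchmarks in the absence of special training. We find that prompting innovation can unlock deeper specialist capabilities, and show that GPT-4 easily tops prior leading results on medical benchmarks. With Medprompt, GPT-4 achieves state-of-the-art results on all nine benchmark datasets in the MultiMedQA suite.
Paper reference translation (metadata) (2023-11-28T03:16:12Z)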
A Comparative Study of Open-Source Large Language Models, GPT-4 and Claude 2: Multiple-Choice Test Taking in Nephrology [0.6213359027997152]
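This study was conducted to evaluate the ability of LLMs to provide correct answers to NephSAP multiple-choice questions. The findings may have significant implications for future medical training and patient care.
Paper reference translation (metadata) (2023-08-09T05:01:28Z)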
PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
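Large language models (LLMs) have shown remarkable capabilities in natural language understanding. However, LLMs struggle in domains that require precision, such as medical applications, because they lack domain-specific knowledge. We describe the procedure for building PMC-LLaMA, a powerful open-source language model designed specifically for medical applications.
Paper reference translation (metadata) (2023-04-27T18:29:05Z)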
GPT-4 can pass the Korean National Licensing Examination for Korean Medicine Doctors [9.374652839580182]
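This study examined the capabilities of GPT-4 in traditional Korean medicine (TKM). We optimized prompts with Chinese-term annotation, English translation of questions and instructions, exam-optimized instruction, and self-consistency. GPT-4 with optimized prompts achieved 66.18% accuracy, surpassing both the examination's average pass mark of 60% and the 40% minimum for each subject.
Paper reference translation (metadata) (2023-03-31T05:43:21Z)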
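The two-part pass rule mentioned in the last entry (an average pass mark of 60% plus a 40% minimum in every subject) can be sketched as below. This is an illustrative reading of the summary, not the examination board's official scoring procedure: the function name `passes_exam`, the unweighted mean across subjects, and the subject names and scores in the usage lines are all hypothetical.

```python
from typing import Dict


def passes_exam(
    subject_scores: Dict[str, float],
    average_pass_mark: float = 60.0,
    subject_minimum: float = 40.0,
) -> bool:
    """Return True if both pass conditions hold.

    Assumption: the overall score is the unweighted mean of the subject
    percentages; a real exam may weight subjects by question count.
    """
    if not subject_scores:
        raise ValueError("subject_scores must not be empty")
    overall = sum(subject_scores.values()) / len(subject_scores)
    meets_average = overall >= average_pass_mark
    meets_minimums = all(score >= subject_minimum for score in subject_scores.values())
    return meets_average and meets_minimums


# Hypothetical subject scores, for illustration only.
example_pass = passes_exam({"subject_a": 70.0, "subject_b": 55.0})
example_fail_minimum = passes_exam({"subject_a": 90.0, "subject_b": 35.0})
example_fail_average = passes_exam({"subject_a": 50.0, "subject_b": 60.0})
```

The second example shows why both conditions matter: its mean is above 60%, yet it fails because one subject falls below the 40% floor.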

The related-paper list is generated automatically from the titles and abstracts of the papers on this site.

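This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any outcome arising from its use.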