Fugu-MT Paper Translation (Abstract): MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

Paper overview: MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

arxiv url: http://arxiv.org/abs/2312.12806v1
Date: Wed, 20 Dec 2023 07:01:49 GMT
Status: Translation complete
Last updated in system: 2023-12-21 16:41:04.097643
Title: MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models
Title (reference translation): MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models
Authors: Yan Cai, Linlin Wang, Ye Wang, Gerard de Melo, Ya Zhang, Yanfeng Wang, Liang He
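Abstract summary: We introduce MedBench, a comprehensive benchmark for the Chinese medical domain. The benchmark consists of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases. We perform extensive experiments and an in-depth analysis from diverse perspectives, finding that Chinese medical LLMs underperform on this benchmark and that several general-domain LLMs possess considerable medical knowledge.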
Reference score (site-computed attention score): 56.36916128631784
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The emergence of various medical large language models (LLMs) in the medical domain has highlighted the need for unified evaluation standards, as manual evaluation of LLMs proves to be time-consuming and labor-intensive. To address this issue, we introduce MedBench, a comprehensive benchmark for the Chinese medical domain, comprising 40,041 questions sourced from authentic examination exercises and medical reports of diverse branches of medicine. In particular, this benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases encompassing examinations, diagnoses, and treatments. MedBench replicates the educational progression and clinical practice experiences of doctors in Mainland China, thereby establishing itself as a credible benchmark for assessing the mastery of knowledge and reasoning abilities in medical language learning models. We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs surprisingly possess considerable medical knowledge. These findings elucidate both the capabilities and limitations of LLMs within the context of MedBench, with the ultimate goal of aiding the medical research community.
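Abstract (reference translation): The emergence of various medical large language models (LLMs) in the medical domain has highlighted the need for unified evaluation standards, because manual evaluation of LLMs is time-consuming and labor-intensive. To address this issue, we introduce MedBench, a comprehensive benchmark for the Chinese medical domain. Specifically, the benchmark consists of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases covering examinations, diagnoses, and treatments. MedBench replicates the educational progression and clinical practice experience of doctors in Mainland China, establishing itself as a credible benchmark for assessing the mastery of knowledge and reasoning abilities in medical language learning models. We perform extensive experiments and an in-depth analysis from diverse perspectives, with the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs possess a surprising amount of medical knowledge. These findings elucidate the capabilities and limitations of LLMs in the context of MedBench, with the ultimate goal of aiding the medical research community.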

Related papers

KokushiMD-10: Benchmark for Evaluating Large Language Models on Ten Japanese National Healthcare Licensing Examinations [6.453078564406654]
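KokushiMD-10 is the first multimodal benchmark constructed from ten Japanese national healthcare licensing examinations. The benchmark spans multiple fields, including medicine, dentistry, nursing, pharmacy, and allied health professions. It contains over 11,588 real exam questions and incorporates clinical images and expert-annotated rationales to evaluate both textual and visual reasoning.
Paper reference translation (metadata) (2025-06-09T02:26:02Z)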
Polish Medical Exams: A new dataset for cross-lingual medical knowledge transfer assessment [0.865489625605814]
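This study presents a new benchmark dataset based on Polish medical licensing and specialization exams. It contains over 24,000 exam questions, including a subset that forms a parallel Polish-English corpus. The authors evaluate state-of-the-art LLMs, including general-purpose, domain-specific, and Polish-specific models, and compare their performance with that of human medical students.
Paper reference translation (metadata) (2024-11-30T19:02:34Z)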
CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
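CliMedBench is a comprehensive benchmark featuring 14 expert-guided core clinical scenarios. The reliability of this benchmark has been confirmed in several ways.
Paper reference translation (metadata) (2024-10-04T15:15:36Z)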
MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models [55.215061531495984]
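MedBench is a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLMs. First, MedBench assembles the largest evaluation dataset (300,901 questions) to cover 43 clinical specialties. Third, MedBench implements a dynamic evaluation mechanism that prevents shortcut learning and answer memorization.
Paper reference translation (metadata) (2024-06-24T02:25:48Z)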
MultifacetEval: Multifaceted Evaluation to Probe LLMs in Mastering Medical Knowledge [4.8004472307210255]
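Large language models (LLMs) excel across domains and deliver notable performance on medical evaluation benchmarks. However, a significant gap remains between the reported performance and the practical effectiveness in real-world medical scenarios. The authors develop MultifacetEval, a novel evaluation framework for examining the degree and coverage of LLMs in encoding and mastering medical knowledge.
Paper reference translation (metadata) (2024-06-05T04:15:07Z)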
MedKP: Medical Dialogue with Knowledge Enhancement and Clinical Pathway Encoding [48.348511646407026]
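This paper introduces a framework for medical dialogue with knowledge enhancement and clinical pathway encoding. The framework integrates an external knowledge enhancement module, built on a medical knowledge graph, with an internal clinical pathway encoding based on medical entities and physician actions.
Paper reference translation (metadata) (2024-03-11T10:57:45Z)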
Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [59.60384461302662]
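The authors introduce Asclepius, a novel benchmark for evaluating medical multi-modal large language models (Med-MLLMs). Asclepius assesses model capability rigorously and comprehensively in terms of distinct medical specialties and different diagnostic capacities. The authors also provide an in-depth analysis of six Med-MLLMs and compare them with five human specialists.
Paper reference translation (metadata) (2024-02-17T08:04:23Z)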
PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain [24.411904114158673]
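The authors rebuild the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark into PromptCBLUE, a large-scale prompt-tuning benchmark. The benchmark is a suitable test bed and online platform for evaluating the multi-task capabilities of Chinese LLMs on a wide range of biomedical tasks.
Paper reference translation (metadata) (2023-10-22T02:20:38Z)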
CMB: A Comprehensive Medical Benchmark in Chinese [67.69800156990952]
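The authors propose CMB, a Comprehensive Medical Benchmark in Chinese. Traditional Chinese medicine is integral to this evaluation but does not constitute its entirety. They evaluate several prominent large-scale LLMs, including ChatGPT, GPT-4, dedicated Chinese LLMs, and LLMs specialized in the medical domain.
Paper reference translation (metadata) (2023-08-17T07:51:23Z)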

The related papers list is generated automatically from the titles and abstracts of papers on this site.
