GPBench: A Comprehensive and Fine-Grained Benchmark for Evaluating Large Language Models as General Practitioners
- URL: http://arxiv.org/abs/2503.17599v1
- Date: Sat, 22 Mar 2025 01:02:44 GMT
- Title: GPBench: A Comprehensive and Fine-Grained Benchmark for Evaluating Large Language Models as General Practitioners
- Authors: Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, Xuanna Li, Yuze Tang, Jiexian Qiu, Xiaolin Lu, Hongji Yu, Shuang Chen, Yuhua Bi, Xiaofei Zeng, Yixian Chen, Junrong Chen, Lin Yao,
- Abstract summary: General practitioners (GPs) serve as the cornerstone of primary healthcare systems by providing continuous and comprehensive medical services. Due to the community-oriented nature of their practice, uneven training, and resource gaps, the clinical proficiency among GPs can vary significantly across regions and healthcare settings. Large Language Models (LLMs) have demonstrated great potential in clinical and medical applications, making them a promising tool for supporting general practice. To evaluate how effectively LLMs can make decisions in the daily work of GPs, we designed GPBench, which consists of both test questions from clinical practice and a novel evaluation framework.
- Score: 12.208184074411896
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: General practitioners (GPs) serve as the cornerstone of primary healthcare systems by providing continuous and comprehensive medical services. However, due to the community-oriented nature of their practice, uneven training, and resource gaps, the clinical proficiency among GPs can vary significantly across regions and healthcare settings. Large Language Models (LLMs) have demonstrated great potential in clinical and medical applications, making them a promising tool for supporting general practice. However, most existing benchmarks and evaluation frameworks focus on exam-style assessments, typically multiple-choice questions, and lack comprehensive assessment sets that accurately mirror the real-world scenarios encountered by GPs. To evaluate how effectively LLMs can make decisions in the daily work of GPs, we designed GPBench, which consists of both test questions from clinical practice and a novel evaluation framework. The test set includes multiple-choice questions that assess fundamental knowledge of general practice, as well as realistic, scenario-based problems. All questions are meticulously annotated by experts, incorporating rich fine-grained information related to clinical management. The proposed LLM evaluation framework is based on the competency model for general practice, providing a comprehensive methodology for assessing LLM performance in real-world settings. As the first large-model evaluation set targeting GP decision-making scenarios, GPBench allows us to evaluate current mainstream LLMs. Expert assessment and evaluation reveal that in areas such as disease staging, complication recognition, treatment detail, and medication usage, these models exhibit at least ten major shortcomings. Overall, existing LLMs are not yet suitable for independent use in real-world GP working scenarios without human oversight.
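The benchmark mixes multiple-choice knowledge questions with scenario-based problems scored against expert annotations. As a rough illustration only, the Python sketch below scores a model on a hypothetical multiple-choice item; the item fields and the `ask_model` callable are assumptions for the sketch, not GPBench's actual data format or scoring protocol.

```python
# Minimal sketch of accuracy scoring on GPBench-style multiple-choice items.
# The item schema and ask_model are hypothetical placeholders.
from typing import Callable, Dict, List

def score_mcq(items: List[Dict], ask_model: Callable[[str], str]) -> float:
    """Return the fraction of multiple-choice items the model answers correctly."""
    correct = 0
    for item in items:
        options = "\n".join(f"{label}. {text}" for label, text in item["options"].items())
        prompt = (
            "You are assisting a general practitioner.\n"
            f"Question: {item['question']}\n"
            f"{options}\n"
            "Answer with a single option letter."
        )
        reply = ask_model(prompt).strip().upper()
        # Take the first option letter that appears in the reply.
        predicted = next((ch for ch in reply if ch in item["options"]), None)
        correct += int(predicted == item["answer"])
    return correct / len(items)

if __name__ == "__main__":
    demo_items = [{
        "question": "Which vaccine is routinely recommended for adults aged 65 and over?",
        "options": {"A": "BCG", "B": "Pneumococcal vaccine", "C": "Rotavirus vaccine", "D": "HPV vaccine"},
        "answer": "B",
    }]
    # Stub model that always answers "B"; replace with a real LLM call.
    print(score_mcq(demo_items, lambda prompt: "B"))
```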
Related papers
- Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Models (MedVLMs) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLMs with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - Automating Expert-Level Medical Reasoning Evaluation of Large Language Models [26.702477426812333]
We introduce MedThink-Bench, a benchmark for rigorous, explainable, and scalable assessment of large language models' medical reasoning. We also propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms (a rough sketch of this kind of reference-guided judging loop appears after this list). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs' medical reasoning, advancing their safe and responsible deployment in clinical practice.
arXiv Detail & Related papers (2025-07-10T17:58:26Z) - MedGUIDE: Benchmarking Clinical Decision-Making in Large Language Models [10.46932473088646]
We introduce MedGUIDE, a new benchmark for evaluating Large Language Models (LLMs) on their ability to make guideline-consistent clinical decisions. MedGUIDE is constructed from 55 curated NCCN decision trees across 17 cancer types. We apply a two-stage quality selection process, combining expert-labeled reward models and LLM-as-a-judge ensembles across ten clinical and linguistic criteria, to select 7,747 high-quality samples.
arXiv Detail & Related papers (2025-05-16T18:21:52Z) - Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.
Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.
We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - OphthBench: A Comprehensive Benchmark for Evaluating Large Language Models in Chinese Ophthalmology [7.743511021846898]
Large language models (LLMs) have shown significant promise across various medical applications. We introduce OphthBench, a benchmark designed to assess LLM performance within the context of Chinese ophthalmic practices. This framework allows for a thorough assessment of LLMs' capabilities and provides insights into their practical application in Chinese ophthalmology.
arXiv Detail & Related papers (2025-02-03T11:04:51Z) - Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation [31.061600616994145]
HDCEval is built on a set of fine-grained medical evaluation guidelines developed in collaboration with professional doctors. The framework decomposes complex evaluation tasks into specialized subtasks, each evaluated by expert models. This hierarchical approach ensures that each aspect of the evaluation is handled with expert precision, leading to a significant improvement in alignment with human evaluators.
arXiv Detail & Related papers (2025-01-12T07:30:49Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z) - MedBench: A Comprehensive, Standardized, and Reliable Benchmarking System for Evaluating Chinese Medical Large Language Models [55.215061531495984]
"MedBench" is a comprehensive, standardized, and reliable benchmarking system for Chinese medical LLM.
First, MedBench assembles the largest evaluation dataset (300,901 questions) to cover 43 clinical specialties.
Third, MedBench implements dynamic evaluation mechanisms to prevent shortcut learning and answer remembering.
arXiv Detail & Related papers (2024-06-24T02:25:48Z) - Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build ClinicBench, a benchmark for better understanding large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z) - Does Biomedical Training Lead to Better Medical Performance? [2.3814275542331385]
Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes.
This study investigates the effect of biomedical training in the context of six practical medical tasks, evaluating 25 models.
arXiv Detail & Related papers (2024-04-05T12:51:37Z) - A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [57.88111980149541]
We introduce Asclepius, a novel Med-MLLM benchmark that assesses Med-MLLMs in terms of distinct medical specialties and different diagnostic capacities. Grounded in 3 proposed core principles, Asclepius ensures a comprehensive evaluation by encompassing 15 medical specialties. We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 3 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z) - Can Large Language Models be Trusted for Evaluation? Scalable Meta-Evaluation of LLMs as Evaluators via Agent Debate [74.06294042304415]
We propose ScaleEval, an agent-debate-assisted meta-evaluation framework.
We release the code for our framework, which is publicly available on GitHub.
arXiv Detail & Related papers (2024-01-30T07:03:32Z) - Evaluation of General Large Language Models in Contextually Assessing Semantic Concepts Extracted from Adult Critical Care Electronic Health Record Notes [17.648021186810663]
The purpose of this study was to evaluate the performance of Large Language Models (LLMs) in understanding and processing real-world clinical notes.
The GPT family models have demonstrated considerable efficiency, evidenced by their cost-effectiveness and time-saving capabilities.
arXiv Detail & Related papers (2024-01-24T16:52:37Z) - Systematic Assessment of Factual Knowledge in Large Language Models [48.75961313441549]
This paper proposes a framework to assess the factual knowledge of large language models (LLMs) by leveraging knowledge graphs (KGs).
Our framework automatically generates a set of questions and expected answers from the facts stored in a given KG, and then evaluates the accuracy of LLMs in answering these questions.
arXiv Detail & Related papers (2023-10-18T00:20:50Z) - An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models [22.409334091186995]
Large language models (LLMs) often suffer from hallucinations, leading to overly confident but incorrect judgments.
This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations.
arXiv Detail & Related papers (2023-09-05T09:24:48Z)
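Several of the benchmarks listed above (for example, MedThink-Bench's LLM-w-Ref and MedGUIDE's quality selection) rely on LLM-as-a-Judge style scoring against expert references. The sketch below is a minimal, hypothetical illustration of such a reference-guided judging loop in Python; the prompt wording, the 0-5 scale, and the `ask_judge` callable are assumptions, not any of the listed papers' actual implementations.

```python
# Hypothetical sketch of LLM-as-a-Judge scoring against expert reference rationales.
from typing import Callable, List

def judge_with_references(question: str, candidate_reasoning: str,
                          reference_rationales: List[str],
                          ask_judge: Callable[[str], str]) -> int:
    """Ask a judge model to rate candidate reasoning against expert references (0-5)."""
    refs = "\n".join(f"- {r}" for r in reference_rationales)
    prompt = (
        "You are grading a medical reasoning answer.\n"
        f"Question: {question}\n"
        f"Candidate reasoning: {candidate_reasoning}\n"
        f"Expert reference rationales:\n{refs}\n"
        "Rate how well the candidate agrees with the references on an integer scale "
        "from 0 (contradicts) to 5 (fully consistent). Reply with the number only."
    )
    reply = ask_judge(prompt)
    # Parse the first digit in the reply; default to 0 if none is found.
    digits = [ch for ch in reply if ch.isdigit()]
    return int(digits[0]) if digits else 0

if __name__ == "__main__":
    # Stub judge that always returns "4"; replace with a real judge model call.
    score = judge_with_references(
        "Why start empirical antibiotics before culture results?",
        "Suspected sepsis justifies broad-spectrum coverage without waiting for cultures.",
        ["In suspected sepsis, empirical broad-spectrum antibiotics should begin promptly."],
        lambda prompt: "4",
    )
    print(score)
```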
This list is automatically generated from the titles and abstracts of the papers on this site.