Related papers: PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models

URL: http://arxiv.org/abs/2412.06287v3
Date: Fri, 28 Feb 2025 07:54:16 GMT
Title: PediaBench: A Comprehensive Chinese Pediatric Dataset for Benchmarking Large Language Models
Authors: Qian Zhang, Panfeng Chen, Jiali Li, Linkun Feng, Shuyu Liu, Heng Zhao, Mei Chen, Hui Li, Yanhao Wang,
Abstract summary: We construct PediaBench, the first Chinese pediatric dataset for LLM evaluation.<n>It contains 4,117 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups.<n>It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess the proficiency of an LLM.
Score: 15.568564652381408
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: The emergence of Large Language Models (LLMs) in the medical domain has stressed a compelling need for standard datasets to evaluate their question-answering (QA) performance. Although there have been several benchmark datasets for medical QA, they either cover common knowledge across different departments or are specific to another department rather than pediatrics. Moreover, some of them are limited to objective questions and do not measure the generation capacity of LLMs. Therefore, they cannot comprehensively assess the QA ability of LLMs in pediatrics. To fill this gap, we construct PediaBench, the first Chinese pediatric dataset for LLM evaluation. Specifically, it contains 4,117 objective questions and 1,632 subjective questions spanning 12 pediatric disease groups. It adopts an integrated scoring criterion based on different difficulty levels to thoroughly assess the proficiency of an LLM in instruction following, knowledge understanding, clinical case analysis, etc. Finally, we validate the effectiveness of PediaBench with extensive experiments on 20 open-source and commercial LLMs. Through an in-depth analysis of experimental results, we offer insights into the ability of LLMs to answer pediatric questions in the Chinese context, highlighting their limitations for further improvements. Our code and data are published at https://github.com/ACMISLab/PediaBench.

Related papers

MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian.<n>We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Fact or Guesswork? Evaluating Large Language Model's Medical Knowledge with Structured One-Hop Judgment [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their ability to directly recall and apply factual medical knowledge remains under-explored. Most existing medical QA benchmarks assess complex reasoning or multi-hop inference, making it difficult to isolate LLMs' inherent medical knowledge from their reasoning capabilities. We introduce the Medical Knowledge Judgment, a dataset specifically designed to measure LLMs' one-hop factual medical knowledge.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
A Benchmark for Long-Form Medical Question Answering [4.815957808858573]
There is a lack of benchmarks for evaluating large language models (LLMs) in long-form medical question answering (QA) Most existing medical QA evaluation benchmarks focus on automatic metrics and multiple-choice questions. In this work, we introduce a new publicly available benchmark featuring real-world consumer medical questions with long-form answer evaluations annotated by medical doctors.
arXiv Detail & Related papers (2024-11-14T22:54:38Z)
CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios. The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z)
A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor? [33.70022886795487]
OpenAI's o1 stands out as the first model with a chain-of-thought technique using reinforcement learning strategies. This report provides a comprehensive exploration of o1 on different medical scenarios, examining 3 key aspects: understanding, reasoning, and multilinguality.
arXiv Detail & Related papers (2024-09-23T17:59:43Z)
Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts. MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
Performance of large language models in numerical vs. semantic medical knowledge: Benchmarking on evidence-based Q&As [1.0034156461900003]
Large language models (LLMs) show promising results in many aspects of language-based clinical practice. We used a comprehensive medical knowledge graph (encompassed data from more than 50,00 peer-reviewed articles) and created the "EBMQA" We benchmarked this dataset using more than 24,500 questions on two state-of-the-art LLMs: Chat-GPT4 and Claude3-Opus. We found that both LLMs excelled more in semantic than numerical QAs, with Claude3 surpassing GPT4 in numerical QAs.
arXiv Detail & Related papers (2024-06-06T08:41:46Z)
Multiple Choice Questions and Large Languages Models: A Case Study with Fictional Medical Data [3.471944921180245]
We developed a fictional medical benchmark focused on a non-existent gland, the Glianorex. This approach allowed us to isolate the knowledge of the LLM from its test-taking abilities. We evaluated various open-source, proprietary, and domain-specific LLMs using these questions in a zero-shot setting.
arXiv Detail & Related papers (2024-06-04T15:08:56Z)
Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark ClinicBench to better understand large language models (LLMs) in the clinic. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. We then construct six novel datasets and clinical tasks that are complex but common in real-world practice. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z)
MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain. This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, and real-world clinic cases. We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z)
Quantifying Self-diagnostic Atomic Knowledge in Chinese Medical Foundation Model: A Computational Analysis [55.742339781494046]
Foundation Models (FMs) have the potential to revolutionize the way users self-diagnose through search engines by offering direct and efficient suggestions. Recent studies primarily focused on the quality of FMs evaluated by GPT-4 or their ability to pass medical exams. No studies have quantified the extent of self-diagnostic atomic knowledge stored in FMs' memory.
arXiv Detail & Related papers (2023-10-18T05:42:22Z)
Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT) LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules. We found that medical textbooks as a retrieval corpus is proven to be a more effective knowledge database than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset [31.047827145874844]
We introduce CMExam, sourced from the Chinese National Medical Licensing Examination. CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner. For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels.
arXiv Detail & Related papers (2023-06-05T16:48:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.