CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation
- URL: http://arxiv.org/abs/2502.11703v1
- Date: Mon, 17 Feb 2025 11:40:48 GMT
- Title: CMQCIC-Bench: A Chinese Benchmark for Evaluating Large Language Models in Medical Quality Control Indicator Calculation
- Authors: Guangya Yu, Yanhao Li, Zongying Jiang, Yuxiong Jin, Li Dai, Yupian Lin, Ruihui Hou, Weiyan Zhang, Yongqi Fan, Qi Ye, Jingping Liu, Tong Ruan
- Abstract summary: We propose an open-source Chinese electronic medical records (EMRs)-based dataset (CMQCIC-Bench) for the Medical Quality Control Indicator Calculation (MQCIC) task.
We propose a semi-automatic method to enhance the rule representation and a Clinical Facts-based Inferential Rule (CF-IR) method that disentangles the clinical fact verification and inferential rule reasoning actions.
Our findings reveal that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks.
- Score: 10.647090420719666
- License:
- Abstract: Medical quality control indicators are essential to assess the qualifications of healthcare institutions for medical services. With the impressive performance of large language models (LLMs) like GPT-4 in the medical field, leveraging these technologies for the Medical Quality Control Indicator Calculation (MQCIC) presents a promising approach. In this work, (1) we introduce a real-world task, MQCIC, and propose an open-source Chinese electronic medical records (EMRs)-based dataset (CMQCIC-Bench) comprising 785 instances and 76 indicators. (2) We propose a semi-automatic method to enhance the rule representation. Then we propose the Clinical Facts-based Inferential Rule (CF-IR) method, which disentangles the clinical fact verification and inferential rule reasoning actions. (3) We conduct comprehensive experiments on 20 representative LLMs, covering general and medical models. Our findings reveal that CF-IR outperforms Chain-of-Thought methods in MQCIC tasks. (4) We conduct an error analysis and investigate the capabilities of clinical fact verification and inferential rule reasoning, providing insights for further improving performance on MQCIC. The dataset and code are available at https://anonymous.4open.science/r/C-MQCIC-1151.
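To make the disentanglement concrete, below is a minimal sketch of a two-stage "verify clinical facts, then reason over the rule" pipeline in the spirit of CF-IR as described above; the `llm` callable, the prompt wording, and the fact/rule format are illustrative assumptions, not the authors' released implementation.

```python
# Minimal, illustrative sketch of a CF-IR-style two-stage pipeline.
# The `llm` callable, prompts, and rule/fact representation are assumptions
# made for illustration; they are not taken from the CMQCIC-Bench code.
from typing import Callable, List

def cf_ir_indicator_check(
    llm: Callable[[str], str],   # any text-in/text-out LLM interface
    emr_text: str,               # one EMR instance
    clinical_facts: List[str],   # atomic facts the indicator rule depends on
    inferential_rule: str,       # the quality-control rule in natural language
) -> str:
    # Stage 1: clinical fact verification -- check each fact against the EMR
    # independently, instead of reasoning over the raw record in one pass.
    verified = []
    for fact in clinical_facts:
        prompt = (
            "Read the following electronic medical record and state whether "
            "the fact holds. Answer only 'true', 'false', or 'not mentioned'.\n"
            f"EMR:\n{emr_text}\n\nFact: {fact}\nAnswer:"
        )
        verified.append(f"{fact}: {llm(prompt).strip()}")

    # Stage 2: inferential rule reasoning -- apply the indicator rule to the
    # verified facts only, which separates retrieval from reasoning.
    reasoning_prompt = (
        "Given the verified clinical facts below, decide whether the case "
        "satisfies the quality-control indicator rule. Answer 'yes' or 'no' "
        "with a brief justification.\n"
        f"Rule: {inferential_rule}\n"
        "Verified facts:\n" + "\n".join(verified) + "\nAnswer:"
    )
    return llm(reasoning_prompt)
```

Verifying each fact separately gives the reasoning stage a small, auditable context rather than the full EMR, which is the separation of actions the abstract describes.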
Related papers
- Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization [2.380499804323775]
We introduce a Multi-Task Learning framework that jointly trains CQA models for both answer extraction and medical categorization.
In addition to predicting answer spans, our model classifies responses into five standardized medical categories: Diagnosis, Medication, Symptoms, Procedure, and Lab Reports.
Results show that MTL improves F1-score by 2.2% compared to standard fine-tuning, while achieving 90.7% accuracy in answer categorization.
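As a rough illustration of the kind of joint setup this summary describes, the sketch below puts a span-extraction head and a five-way category head on top of a shared encoder and combines their losses; the encoder choice, head shapes, and loss weighting are assumptions rather than the paper's exact architecture.

```python
# Hedged sketch of a joint answer-extraction + categorization head for
# multi-task CQA fine-tuning. Hidden size, pooling, and the loss weight
# `alpha` are illustrative assumptions, not the paper's reported setup.
import torch
import torch.nn as nn

class JointCQAHead(nn.Module):
    def __init__(self, hidden_size: int = 768, num_categories: int = 5):
        super().__init__()
        self.span_head = nn.Linear(hidden_size, 2)               # start/end logits per token
        self.cls_head = nn.Linear(hidden_size, num_categories)   # Diagnosis, Medication, ...

    def forward(self, token_embeddings: torch.Tensor):
        # token_embeddings: (batch, seq_len, hidden_size) from any shared encoder
        start_logits, end_logits = self.span_head(token_embeddings).split(1, dim=-1)
        category_logits = self.cls_head(token_embeddings[:, 0])  # pooled first token
        return start_logits.squeeze(-1), end_logits.squeeze(-1), category_logits

def joint_loss(start_logits, end_logits, category_logits,
               start_pos, end_pos, category, alpha: float = 0.5):
    # Both tasks share the encoder and are optimized together.
    ce = nn.CrossEntropyLoss()
    span_loss = (ce(start_logits, start_pos) + ce(end_logits, end_pos)) / 2
    return span_loss + alpha * ce(category_logits, category)
```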
arXiv Detail & Related papers (2025-02-18T18:20:37Z) - Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking [58.25862290294702]
We present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow.
We also propose MedChain-Agent, an AI system that integrates a feedback mechanism and a MCase-RAG module to learn from previous cases and adapt its responses.
arXiv Detail & Related papers (2024-12-02T15:25:02Z) - Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language model (LLM) reasoning.
Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z) - CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z) - MedQA-CS: Benchmarking Large Language Models Clinical Skills Using an AI-SCE Framework [9.747685145146836]
We introduce MedQA-CS, an AI-SCE framework inspired by medical education's Objective Structured Clinical Examinations (OSCEs).
Our contributions include developing MedQA-CS, a comprehensive evaluation framework with publicly available data and expert annotations.
Our experiments show that MedQA-CS is a more challenging benchmark for evaluating clinical skills than traditional multiple-choice QA benchmarks.
arXiv Detail & Related papers (2024-10-02T13:47:17Z) - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z) - K-QA: A Real-World Medical Q&A Benchmark [12.636564634626422]
We construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health.
We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements.
We evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes.
arXiv Detail & Related papers (2024-01-25T20:11:04Z) - MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which yield several key findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z) - PMC-LLaMA: Towards Building Open-source Language Models for Medicine [62.39105735933138]
Large Language Models (LLMs) have showcased remarkable capabilities in natural language understanding.
LLMs struggle in domains that require precision, such as medical applications, due to their lack of domain-specific knowledge.
We describe the procedure for building a powerful, open-source language model specifically designed for medical applications, termed PMC-LLaMA.
arXiv Detail & Related papers (2023-04-27T18:29:05Z) - Large Language Models for Biomedical Knowledge Graph Construction: Information extraction from EMR notes [0.0]
We propose an end-to-end machine learning solution based on large language models (LLMs).
The entities used in the KG construction process are diseases, factors, treatments, and manifestations that co-occur in the patient while experiencing the disease.
The application of the proposed methodology is demonstrated on age-related macular degeneration.
arXiv Detail & Related papers (2023-01-29T15:52:33Z) - Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.