Large Language Models Leverage External Knowledge to Extend Clinical
Insight Beyond Language Boundaries
- URL: http://arxiv.org/abs/2305.10163v4
- Date: Tue, 30 Jan 2024 03:58:19 GMT
- Title: Large Language Models Leverage External Knowledge to Extend Clinical
Insight Beyond Language Boundaries
- Authors: Jiageng Wu, Xian Wu, Zhaopeng Qiu, Minghui Li, Yingying Zhang, Yefeng
Zheng, Changzheng Yuan and Jie Yang
- Abstract summary: Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
- Score: 48.48630043740588
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: $\textbf{Objectives}$: Large Language Models (LLMs) such as ChatGPT and
Med-PaLM have excelled in various medical question-answering tasks. However,
these English-centric models encounter challenges in non-English clinical
settings, primarily due to limited clinical knowledge in the respective languages,
a consequence of imbalanced training corpora. We systematically evaluate LLMs
in the Chinese medical context and develop a novel in-context learning
framework to enhance their performance.
$\textbf{Materials and Methods}$: The latest China National Medical Licensing
Examination (CNMLE-2022) served as the benchmark. We collected 53 medical books
and 381,149 medical questions to construct the medical knowledge base and
question bank. The proposed Knowledge and Few-shot Enhancement In-context
Learning (KFE) framework leverages the in-context learning ability of LLMs to
integrate diverse external clinical knowledge sources. We evaluated KFE with
ChatGPT (GPT-3.5), GPT-4, Baichuan2-7B (BC2-7B), and BC2-13B on CNMLE-2022 and
investigated the effectiveness of different pathways for incorporating
medical knowledge into LLMs from seven perspectives.
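The abstract describes KFE only at a high level: retrieve relevant passages from the medical knowledge base and similar solved questions from the question bank, then supply both as in-context input alongside the target question. The Python sketch below illustrates that general pattern; the function names, the toy token-overlap retriever, and the prompt layout are all illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of KFE-style prompt assembly, assuming a toy token-overlap
# retriever. The abstract only says KFE retrieves external clinical knowledge
# and few-shot examples for in-context learning; every name and heuristic
# below is an illustrative assumption, not the authors' code.

def overlap_score(query: str, doc: str) -> int:
    """Toy relevance score: count of shared whitespace-delimited tokens."""
    return len(set(query.split()) & set(doc.split()))

def retrieve(query: str, corpus: list[str], k: int) -> list[str]:
    """Return the k corpus entries most similar to the query."""
    return sorted(corpus, key=lambda doc: overlap_score(query, doc),
                  reverse=True)[:k]

def build_kfe_prompt(question: str,
                     knowledge_base: list[str],
                     question_bank: list[tuple[str, str]],
                     k_knowledge: int = 3,
                     k_shots: int = 2) -> str:
    """Compose retrieved knowledge and solved examples into one prompt."""
    knowledge = retrieve(question, knowledge_base, k_knowledge)
    shot_questions = retrieve(question, [q for q, _ in question_bank], k_shots)
    answers = dict(question_bank)

    parts = ["Relevant medical knowledge:"]
    parts += [f"- {snippet}" for snippet in knowledge]
    parts.append("Solved examples:")
    parts += [f"Q: {q}\nA: {answers[q]}" for q in shot_questions]
    parts.append(f"Q: {question}\nA:")
    return "\n".join(parts)

if __name__ == "__main__":
    kb = ["Aspirin irreversibly inhibits platelet aggregation.",
          "Beta blockers reduce myocardial oxygen demand."]
    bank = [("Which drug irreversibly inhibits platelet aggregation?",
             "Aspirin")]
    print(build_kfe_prompt("Which agent inhibits platelet aggregation?",
                           kb, bank))
```

In the paper's setting, retrieval would run over the 53 medical books and the 381,149-item question bank described above; the overlap heuristic merely stands in for a real retriever (e.g., BM25 or dense embeddings).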
$\textbf{Results}$: Directly applying ChatGPT failed to qualify for the
CNMLE-2022, scoring only 51. When combined with KFE, LLMs of varying sizes
yielded consistent and significant improvements: ChatGPT's performance surged
to 70.04, and GPT-4 achieved the highest score of 82.59. Both results surpass
the qualification threshold (60) and exceed the average human score of 68.70.
KFE also enabled the smaller BC2-13B to pass the examination, showcasing its
great potential in low-resource settings.
$\textbf{Conclusion}$: By synergizing medical knowledge through in-context
learning, LLMs can extend clinical insight beyond language barriers,
significantly reducing language-related disparities in LLM applications and
helping ensure global benefit in healthcare.
Related papers
- CliMedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models in Clinical Scenarios [50.032101237019205]
CliMedBench is a comprehensive benchmark with 14 expert-guided core clinical scenarios.
The reliability of this benchmark has been confirmed in several ways.
arXiv Detail & Related papers (2024-10-04T15:15:36Z)
- Can Large Language Models Logically Predict Myocardial Infarction? Evaluation based on UK Biobank Cohort [10.66506859118868]
Large language models (LLMs) have seen extraordinary advances with applications in clinical decision support.
This study aims to evaluate quantitatively whether universal state-of-the-art LLMs can predict the incidence risk of myocardial infarction (MI) with logical inference.
arXiv Detail & Related papers (2024-09-22T14:57:31Z)
- Multiple Choice Questions and Large Language Models: A Case Study with Fictional Medical Data [3.471944921180245]
We developed a fictional medical benchmark focused on a non-existent gland, the Glianorex.
This approach allowed us to isolate the knowledge of the LLM from its test-taking abilities.
We evaluated various open-source, proprietary, and domain-specific LLMs using these questions in a zero-shot setting.
arXiv Detail & Related papers (2024-06-04T15:08:56Z)
- MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models [56.36916128631784]
We introduce MedBench, a comprehensive benchmark for the Chinese medical domain.
This benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases.
We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in several findings.
arXiv Detail & Related papers (2023-12-20T07:01:49Z)
- PromptCBLUE: A Chinese Prompt Tuning Benchmark for the Medical Domain [24.411904114158673]
We re-build the Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark into a large-scale prompt-tuning benchmark, PromptCBLUE.
Our benchmark is a suitable test-bed and online platform for evaluating the multi-task capabilities of Chinese LLMs on a wide range of biomedical tasks.
arXiv Detail & Related papers (2023-10-22T02:20:38Z)
- Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks serve as a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
However, evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
- CMMLU: Measuring massive multitask language understanding in Chinese [133.70911295934746]
This paper introduces a comprehensive Chinese benchmark that covers various subjects, including natural science, social sciences, engineering, and humanities.
CMMLU fills the gap in evaluating the knowledge and reasoning capabilities of large language models within the Chinese context.
arXiv Detail & Related papers (2023-06-15T15:49:51Z)
- Benchmarking Large Language Models on CMExam -- A Comprehensive Chinese Medical Exam Dataset [31.047827145874844]
We introduce CMExam, sourced from the Chinese National Medical Licensing Examination.
CMExam consists of 60K+ multiple-choice questions for standardized and objective evaluations, as well as solution explanations for model reasoning evaluation in an open-ended manner.
For in-depth analyses of LLMs, we invited medical professionals to label five additional question-wise annotations, including disease groups, clinical departments, medical disciplines, areas of competency, and question difficulty levels.
arXiv Detail & Related papers (2023-06-05T16:48:41Z)
- Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding [12.128991867050487]
Large language models (LLMs) have made significant progress in various domains, including healthcare.
In this study, we evaluate state-of-the-art LLMs within the realm of clinical language understanding tasks.
arXiv Detail & Related papers (2023-04-09T16:31:47Z)