Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions
- URL: http://arxiv.org/abs/2402.18060v4
- Date: Tue, 25 Jun 2024 21:17:09 GMT
- Title: Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions
- Authors: Hanjie Chen, Zhouxiang Fang, Yash Singla, Mark Dredze,
- Abstract summary: We construct two new datasets: JAMA Clinical Challenge and Medbullets.
JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions.
We evaluate seven LLMs on the two datasets using various prompts.
- Score: 19.436999992810797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLMs have demonstrated impressive performance in answering medical questions, such as achieving passing scores on medical licensing examinations. However, medical board exam or general clinical questions do not capture the complexity of realistic clinical cases. Moreover, the lack of reference explanations means we cannot easily evaluate the reasoning of model decisions, a crucial component of supporting doctors in making complex medical decisions. To address these challenges, we construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. Both datasets are structured as multiple-choice question-answering tasks, accompanied by expert-written explanations. We evaluate seven LLMs on the two datasets using various prompts. Experiments demonstrate that our datasets are harder than previous benchmarks. Human and automatic evaluations of model-generated explanations provide insights into the promise and deficiency of LLMs for explainable medical QA.
Related papers
- GEMeX: A Large-Scale, Groundable, and Explainable Medical VQA Benchmark for Chest X-ray Diagnosis [44.76975131560712]
We introduce a large-scale, Groundable, and Explainable Medical VQA benchmark for chest X-ray diagnosis (GEMeX)
A multi-modal explainability mechanism offers detailed visual and textual explanations for each question-answer pair.
Four distinct question types, open-ended, closed-ended, single-choice, and multiple-choice, better reflect diverse clinical needs.
arXiv Detail & Related papers (2024-11-25T07:36:46Z) - RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions [3.182594503527438]
We present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM.
We show that the LLM is more cost-efficient for generating "ideal" QA pairs.
arXiv Detail & Related papers (2024-08-16T09:32:43Z) - MedREQAL: Examining Medical Knowledge Recall of Large Language Models via Question Answering [5.065947993017158]
Large Language Models (LLMs) have demonstrated an impressive ability to encode knowledge during pre-training on large text corpora.
We examine the capability of LLMs to exhibit medical knowledge recall by constructing a novel dataset derived from systematic reviews.
arXiv Detail & Related papers (2024-06-09T16:33:28Z) - Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark ClinicBench to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z) - Medical Question Summarization with Entity-driven Contrastive Learning [12.008269098530386]
This paper proposes a novel medical question summarization framework using entity-driven contrastive learning (ECL)
ECL employs medical entities in frequently asked questions (FAQs) as focuses and devises an effective mechanism to generate hard negative samples.
We find that some MQA datasets suffer from serious data leakage problems, such as the iCliniq dataset's 33% duplicate rate.
arXiv Detail & Related papers (2023-04-15T00:19:03Z) - VBridge: Connecting the Dots Between Features, Explanations, and Data
for Healthcare Models [85.4333256782337]
VBridge is a visual analytics tool that seamlessly incorporates machine learning explanations into clinicians' decision-making workflow.
We identified three key challenges, including clinicians' unfamiliarity with ML features, lack of contextual information, and the need for cohort-level evidence.
We demonstrated the effectiveness of VBridge through two case studies and expert interviews with four clinicians.
arXiv Detail & Related papers (2021-08-04T17:34:13Z) - Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching.
We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders.
We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z) - Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex
Healthcare Question Answering [89.76059961309453]
HeadQA dataset contains multiple-choice questions authorized for the public healthcare specialization exam.
These questions are the most challenging for current QA systems.
We present a Multi-step reasoning with Knowledge extraction framework (MurKe)
We are striving to make full use of off-the-shelf pre-trained models.
arXiv Detail & Related papers (2020-08-06T02:47:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.