LongHealth: A Question Answering Benchmark with Long Clinical Documents
- URL: http://arxiv.org/abs/2401.14490v1
- Date: Thu, 25 Jan 2024 19:57:00 GMT
- Title: LongHealth: A Question Answering Benchmark with Long Clinical Documents
- Authors: Lisa Adams, Felix Busch, Tianyu Han, Jean-Baptiste Excoffier, Matthieu
Ortala, Alexander Löser, Hugo JWL. Aerts, Jakob Nikolas Kather, Daniel
Truhn, Keno Bressem
- Abstract summary: We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases.
The benchmark challenges LLMs with 400 multiple-choice questions in three categories: information extraction, negation, and sorting.
We evaluated nine open-source LLMs with a minimum context length of 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison.
- Score: 36.05587855811346
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background: Recent advancements in large language models (LLMs) offer
potential benefits in healthcare, particularly in processing extensive patient
records. However, existing benchmarks do not fully assess LLMs' capability in
handling real-world, lengthy clinical data.
Methods: We present the LongHealth benchmark, comprising 20 detailed
fictional patient cases across various diseases, with each case containing
5,090 to 6,754 words. The benchmark poses 400 multiple-choice questions in
three categories (information extraction, negation, and sorting) that require
LLMs to extract and interpret information from large clinical documents.
Results: We evaluated nine open-source LLMs with a minimum context length of 16,000 tokens
and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for
comparison. The highest accuracy was observed for Mixtral-8x7B-Instruct-v0.1,
particularly in tasks focused on information retrieval from single and multiple
patient documents. However, all models struggled significantly in tasks
requiring the identification of missing information, highlighting a critical
area for improvement in clinical data interpretation.
Conclusion: While LLMs show considerable potential for processing long
clinical documents, their current accuracy levels are insufficient for reliable
clinical use, especially in scenarios requiring the identification of missing
information. The LongHealth benchmark provides a more realistic assessment of
LLMs in a healthcare setting and highlights the need for further model
refinement for safe and effective clinical application.
We make the benchmark and evaluation code publicly available.
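As a rough illustration of how accuracy on a multiple-choice benchmark of this kind could be tallied per question category, here is a minimal Python sketch. The record fields (`category`, `answer`, `prediction`) and the sample records are hypothetical; they are not taken from the LongHealth data or its published evaluation code.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: fraction of correct answers} for multiple-choice records.

    Each record is a dict with hypothetical keys:
    'category', 'answer' (gold option), and 'prediction' (model's option).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        if rec["prediction"] == rec["answer"]:
            correct[rec["category"]] += 1
    return {cat: correct[cat] / total[cat] for cat in total}


# Made-up example records, for illustration only.
sample = [
    {"category": "information extraction", "answer": "A", "prediction": "A"},
    {"category": "information extraction", "answer": "C", "prediction": "B"},
    {"category": "negation", "answer": "E", "prediction": "E"},
    {"category": "sorting", "answer": "D", "prediction": "D"},
]

scores = accuracy_by_category(sample)
print(scores)
```

Reporting accuracy separately per category, rather than as one pooled number, keeps a weak category (such as questions about missing information) from being masked by stronger ones.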
Related papers
- SemioLLM: Assessing Large Language Models for Semiological Analysis in Epilepsy Research [45.2233252981348]
Large Language Models have shown promising results in their ability to encode general medical knowledge.
We test the ability of state-of-the-art LLMs to leverage their internal knowledge and reasoning for epilepsy diagnosis.
arXiv Detail & Related papers (2024-07-03T11:02:12Z)
- CliBench: Multifaceted Evaluation of Large Language Models in Clinical Decisions on Diagnoses, Procedures, Lab Tests Orders and Prescriptions [16.310913127940857]
We introduce CliBench, a novel benchmark developed from the MIMIC IV dataset.
This benchmark offers a comprehensive and realistic assessment of LLMs' capabilities in clinical diagnosis.
We conduct a zero-shot evaluation of leading LLMs to assess their proficiency in clinical decision-making.
arXiv Detail & Related papers (2024-06-14T11:10:17Z)
- CLUE: A Clinical Language Understanding Evaluation for LLMs [2.3814275542331385]
Large Language Models (LLMs) are expected to significantly contribute to patient care, diagnostics, and administrative processes.
Assessing the models' suitability for this sensitive application area is of utmost importance.
We present the Clinical Language Understanding Evaluation (CLUE), a benchmark tailored to evaluate LLMs on clinical tasks.
arXiv Detail & Related papers (2024-04-05T12:51:37Z)
- EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries [9.031182965159976]
Large Language Models (LLMs) show promise in efficiently analyzing vast and complex data.
We introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries.
EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.
arXiv Detail & Related papers (2024-02-25T09:41:50Z)
- Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs).
Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities.
We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization [8.456700096020601]
Large language models (LLMs) have shown promise in natural language processing (NLP), but their effectiveness on a diverse range of clinical summarization tasks remains unproven.
In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks.
A clinical reader study with ten physicians evaluates summary completeness, correctness, and conciseness; in a majority of cases, summaries from our best adapted LLMs are either equivalent (45%) or superior (36%) compared to summaries from medical experts.
arXiv Detail & Related papers (2023-09-14T05:15:01Z)
- MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency.
Evaluating LLMs on realistic text generation tasks for healthcare remains challenging.
We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM).
Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.