MedAlign: A Clinician-Generated Dataset for Instruction Following with
Electronic Medical Records
- URL: http://arxiv.org/abs/2308.14089v2
- Date: Sun, 24 Dec 2023 09:12:06 GMT
- Authors: Scott L. Fleming, Alejandro Lozano, William J. Haberkorn, Jenelle A.
Jindal, Eduardo P. Reis, Rahul Thapa, Louis Blankemeier, Julian Z. Genkins,
Ethan Steinberg, Ashwin Nayak, Birju S. Patel, Chia-Chun Chiang, Alison
Callahan, Zepeng Huo, Sergios Gatidis, Scott J. Adams, Oluseyi Fayanju,
Shreya J. Shah, Thomas Savage, Ethan Goh, Akshay S. Chaudhari, Nima
Aghaeepour, Christopher Sharp, Michael A. Pfeffer, Percy Liang, Jonathan H.
Chen, Keith E. Morse, Emma P. Brunskill, Jason A. Fries, Nigam H. Shah
- Abstract summary: Large language models (LLMs) can follow natural language instructions with human-level fluency, yet evaluating LLMs on realistic text generation tasks for healthcare remains challenging. We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability of large language models (LLMs) to follow natural language
instructions with human-level fluency suggests many opportunities in healthcare
to reduce administrative burden and improve quality of care. However,
evaluating LLMs on realistic text generation tasks for healthcare remains
challenging. Existing question answering datasets for electronic health record
(EHR) data fail to capture the complexity of information needs and
documentation burdens experienced by clinicians. To address these challenges,
we introduce MedAlign, a benchmark dataset of 983 natural language instructions
for EHR data. MedAlign is curated by 15 clinicians (7 specialities), includes
clinician-written reference responses for 303 instructions, and provides 276
longitudinal EHRs for grounding instruction-response pairs. We used MedAlign to
evaluate 6 general domain LLMs, having clinicians rank the accuracy and quality
of each LLM response. We found high error rates, ranging from 35% (GPT-4) to
68% (MPT-7B-Instruct), and an 8.3% drop in accuracy moving from 32k to 2k
context lengths for GPT-4. Finally, we report correlations between clinician
rankings and automated natural language generation metrics as a way to rank
LLMs without human review. We make MedAlign available under a research data use
agreement to enable LLM evaluations on tasks aligned with clinician needs and
preferences.
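
The abstract's final claim, that automated natural language generation metrics correlated with clinician rankings can rank LLMs without human review, reduces to a rank-correlation computation per metric. A minimal sketch in Python, assuming hypothetical per-model clinician ranks and mean metric scores (the model names and numbers below are illustrative placeholders, not results from the paper):

    # Check whether an automated NLG metric reproduces clinician model rankings.
    # All model names, ranks, and scores below are hypothetical placeholders.
    from scipy.stats import kendalltau, spearmanr

    # Clinician rank per model (1 = best) and mean automated metric score
    # (higher = better), each aggregated over a shared set of instructions.
    clinician_rank = {"gpt-4-32k": 1, "gpt-4-2k": 2,
                      "vicuna-13b": 3, "mpt-7b-instruct": 4}
    metric_score = {"gpt-4-32k": 0.61, "gpt-4-2k": 0.55,
                    "vicuna-13b": 0.48, "mpt-7b-instruct": 0.31}

    models = sorted(clinician_rank)
    ranks = [clinician_rank[m] for m in models]
    scores = [metric_score[m] for m in models]

    # Negate scores so both sequences order models best-first; a correlation
    # near 1 suggests the metric can stand in for human review when ranking LLMs.
    tau, tau_p = kendalltau(ranks, [-s for s in scores])
    rho, rho_p = spearmanr(ranks, [-s for s in scores])
    print(f"Kendall tau={tau:.2f} (p={tau_p:.3f}), "
          f"Spearman rho={rho:.2f} (p={rho_p:.3f})")

In the paper's setting, each automated metric (e.g., a BERTScore-style similarity) would first be computed against the clinician-written reference responses before aggregating per model; whichever metric correlates best with clinician rankings is the cheapest proxy for human review.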
Related papers
- Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
The companion instruction-tuning dataset, MedS-Ins, comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
- GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark to date, with a well-categorized data structure and multiple perceptual granularities.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
- LongHealth: A Question Answering Benchmark with Long Clinical Documents [36.05587855811346]
We present the LongHealth benchmark, comprising 20 detailed fictional patient cases across various diseases.
The benchmark challenges LLMs with 400 multiple-choice questions in three categories: information extraction, negation, and sorting.
We evaluated nine open-source LLMs with context lengths of at least 16,000 tokens and also included OpenAI's proprietary and cost-efficient GPT-3.5 Turbo for comparison.
arXiv Detail & Related papers (2024-01-25T19:57:00Z)
- LLM on FHIR -- Demystifying Health Records [0.32985979395737786]
This study developed an app allowing users to interact with their health records using large language models (LLMs).
The app effectively translated medical data into patient-friendly language and was able to adapt its responses to different patient profiles.
arXiv Detail & Related papers (2024-01-25T17:45:34Z)
- ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for the Chinese medical domain.
ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF.
We analyze possible biases by prompting ChiMed-GPT to complete attitude scales regarding discrimination against patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z)
- Adapted Large Language Models Can Outperform Medical Experts in Clinical Text Summarization [8.456700096020601]
Large language models (LLMs) have shown promise in natural language processing (NLP), but their effectiveness on a diverse range of clinical summarization tasks remains unproven.
In this study, we apply adaptation methods to eight LLMs, spanning four distinct clinical summarization tasks.
A clinical reader study with ten physicians evaluates summary completeness, correctness, and conciseness; in a majority of cases, summaries from our best adapted LLMs are either equivalent (45%) or superior (36%) to summaries from medical experts.
arXiv Detail & Related papers (2023-09-14T05:15:01Z)
- Augmenting Black-box LLMs with Medical Textbooks for Biomedical Question Answering (Published in Findings of EMNLP 2024) [48.17095875619711]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks are a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
- Large Language Models Leverage External Knowledge to Extend Clinical Insight Beyond Language Boundaries [48.48630043740588]
Large Language Models (LLMs) such as ChatGPT and Med-PaLM have excelled in various medical question-answering tasks.
We develop a novel in-context learning framework to enhance their performance.
arXiv Detail & Related papers (2023-05-17T12:31:26Z)
- Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)