K-QA: A Real-World Medical Q&A Benchmark
- URL: http://arxiv.org/abs/2401.14493v1
- Date: Thu, 25 Jan 2024 20:11:04 GMT
- Title: K-QA: A Real-World Medical Q&A Benchmark
- Authors: Itay Manes, Naama Ronn, David Cohen, Ran Ilan Ber, Zehavi Horowitz-Kugler, Gabriel Stanovsky
- Abstract summary: We construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health.
We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements.
We evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes.
- Score: 12.636564634626422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensuring the accuracy of responses provided by large language models (LLMs)
is crucial, particularly in clinical settings where incorrect information may
directly impact patient health. To address this challenge, we construct K-QA, a
dataset containing 1,212 patient questions originating from real-world
conversations held on K Health (an AI-driven clinical platform). We employ a
panel of in-house physicians to answer and manually decompose a subset of K-QA
into self-contained statements. Additionally, we formulate two NLI-based
evaluation metrics approximating recall and precision: (1) comprehensiveness,
measuring the percentage of essential clinical information in the generated
answer and (2) hallucination rate, measuring the number of statements from the
physician-curated response contradicted by the LLM answer. Finally, we use K-QA
along with these metrics to evaluate several state-of-the-art models, as well
as the effect of in-context learning and medically-oriented augmented retrieval
schemes developed by the authors. Our findings indicate that in-context
learning improves the comprehensiveness of the models, and augmented retrieval
is effective in reducing hallucinations. We make K-QA available to the
community to spur research into medically accurate NLP applications.
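The two metrics described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Statement` class, the `essential` flag, the label names, and the lookup-table judge are all assumptions made for the example, and a real setup would replace `table_judge` with an actual NLI model. The sample answer and statements are made up and do not come from K-QA.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical label names for an NLI judge (assumed, not from the paper).
ENTAILS, CONTRADICTS, NEUTRAL = "entails", "contradicts", "neutral"


@dataclass(frozen=True)
class Statement:
    text: str
    essential: bool  # assumed flag: is this essential clinical information?


# A judge takes (answer, statement) and returns one of the labels above.
Judge = Callable[[str, str], str]


def comprehensiveness(answer: str, statements: List[Statement], judge: Judge) -> float:
    """Fraction of essential statements that the answer entails (recall-like)."""
    essential = [s for s in statements if s.essential]
    if not essential:
        return 0.0
    entailed = sum(1 for s in essential if judge(answer, s.text) == ENTAILS)
    return entailed / len(essential)


def hallucination_count(answer: str, statements: List[Statement], judge: Judge) -> int:
    """Number of physician statements that the answer contradicts (precision-like)."""
    return sum(1 for s in statements if judge(answer, s.text) == CONTRADICTS)


def table_judge(labels: Dict[Tuple[str, str], str]) -> Judge:
    """Stand-in judge backed by a lookup table; NOT a real NLI model."""
    def judge(answer: str, statement: str) -> str:
        return labels.get((answer, statement), NEUTRAL)
    return judge


# Made-up example data for illustration only.
answer = "Take ibuprofen with food; it is fine in late pregnancy."
statements = [
    Statement("Ibuprofen should be taken with food.", essential=True),
    Statement("Ibuprofen should be avoided in late pregnancy.", essential=True),
    Statement("Ibuprofen is an NSAID.", essential=False),
]
judge = table_judge({
    (answer, statements[0].text): ENTAILS,
    (answer, statements[1].text): CONTRADICTS,
})

print(comprehensiveness(answer, statements, judge))    # 0.5
print(hallucination_count(answer, statements, judge))  # 1
```

Passing the judge in as a function keeps the metric logic independent of whichever NLI model or LLM prompt is used to label each (answer, statement) pair.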
Related papers
- EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries [9.031182965159976]
Large Language Models (LLMs) show promise in efficiently analyzing vast and complex data.
We introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries.
EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.
arXiv Detail & Related papers (2024-02-25T09:41:50Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- Zero-Shot Clinical Trial Patient Matching with LLMs [40.31971412825736]
Large language models (LLMs) offer a promising solution to automated screening.
We design an LLM-based system which, given a patient's medical history as unstructured clinical text, evaluates whether that patient meets a set of inclusion criteria.
Our system achieves state-of-the-art scores on the n2c2 2018 cohort selection benchmark.
arXiv Detail & Related papers (2024-02-05T00:06:08Z)
- Self-Verification Improves Few-Shot Clinical Information Extraction [73.6905567014859]
Large language models (LLMs) have shown the potential to accelerate clinical curation via few-shot in-context learning.
They still struggle with issues regarding accuracy and interpretability, especially in mission-critical domains such as health.
Here, we explore a general mitigation framework using self-verification, which leverages the LLM to provide provenance for its own extraction and check its own outputs.
arXiv Detail & Related papers (2023-05-30T22:05:11Z)
- Question-Answering System Extracts Information on Injection Drug Use from Clinical Notes [4.537953996010351]
Injection drug use (IDU) is a dangerous health behavior that increases mortality and morbidity.
The only place IDU information can be indicated is unstructured free-text clinical notes.
We design and demonstrate a question-answering (QA) framework to extract information on IDU from clinical notes.
arXiv Detail & Related papers (2023-05-15T16:37:00Z)
- SPeC: A Soft Prompt-Based Calibration on Performance Variability of Large Language Model in Clinical Notes Summarization [50.01382938451978]
We introduce a model-agnostic pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization.
Experimental findings indicate that our method not only bolsters performance but also effectively curbs variance for various language models.
arXiv Detail & Related papers (2023-03-23T04:47:46Z)
- Informing clinical assessment by contextualizing post-hoc explanations of risk prediction models in type-2 diabetes [50.8044927215346]
We consider a comorbidity risk prediction scenario and focus on contexts regarding the patients' clinical state.
We employ several state-of-the-art LLMs to present contexts around risk prediction model inferences and evaluate their acceptability.
Our paper is one of the first end-to-end analyses identifying the feasibility and benefits of contextual explanations in a real-world clinical use case.
arXiv Detail & Related papers (2023-02-11T18:07:11Z)
- Large Language Models Encode Clinical Knowledge [21.630872464930587]
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
arXiv Detail & Related papers (2022-12-26T14:28:24Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BertScore.
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
- Attention-based Aspect Reasoning for Knowledge Base Question Answering on Clinical Notes [12.831807443341214]
We aim to create a knowledge base from clinical notes that links different patients and clinical notes, and to perform knowledge base question answering (KBQA).
Based on the expert annotations in n2c2, we first created the ClinicalKBQA dataset that includes 8,952 QA pairs and covers questions about seven medical topics through 322 question templates.
We propose an attention-based aspect reasoning (AAR) method for KBQA and investigated the impact of different aspects of answers for prediction.
arXiv Detail & Related papers (2021-08-01T17:58:46Z)
- Benchmarking Automated Clinical Language Simplification: Dataset, Algorithm, and Evaluation [48.87254340298189]
We construct a new dataset named MedLane to support the development and evaluation of automated clinical language simplification approaches.
We propose a new model called DECLARE that follows the human annotation procedure and achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-04T06:09:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.