K-QA: A Real-World Medical Q&A Benchmark
- URL: http://arxiv.org/abs/2401.14493v1
- Date: Thu, 25 Jan 2024 20:11:04 GMT
- Title: K-QA: A Real-World Medical Q&A Benchmark
- Authors: Itay Manes, Naama Ronn, David Cohen, Ran Ilan Ber, Zehavi
Horowitz-Kugler, Gabriel Stanovsky
- Abstract summary: We construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health.
We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements.
We evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes.
- Score: 12.636564634626422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensuring the accuracy of responses provided by large language models (LLMs)
is crucial, particularly in clinical settings where incorrect information may
directly impact patient health. To address this challenge, we construct K-QA, a
dataset containing 1,212 patient questions originating from real-world
conversations held on K Health (an AI-driven clinical platform). We employ a
panel of in-house physicians to answer and manually decompose a subset of K-QA
into self-contained statements. Additionally, we formulate two NLI-based
evaluation metrics approximating recall and precision: (1) comprehensiveness,
measuring the percentage of essential clinical information in the generated
answer and (2) hallucination rate, measuring the number of statements from the
physician-curated response contradicted by the LLM answer. Finally, we use K-QA
along with these metrics to evaluate several state-of-the-art models, as well
as the effect of in-context learning and medically-oriented augmented retrieval
schemes developed by the authors. Our findings indicate that in-context
learning improves the comprehensiveness of the models, and augmented retrieval
is effective in reducing hallucinations. We make K-QA available to the
community to spur research into medically accurate NLP applications.
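To make the two metrics concrete, the following is a minimal sketch of how they could be computed from an off-the-shelf NLI classifier; the `nli` callable, the argument names, and the normalization are illustrative assumptions, not the authors' implementation.

```python
from typing import Callable, List

# Hypothetical NLI interface: nli(premise, hypothesis) -> "entailment" | "contradiction" | "neutral".
# Any off-the-shelf NLI model could back this callable; it is not the K-QA codebase.

def comprehensiveness(answer: str,
                      must_have: List[str],
                      nli: Callable[[str, str], str]) -> float:
    """Fraction of essential physician-curated statements entailed by the model answer (recall-like)."""
    if not must_have:
        return 1.0
    entailed = sum(1 for s in must_have if nli(answer, s) == "entailment")
    return entailed / len(must_have)

def hallucination_count(answer: str,
                        statements: List[str],
                        nli: Callable[[str, str], str]) -> int:
    """Number of physician-curated statements contradicted by the model answer (precision-like)."""
    return sum(1 for s in statements if nli(answer, s) == "contradiction")
```

Read this way, comprehensiveness acts like recall over the must-have statements, while the hallucination count penalizes answers that contradict the physician-curated response.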
Related papers
- Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization [2.380499804323775]
We introduce a Multi-Task Learning framework that jointly trains CQA models for both answer extraction and medical categorization.
In addition to predicting answer spans, our model classifies responses into five standardized medical categories: Diagnosis, Medication, Symptoms, Procedure, and Lab Reports.
Results show that MTL improves F1-score by 2.2% compared to standard fine-tuning, while achieving 90.7% accuracy in answer categorization.
arXiv Detail & Related papers (2025-02-18T18:20:37Z)
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.
We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.
Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
- Give me Some Hard Questions: Synthetic Data Generation for Clinical QA [13.436187152293515]
This paper explores generating Clinical QA data using large language models (LLMs) in a zero-shot setting.
We find that naive prompting often results in easy questions that do not reflect the complexity of clinical scenarios.
Experiments on two Clinical QA datasets demonstrate that our method generates more challenging questions, significantly improving fine-tuning performance over baselines.
arXiv Detail & Related papers (2024-12-05T19:35:41Z)
- Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language model (LLM) reasoning.
Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
arXiv Detail & Related papers (2024-10-06T18:46:28Z)
- RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions [3.182594503527438]
We present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM.
We show that the LLM is more cost-efficient for generating "ideal" QA pairs.
arXiv Detail & Related papers (2024-08-16T09:32:43Z)
- Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark ClinicBench to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z)
- EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries [9.031182965159976]
Large Language Models (LLMs) show promise in efficiently analyzing vast and complex data.
We introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 QA pairs, each linked to a distinct patient's discharge summaries.
EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.
arXiv Detail & Related papers (2024-02-25T09:41:50Z)
- Zero-Shot Clinical Trial Patient Matching with LLMs [40.31971412825736]
Large language models (LLMs) offer a promising solution to automated screening.
We design an LLM-based system which, given a patient's medical history as unstructured clinical text, evaluates whether that patient meets a set of inclusion criteria.
Our system achieves state-of-the-art scores on the n2c2 2018 cohort selection benchmark.
arXiv Detail & Related papers (2024-02-05T00:06:08Z)
- SPeC: A Soft Prompt-Based Calibration on Performance Variability of Large Language Model in Clinical Notes Summarization [50.01382938451978]
We introduce a model-agnostic pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization.
Experimental findings indicate that our method not only bolsters performance but also effectively curbs variance for various language models.
arXiv Detail & Related papers (2023-03-23T04:47:46Z)
- Informing clinical assessment by contextualizing post-hoc explanations of risk prediction models in type-2 diabetes [50.8044927215346]
We consider a comorbidity risk prediction scenario and focus on contexts regarding the patients' clinical state.
We employ several state-of-the-art LLMs to present contexts around risk prediction model inferences and evaluate their acceptability.
Our paper is one of the first end-to-end analyses identifying the feasibility and benefits of contextual explanations in a real-world clinical use case.
arXiv Detail & Related papers (2023-02-11T18:07:11Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BertScore (a minimal sketch of such a distance appears after this list).
arXiv Detail & Related papers (2022-04-01T14:04:16Z)
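For reference, the character-based Levenshtein distance mentioned in the last entry above can be sketched as follows; the normalized-similarity wrapper is an assumption for illustration, not necessarily the exact metric used in that study.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via dynamic programming (two-row variant)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(hyp: str, ref: str) -> float:
    """Normalized similarity in [0, 1]; the normalization choice is an assumption."""
    if not hyp and not ref:
        return 1.0
    return 1.0 - levenshtein(hyp, ref) / max(len(hyp), len(ref))
```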
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.