K-QA: A Real-World Medical Q&A Benchmark
- URL: http://arxiv.org/abs/2401.14493v1
- Date: Thu, 25 Jan 2024 20:11:04 GMT
- Title: K-QA: A Real-World Medical Q&A Benchmark
- Authors: Itay Manes, Naama Ronn, David Cohen, Ran Ilan Ber, Zehavi
  Horowitz-Kugler, Gabriel Stanovsky
- Abstract summary: We construct K-QA, a dataset containing 1,212 patient questions originating from real-world conversations held on K Health.
We employ a panel of in-house physicians to answer and manually decompose a subset of K-QA into self-contained statements.
We evaluate several state-of-the-art models, as well as the effect of in-context learning and medically-oriented augmented retrieval schemes.
- Score: 12.636564634626422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   Ensuring the accuracy of responses provided by large language models (LLMs)
is crucial, particularly in clinical settings where incorrect information may
directly impact patient health. To address this challenge, we construct K-QA, a
dataset containing 1,212 patient questions originating from real-world
conversations held on K Health (an AI-driven clinical platform). We employ a
panel of in-house physicians to answer and manually decompose a subset of K-QA
into self-contained statements. Additionally, we formulate two NLI-based
evaluation metrics approximating recall and precision: (1) comprehensiveness,
measuring the percentage of essential clinical information in the generated
answer and (2) hallucination rate, measuring the number of statements from the
physician-curated response contradicted by the LLM answer. Finally, we use K-QA
along with these metrics to evaluate several state-of-the-art models, as well
as the effect of in-context learning and medically-oriented augmented retrieval
schemes developed by the authors. Our findings indicate that in-context
learning improves the comprehensiveness of the models, and augmented retrieval
is effective in reducing hallucinations. We make K-QA available to the
community to spur research into medically accurate NLP applications.
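
The two metrics described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Statement` class, the `must_have` flag, the label strings, and the `toy_nli` stub are all hypothetical names introduced here, and a real setup would replace `toy_nli` with an actual NLI model. The sketch assumes comprehensiveness is the fraction of essential physician statements entailed by the generated answer, and that hallucination is counted as the number of physician statements the answer contradicts.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumed signature of an NLI judge: (premise, hypothesis) -> label,
# where label is "entailment", "contradiction", or "neutral".
NliFn = Callable[[str, str], str]


@dataclass
class Statement:
    """One self-contained statement from a physician-written answer."""
    text: str
    must_have: bool  # hypothetical flag marking essential clinical information


def comprehensiveness(answer: str, statements: List[Statement], nli: NliFn) -> float:
    """Fraction of essential statements that the generated answer entails."""
    essential = [s for s in statements if s.must_have]
    if not essential:
        return 0.0
    entailed = sum(1 for s in essential if nli(answer, s.text) == "entailment")
    return entailed / len(essential)


def hallucination_count(answer: str, statements: List[Statement], nli: NliFn) -> int:
    """Number of physician statements that the generated answer contradicts."""
    return sum(1 for s in statements if nli(answer, s.text) == "contradiction")


def toy_nli(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in for a real NLI model, with hard-coded labels."""
    labels = {"S1": "entailment", "S2": "neutral", "S3": "contradiction"}
    return labels.get(hypothesis, "neutral")
```

Under these assumptions, an answer scored against two essential statements (one entailed, one neutral) and one contradicted non-essential statement would receive a comprehensiveness of 0.5 and a hallucination count of 1.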

Related papers
- A Dataset for Addressing Patient's Information Needs related to Clinical Course of Hospitalization [15.837772594006038]
 ArchEHR-QA is an expert-annotated dataset based on real-world patient cases from intensive care unit and emergency department settings.
Cases comprise questions posed by patients to public health forums, clinician-interpreted counterparts, relevant clinical note excerpts with sentence-level relevance annotations, and clinician-authored answers.
The answer-first prompting approach consistently performed best, with Llama 4 achieving the highest scores.
 arXiv  Detail & Related papers  (2025-06-04T16:55:08Z)
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
 Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
 arXiv  Detail & Related papers  (2025-03-05T05:24:55Z)
- EchoQA: A Large Collection of Instruction Tuning Data for Echocardiogram Reports [0.0]
 We introduce a novel question-answering (QA) dataset using echocardiogram reports sourced from the Medical Information Mart for Intensive Care database.
This dataset is specifically designed to enhance QA systems in cardiology, consisting of 771,244 QA pairs addressing a wide array of cardiac abnormalities and their severity.
We compare large language models (LLMs), including open-source and biomedical-specific models for zero-shot evaluation, and closed-source models for zero-shot and three-shot evaluation.
 arXiv  Detail & Related papers  (2025-03-04T07:45:45Z)
- Clinical QA 2.0: Multi-Task Learning for Answer Extraction and Categorization [2.380499804323775]
 We introduce a Multi-Task Learning framework that jointly trains CQA models for both answer extraction and medical categorization.
In addition to predicting answer spans, our model classifies responses into five standardized medical categories: Diagnosis, Medication, Symptoms, Procedure, and Lab Reports.
Results show that MTL improves F1-score by 2.2% compared to standard fine-tuning, while achieving 90.7% accuracy in answer categorization.
 arXiv  Detail & Related papers  (2025-02-18T18:20:37Z)
- LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
 This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.
We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.
Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
 arXiv  Detail & Related papers  (2025-01-07T08:49:04Z)
- Give me Some Hard Questions: Synthetic Data Generation for Clinical QA [13.436187152293515]
 This paper explores generating Clinical QA data using large language models (LLMs) in a zero-shot setting.
We find that naive prompting often results in easy questions that do not reflect the complexity of clinical scenarios.
Experiments on two Clinical QA datasets demonstrate that our method generates more challenging questions, significantly improving fine-tuning performance over baselines.
 arXiv  Detail & Related papers  (2024-12-05T19:35:41Z)
- Reasoning-Enhanced Healthcare Predictions with Knowledge Graph Community Retrieval [61.70489848327436]
 KARE is a novel framework that integrates knowledge graph (KG) community-level retrieval with large language models (LLMs) reasoning.
Extensive experiments demonstrate that KARE outperforms leading models by up to 10.8-15.0% on MIMIC-III and 12.6-12.7% on MIMIC-IV for mortality and readmission predictions.
 arXiv  Detail & Related papers  (2024-10-06T18:46:28Z)
- RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions [3.182594503527438]
 We present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM.
We show that the LLM is more cost-efficient for generating "ideal" QA pairs.
 arXiv  Detail & Related papers  (2024-08-16T09:32:43Z)
- Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
 We build a benchmark ClinicBench to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
 arXiv  Detail & Related papers  (2024-04-25T15:51:06Z)
- EHRNoteQA: An LLM Benchmark for Real-World Clinical Practice Using Discharge Summaries [9.031182965159976]
 Large Language Models (LLMs) show promise in efficiently analyzing vast and complex data.
We introduce EHRNoteQA, a novel benchmark built on the MIMIC-IV EHR, comprising 962 different QA pairs each linked to distinct patients' discharge summaries.
 EHRNoteQA includes questions that require information across multiple discharge summaries and covers eight diverse topics, mirroring the complexity and diversity of real clinical inquiries.
 arXiv  Detail & Related papers  (2024-02-25T09:41:50Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
 We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
 arXiv  Detail & Related papers  (2024-02-15T06:46:48Z)
- Zero-Shot Clinical Trial Patient Matching with LLMs [40.31971412825736]
 Large language models (LLMs) offer a promising solution to automated screening.
We design an LLM-based system which, given a patient's medical history as unstructured clinical text, evaluates whether that patient meets a set of inclusion criteria.
Our system achieves state-of-the-art scores on the n2c2 2018 cohort selection benchmark.
 arXiv  Detail & Related papers  (2024-02-05T00:06:08Z)
- SPeC: A Soft Prompt-Based Calibration on Performance Variability of Large Language Model in Clinical Notes Summarization [50.01382938451978]
 We introduce a model-agnostic pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization.
 Experimental findings indicate that our method not only bolsters performance but also effectively curbs variance for various language models.
 arXiv  Detail & Related papers  (2023-03-23T04:47:46Z)
- Informing clinical assessment by contextualizing post-hoc explanations of risk prediction models in type-2 diabetes [50.8044927215346]
 We consider a comorbidity risk prediction scenario and focus on contexts regarding the patients' clinical state.
We employ several state-of-the-art LLMs to present contexts around risk prediction model inferences and evaluate their acceptability.
Our paper is one of the first end-to-end analyses identifying the feasibility and benefits of contextual explanations in a real-world clinical use case.
 arXiv  Detail & Related papers  (2023-02-11T18:07:11Z)
- Large Language Models Encode Clinical Knowledge [21.630872464930587]
 Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation.
We propose a framework for human evaluation of model answers along multiple axes including factuality, precision, possible harm, and bias.
We show that comprehension, recall of knowledge, and medical reasoning improve with model scale and instruction prompt tuning.
 arXiv  Detail & Related papers  (2022-12-26T14:28:24Z)
- Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation [56.25869366777579]
 In recent years, machine learning models have rapidly become better at generating clinical consultation notes.
We present an extensive human evaluation study where 5 clinicians listen to 57 mock consultations, write their own notes, post-edit a number of automatically generated notes, and extract all the errors.
We find that a simple, character-based Levenshtein distance metric performs on par if not better than common model-based metrics like BertScore.
 arXiv  Detail & Related papers  (2022-04-01T14:04:16Z)
- Attention-based Aspect Reasoning for Knowledge Base Question Answering on Clinical Notes [12.831807443341214]
 We aim to create a knowledge base from clinical notes to link different patients and clinical notes, and to perform knowledge base question answering (KBQA).
Based on the expert annotations in n2c2, we first created the ClinicalKBQA dataset that includes 8,952 QA pairs and covers questions about seven medical topics through 322 question templates.
We propose an attention-based aspect reasoning (AAR) method for KBQA and investigated the impact of different aspects of answers for prediction.
 arXiv  Detail & Related papers  (2021-08-01T17:58:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.