Question answering systems for health professionals at the point of care
-- a systematic review
- URL: http://arxiv.org/abs/2402.01700v1
- Date: Wed, 24 Jan 2024 13:47:39 GMT
- Title: Question answering systems for health professionals at the point of care
-- a systematic review
- Authors: Gregory Kell, Angus Roberts, Serge Umansky, Linglong Qian, Davide
Ferrari, Frank Soboczenski, Byron Wallace, Nikhil Patel, Iain J Marshall
- Abstract summary: Question answering (QA) systems have the potential to improve the quality of clinical care by providing health professionals with the latest and most relevant evidence.
This systematic review aims to characterize current medical QA systems, assess their suitability for healthcare, and identify areas of improvement.
- Score: 2.446313557261822
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Objective: Question answering (QA) systems have the potential to improve the
quality of clinical care by providing health professionals with the latest and
most relevant evidence. However, QA systems have not been widely adopted. This
systematic review aims to characterize current medical QA systems, assess their
suitability for healthcare, and identify areas of improvement.
Materials and methods: We searched PubMed, IEEE Xplore, ACM Digital Library,
ACL Anthology and forward and backward citations on 7th February 2023. We
included peer-reviewed journal and conference papers describing the design and
evaluation of biomedical QA systems. Two reviewers screened titles, abstracts,
and full-text articles. We conducted a narrative synthesis and risk of bias
assessment for each study. We assessed the utility of biomedical QA systems.
Results: We included 79 studies and identified themes, including question
realism, answer reliability, answer utility, clinical specialism, systems,
usability, and evaluation methods. Clinicians' questions used to train and
evaluate QA systems were restricted to certain sources, types and complexity
levels. No system communicated confidence levels in the answers or sources.
Many studies suffered from high risks of bias and applicability concerns. Only
8 studies completely satisfied any criterion for clinical utility, and only 7
reported user evaluations. Most systems were built with limited input from
clinicians.
Discussion: While machine learning methods have led to increased accuracy,
most studies imperfectly reflected real-world healthcare information needs. Key
research priorities include developing more realistic healthcare QA datasets
and considering the reliability of answer sources, rather than merely focusing
on accuracy.
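A key finding above is that no reviewed system communicated confidence levels in its answers or sources. Purely as an illustration of that gap (not drawn from any reviewed system; all names here are hypothetical), a point-of-care answer object could carry a calibrated confidence score and explicit source citations alongside the answer text:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SourceCitation:
    """One evidence source backing an answer."""
    title: str
    url: str
    evidence_level: str  # e.g. "systematic review", "RCT", "expert opinion"

@dataclass
class ClinicalAnswer:
    """An answer that exposes its confidence and provenance to the clinician."""
    question: str
    answer: str
    confidence: float  # calibrated probability in [0, 1]
    sources: List[SourceCitation] = field(default_factory=list)

    def render(self) -> str:
        cites = "; ".join(f"{s.title} ({s.evidence_level})" for s in self.sources)
        return f"{self.answer}\n[confidence: {self.confidence:.0%}] Sources: {cites}"
```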
Related papers
- Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.
Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.
We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z) - Bias Evaluation and Mitigation in Retrieval-Augmented Medical Question-Answering Systems [4.031787614742573]
This study systematically evaluates demographic biases within medical RAG pipelines across multiple QA benchmarks.
We implement and compare several bias mitigation strategies to address the identified biases, including Chain of Thought reasoning, Counterfactual filtering, Adversarial prompt refinement, and Majority Vote aggregation (a toy sketch of the majority-vote idea appears after this list).
arXiv Detail & Related papers (2025-03-19T17:36:35Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.
We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.
Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - Uncertainty-aware abstention in medical diagnosis based on medical texts [87.88110503208016]
This study addresses the critical issue of reliability for AI-assisted medical diagnosis.
We focus on the selective prediction approach, which allows the diagnosis system to abstain from providing a decision when it is not confident in the diagnosis (a generic sketch of this abstain-below-threshold pattern appears after this list).
We introduce HUQ-2, a new state-of-the-art method for enhancing reliability in selective prediction tasks.
arXiv Detail & Related papers (2025-02-25T10:15:21Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs).
We introduce the Medical Retrieval-Augmented Generation Benchmark (MedRGB), which provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents (a minimal retrieve-then-generate sketch appears after this list).
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - RealMedQA: A pilot biomedical question answering dataset containing realistic clinical questions [3.182594503527438]
We present RealMedQA, a dataset of realistic clinical questions generated by humans and an LLM.
We show that the LLM is more cost-efficient for generating "ideal" QA pairs.
arXiv Detail & Related papers (2024-08-16T09:32:43Z) - Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark, ClinicBench, to better understand large language models (LLMs) in the clinic.
We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks.
We then construct six novel datasets and clinical tasks that are complex but common in real-world practice.
We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z) - A survey of recent methods for addressing AI fairness and bias in
biomedicine [48.46929081146017]
Artificial intelligence systems may perpetuate social inequities or demonstrate biases, such as those based on race or gender.
We surveyed recent publications on different debiasing methods in the fields of biomedical natural language processing (NLP) and computer vision (CV).
We performed a literature search on PubMed, ACM digital library, and IEEE Xplore of relevant articles published between January 2018 and December 2023 using multiple combinations of keywords.
We reviewed other potential methods from the general domain that could be applied to biomedicine to address bias and improve fairness.
arXiv Detail & Related papers (2024-02-13T06:38:46Z) - Designing Interpretable ML System to Enhance Trust in Healthcare: A Systematic Review to Proposed Responsible Clinician-AI-Collaboration Framework [13.215318138576713]
The paper reviews interpretable AI processes, methods, applications, and the challenges of implementation in healthcare.
It aims to foster a comprehensive understanding of the crucial role of a robust interpretability approach in healthcare.
arXiv Detail & Related papers (2023-11-18T12:29:18Z) - Emulating Human Cognitive Processes for Expert-Level Medical
Question-Answering with Large Language Models [0.23463422965432823]
BooksMed is a novel framework based on a Large Language Model (LLM).
It emulates human cognitive processes to deliver evidence-based and reliable responses.
We present ExpertMedQA, a benchmark comprised of open-ended, expert-level clinical questions.
arXiv Detail & Related papers (2023-10-17T13:39:26Z) - Medical Question Understanding and Answering with Knowledge Grounding
and Semantic Self-Supervision [53.692793122749414]
We introduce a medical question understanding and answering system with knowledge grounding and semantic self-supervision.
Our system is a pipeline that first summarizes a long, medical, user-written question using a supervised summarization loss.
It then matches the summarized question with an FAQ from a trusted medical knowledge base and retrieves a fixed number of relevant sentences from the corresponding answer document (a toy version of this pipeline is sketched after this list).
arXiv Detail & Related papers (2022-09-30T08:20:32Z) - What Would it Take to get Biomedical QA Systems into Practice? [21.339520766920092]
Medical question answering (QA) systems have the potential to answer clinicians' uncertainties about treatment and diagnosis on demand.
Despite the significant progress in general QA made by the NLP community, medical QA systems are still not widely used in clinical environments.
arXiv Detail & Related papers (2021-09-21T19:39:42Z) - Image Based Artificial Intelligence in Wound Assessment: A Systematic
Review [0.0]
Assessment of acute and chronic wounds can help wound care teams improve diagnosis, optimize treatment plans, ease their workload, and improve health-related quality of life for the patient population.
While artificial intelligence has found wide application in health-related sciences and technology, AI-based systems for high-quality wound care have yet to be fully developed, clinically and computationally.
arXiv Detail & Related papers (2020-09-15T14:52:14Z) - Interpretable Multi-Step Reasoning with Knowledge Extraction on Complex
Healthcare Question Answering [89.76059961309453]
The HeadQA dataset contains multiple-choice questions authorized for the public healthcare specialization exam.
These questions are the most challenging for current QA systems.
We present MurKe, a Multi-step reasoning with Knowledge extraction framework that strives to make full use of off-the-shelf pre-trained models.
arXiv Detail & Related papers (2020-08-06T02:47:46Z) - Opportunities of a Machine Learning-based Decision Support System for
Stroke Rehabilitation Assessment [64.52563354823711]
Rehabilitation assessment is critical to determine an adequate intervention for a patient.
Current assessment practices rely mainly on a therapist's experience, and assessment is performed infrequently due to the limited availability of therapists.
We developed an intelligent decision support system that can identify salient features of assessment using reinforcement learning.
arXiv Detail & Related papers (2020-02-27T17:04:07Z)
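The "Bias Evaluation and Mitigation in Retrieval-Augmented Medical Question-Answering Systems" entry above lists Majority Vote aggregation among its mitigation strategies. The following is a minimal, illustrative sketch of that idea only, not the paper's implementation: sample answers to the same clinical question several times (for example across counterfactual prompt variants) and keep the most common one.

```python
from collections import Counter
from typing import List

def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among several sampled responses."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical example: three prompt variants of the same question.
print(majority_vote(["metformin", "metformin", "insulin"]))  # -> "metformin"
```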
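The "Uncertainty-aware abstention in medical diagnosis" entry describes selective prediction: answer only when confident, otherwise defer to a clinician. Below is a generic sketch of that pattern with a hypothetical threshold and label set; it is not the HUQ-2 method itself.

```python
from typing import Dict, Optional

def predict_or_abstain(probabilities: Dict[str, float],
                       threshold: float = 0.8) -> Optional[str]:
    """Return the top prediction only if its confidence clears the threshold;
    otherwise return None, signalling that the system abstains."""
    label, confidence = max(probabilities.items(), key=lambda kv: kv[1])
    return label if confidence >= threshold else None

# Hypothetical scores: no class is confident enough, so the system abstains.
print(predict_or_abstain({"pneumonia": 0.55, "bronchitis": 0.35, "other": 0.10}))
```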
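Several entries above, including the MedRGB benchmark, evaluate retrieval-augmented generation (RAG) for medical QA. The skeleton below is a deliberately minimal illustration of the retrieve-then-generate pattern: `embed` and `generate_answer` are hypothetical stand-ins for an embedding model and an LLM, and the ranking is plain cosine similarity rather than any benchmarked retriever.

```python
from typing import Callable, List, Sequence

def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm if norm else 0.0

def retrieve(question: str, corpus: List[str],
             embed: Callable[[str], Sequence[float]], k: int = 3) -> List[str]:
    """Rank corpus passages by embedding similarity to the question."""
    q_vec = embed(question)
    ranked = sorted(corpus, key=lambda doc: cosine(embed(doc), q_vec), reverse=True)
    return ranked[:k]

def answer_with_rag(question: str, corpus: List[str],
                    embed: Callable[[str], Sequence[float]],
                    generate_answer: Callable[[str], str]) -> str:
    """Condition the generator on the retrieved passages only."""
    context = "\n".join(retrieve(question, corpus, embed))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return generate_answer(prompt)
```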
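The "Medical Question Understanding and Answering with Knowledge Grounding and Semantic Self-Supervision" entry describes a summarize-then-match-then-retrieve pipeline. A toy version of those three steps is sketched below; `summarize` is a hypothetical summarizer, and the lexical-overlap matcher merely stands in for the paper's learned semantic matching.

```python
from typing import Callable, Dict, List

def match_faq(summary: str, faq_entries: List[Dict[str, str]]) -> Dict[str, str]:
    """Pick the FAQ whose question shares the most words with the summary."""
    summary_words = set(summary.lower().split())
    return max(faq_entries,
               key=lambda e: len(summary_words & set(e["question"].lower().split())))

def answer_pipeline(long_question: str, summarize: Callable[[str], str],
                    faq_entries: List[Dict[str, str]], n_sentences: int = 3) -> str:
    """Summarize the user question, match a trusted FAQ, and return the first
    few sentences of the matched answer document."""
    summary = summarize(long_question)
    entry = match_faq(summary, faq_entries)
    sentences = entry["answer"].split(". ")
    return ". ".join(sentences[:n_sentences])
```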