Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain
- URL: http://arxiv.org/abs/2403.20288v2
- Date: Mon, 6 May 2024 14:13:51 GMT
- Title: Can LLMs Correct Physicians, Yet? Investigating Effective Interaction Methods in the Medical Domain
- Authors: Burcu Sayin, Pasquale Minervini, Jacopo Staiano, Andrea Passerini
- Abstract summary: Large Language Models (LLMs) can assist and potentially correct physicians in medical decision-making tasks.
We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios.
- Score: 21.96129653695565
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We explore the potential of Large Language Models (LLMs) to assist and potentially correct physicians in medical decision-making tasks. We evaluate several LLMs, including Meditron, Llama2, and Mistral, to analyze the ability of these models to interact effectively with physicians across different scenarios. We consider questions from PubMedQA and several tasks, ranging from binary (yes/no) responses to long answer generation, where the answer of the model is produced after an interaction with a physician. Our findings suggest that prompt design significantly influences the downstream accuracy of LLMs and that LLMs can provide valuable feedback to physicians, challenging incorrect diagnoses and contributing to more accurate decision-making. For example, when the physician is accurate 38% of the time, Mistral can produce the correct answer, improving accuracy up to 74% depending on the prompt being used, while Llama2 and Meditron models exhibit greater sensitivity to prompt choice. Our analysis also uncovers the challenges of ensuring that LLM-generated suggestions are pertinent and useful, emphasizing the need for further research in this area.
Related papers
- LLMs for Doctors: Leveraging Medical LLMs to Assist Doctors, Not Replace Them [41.65016162783525]
We focus on tuning the Large Language Models to be medical assistants who collaborate with more experienced doctors.
We construct a Chinese medical dataset called DoctorFLAN to support the entire workflow of doctors.
We evaluate LLMs in doctor-oriented scenarios by constructing DoctorFLAN-test, containing 550 single-turn Q&A, and DotaBench, containing 74 multi-turn conversations.
arXiv Detail & Related papers (2024-06-26T03:08:24Z)
- MEDIQ: Question-Asking LLMs for Adaptive and Reliable Clinical Reasoning [36.400896909161006]
In high-stakes domains like clinical reasoning, AI assistants powered by large language models (LLMs) are yet to be reliable and safe.
We propose to develop more careful LLMs that ask follow-up questions to gather necessary and sufficient information and respond reliably.
We introduce MEDIQ, a framework to simulate realistic clinical interactions.
arXiv Detail & Related papers (2024-06-03T01:32:52Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM [48.16696073640864]
We introduce OmniMedVQA, a novel comprehensive medical Visual Question Answering (VQA) benchmark.
All images in this benchmark are sourced from authentic medical scenarios.
We have found that existing LVLMs struggle to address these medical VQA problems effectively.
arXiv Detail & Related papers (2024-02-14T13:51:56Z)
- Large Language Model Distilling Medication Recommendation Model [61.89754499292561]
We harness the powerful semantic comprehension and input-agnostic characteristics of Large Language Models (LLMs).
Our research aims to transform existing medication recommendation methodologies using LLMs.
To mitigate this, we have developed a feature-level knowledge distillation technique, which transfers the LLM's proficiency to a more compact model.
arXiv Detail & Related papers (2024-02-05T08:25:22Z)
- A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language.
This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z)
- MKRAG: Medical Knowledge Retrieval Augmented Generation for Medical Question Answering [42.528771319248214]
Large Language Models (LLMs) often perform poorly on domain-specific tasks like medical question answering (QA).
We propose a comprehensive retrieval strategy to extract medical facts from an external knowledge base, and then inject them into the query prompt for LLMs.
Our retrieval-augmented Vicuna-7B model exhibited an accuracy improvement from 44.46% to 48.54%.
arXiv Detail & Related papers (2023-09-27T21:26:03Z)
- Augmenting Black-box LLMs with Medical Textbooks for Clinical Question Answering [54.13933019557655]
We present a system called LLMs Augmented with Medical Textbooks (LLM-AMT).
LLM-AMT integrates authoritative medical textbooks into the LLMs' framework using plug-and-play modules.
We found that medical textbooks are a more effective retrieval corpus than Wikipedia in the medical domain.
arXiv Detail & Related papers (2023-09-05T13:39:38Z)
- An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models [22.409334091186995]
Large language models (LLMs) often suffer from hallucinations, leading to overly confident but incorrect judgments.
This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations.
arXiv Detail & Related papers (2023-09-05T09:24:48Z)
- Diagnostic Reasoning Prompts Reveal the Potential for Large Language Model Interpretability in Medicine [4.773117448586697]
We develop novel diagnostic reasoning prompts to study whether large language models (LLMs) can perform clinical reasoning to accurately form a diagnosis.
We find GPT4 can be prompted to mimic the common clinical reasoning processes of clinicians without sacrificing diagnostic accuracy.
arXiv Detail & Related papers (2023-08-13T19:04:07Z)
- SPeC: A Soft Prompt-Based Calibration on Performance Variability of Large Language Model in Clinical Notes Summarization [50.01382938451978]
We introduce a model-agnostic pipeline that employs soft prompts to diminish variance while preserving the advantages of prompt-based summarization.
Experimental findings indicate that our method not only bolsters performance but also effectively curbs variance for various language models.
arXiv Detail & Related papers (2023-03-23T04:47:46Z)