Combining Insights From Multiple Large Language Models Improves Diagnostic Accuracy
- URL: http://arxiv.org/abs/2402.08806v1
- Date: Tue, 13 Feb 2024 21:24:21 GMT
- Title: Combining Insights From Multiple Large Language Models Improves Diagnostic Accuracy
- Authors: Gioele Barabucci, Victor Shia, Eugene Chu, Benjamin Harack, Nathan Fu
- Abstract summary: Large language models (LLMs) are proposed as viable diagnostic support tools or even spoken of as replacements for "curbside consults".
We assessed and compared the accuracy of differential diagnoses obtained by asking individual commercial LLMs against the accuracy of differential diagnoses synthesized by aggregating responses from combinations of the same LLMs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Large language models (LLMs) such as OpenAI's GPT-4 or Google's
PaLM 2 are proposed as viable diagnostic support tools or even spoken of as
replacements for "curbside consults". However, even LLMs specifically trained
on medical topics may lack sufficient diagnostic accuracy for real-life
applications.
Methods: Using collective intelligence methods and a dataset of 200 clinical
vignettes of real-life cases, we assessed and compared the accuracy of
differential diagnoses obtained by asking individual commercial LLMs (OpenAI
GPT-4, Google PaLM 2, Cohere Command, Meta Llama 2) against the accuracy of
differential diagnoses synthesized by aggregating responses from combinations
of the same LLMs.
Results: We find that aggregating responses from multiple, diverse LLMs leads
to more accurate differential diagnoses (average accuracy for 3 LLMs:
$75.3\%\pm 1.6pp$) compared to the differential diagnoses produced by single
LLMs (average accuracy for single LLMs: $59.0\%\pm 6.1pp$).
Discussion: The use of collective intelligence methods to synthesize
differential diagnoses combining the responses of different LLMs achieves two
of the necessary steps towards advancing acceptance of LLMs as a diagnostic
support tool: (1) demonstrating high diagnostic accuracy and (2) eliminating
dependence on a single commercial vendor.
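A note on mechanics: the abstract does not specify how responses are aggregated, so the following is only a minimal sketch of one standard collective-intelligence approach, a Borda-style positional vote over each model's ranked differential. The function name, the top-k cutoff, and the assumption that diagnosis strings are already normalized to canonical names are illustrative, not taken from the paper.

```python
from collections import defaultdict

def aggregate_differentials(model_ddx_lists, top_k=5):
    """Merge ranked differential-diagnosis lists from several LLMs into one
    consensus differential via a simple Borda-style positional vote."""
    scores = defaultdict(float)
    for ddx in model_ddx_lists:
        n = len(ddx)
        for rank, dx in enumerate(ddx):
            # Higher-ranked diagnoses contribute more weight (1.0 down to 1/n).
            scores[dx] += (n - rank) / n
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical ranked outputs from three models for the same vignette.
gpt4   = ["pulmonary embolism", "pneumonia", "acute coronary syndrome"]
palm2  = ["pneumonia", "pulmonary embolism", "pleuritis"]
llama2 = ["pulmonary embolism", "pleuritis", "pneumonia"]
print(aggregate_differentials([gpt4, palm2, llama2]))
# -> ['pulmonary embolism', 'pneumonia', 'pleuritis', 'acute coronary syndrome']
```

Any order-aware, vendor-agnostic pooling scheme along these lines would serve the two goals the discussion lists.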
Related papers
- Language Models And A Second Opinion Use Case: The Pocket Professional [0.0]
This research tests the role of Large Language Models (LLMs) as formal second opinion tools in professional decision-making.
The work analyzed 183 challenging medical cases from Medscape over a 20-month period, testing multiple LLMs' performance against crowd-sourced physician responses.
arXiv Detail & Related papers (2024-10-27T23:48:47Z)
- MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools.
Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses.
We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z)
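MMed-RAG itself is a full multimodal system; the sketch below only illustrates the generic retrieval-augmented pattern the summary alludes to, where answers are grounded in retrieved reference passages rather than parametric memory alone. `retrieve`, `build_grounded_prompt`, and the embedding inputs are assumptions for illustration, not the paper's API.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec, corpus, k=3):
    """Return the k passages most similar to the query embedding.
    corpus: list of (text, embedding) pairs from a medical knowledge base."""
    ranked = sorted(corpus, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_grounded_prompt(question, image_findings, passages):
    """Prepend retrieved evidence so the Med-LVLM answers against cited
    context, reducing the room for factual hallucination."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Reference material:\n{context}\n\n"
        f"Image findings: {image_findings}\n"
        f"Question: {question}\n"
        "Answer using only the reference material above."
    )
```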
- Embodied Agent Interface: Benchmarking LLMs for Embodied Decision Making [85.24399869971236]
We aim to evaluate Large Language Models (LLMs) for embodied decision making.
Existing evaluations tend to rely solely on a final success rate.
We propose a generalized interface (Embodied Agent Interface) that supports the formalization of various types of tasks.
arXiv Detail & Related papers (2024-10-09T17:59:00Z)
- Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases.
We collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases.
We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models.
Experimental results show that the paper's proposed ReCOP method improves the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z)
- Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints [8.547853819087043]
We evaluate the capability of general LLMs to identify and correct medical errors with multiple prompting strategies.
We propose incorporating error-span predictions from a smaller, fine-tuned model in two ways.
Our best-performing solution, 8-shot + CoT + hints, ranked sixth on the shared task leaderboard.
arXiv Detail & Related papers (2024-05-28T10:20:29Z)
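As a rough illustration of one of the two strategies this summary mentions (surfacing the smaller model's error-span prediction to the LLM as a textual hint), here is a hypothetical prompt builder; the function name and prompt wording are assumptions, not the authors' templates.

```python
def build_correction_prompt(clinical_note, predicted_span=None):
    """Ask an LLM to find and fix a medical error, optionally guided by an
    error-span hint produced by a smaller fine-tuned model."""
    prompt = (
        "The clinical note below may contain one medical error.\n"
        f"Note: {clinical_note}\n"
    )
    if predicted_span:  # hint from the auxiliary error-span model
        prompt += f'Hint: the error is likely within "{predicted_span}".\n'
    prompt += "Identify the error if present and rewrite the corrected sentence."
    return prompt
```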
- XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare [16.79952669254101]
We develop a novel method for zero-shot/few-shot in-context learning (ICL) using a multi-layered structured prompt.
We also explore the efficacy of two communication styles between the user and Large Language Models (LLMs).
Our study systematically evaluates the diagnostic accuracy and risk factors, including gender bias and false negative rates.
arXiv Detail & Related papers (2024-05-10T06:52:44Z)
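The summary does not define the "multi-layered structured prompt"; one plausible reading is a prompt assembled from labeled layers (role, guidelines, structured patient features, optional few-shot exemplars, query). The sketch below is a guess at that layout, with every section name hypothetical.

```python
def build_layered_prompt(role, guidelines, patient_record, exemplars, question):
    """Assemble a layered prompt: system role, clinical guidelines,
    structured patient features, optional exemplars, then the query."""
    layers = [
        f"[Role]\n{role}",
        f"[Guidelines]\n{guidelines}",
        "[Patient]\n" + "\n".join(f"{k}: {v}" for k, v in patient_record.items()),
    ]
    if exemplars:  # zero-shot when empty, few-shot otherwise
        shots = "\n\n".join(f"Case: {x}\nDiagnosis: {y}" for x, y in exemplars)
        layers.append(f"[Examples]\n{shots}")
    layers.append(f"[Task]\n{question}")
    return "\n\n".join(layers)
```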
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor, as the player, and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- Large Language Model Distilling Medication Recommendation Model [61.89754499292561]
We harness the powerful semantic comprehension and input-agnostic characteristics of Large Language Models (LLMs).
Our research aims to transform existing medication recommendation methodologies using LLMs.
To mitigate the high inference cost of using an LLM directly, we developed a feature-level knowledge distillation technique that transfers the LLM's proficiency to a more compact model.
arXiv Detail & Related papers (2024-02-05T08:25:22Z)
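The summary names feature-level knowledge distillation without giving the loss. A common formulation matches the compact student's hidden states to frozen teacher (LLM) representations through a learned projection; here is a minimal PyTorch sketch with all dimensions and names assumed.

```python
import torch
import torch.nn as nn

class FeatureDistillLoss(nn.Module):
    """Feature-level distillation: pull the student's hidden states toward
    the (frozen) LLM teacher's representations of the same patient record."""

    def __init__(self, student_dim, teacher_dim):
        super().__init__()
        # Linear projection bridges the dimensionality gap between models.
        self.proj = nn.Linear(student_dim, teacher_dim)
        self.mse = nn.MSELoss()

    def forward(self, student_feats, teacher_feats):
        return self.mse(self.proj(student_feats), teacher_feats.detach())

# Hypothetical shapes: batch of 8 records, student 256-d, teacher 4096-d.
loss_fn = FeatureDistillLoss(student_dim=256, teacher_dim=4096)
loss = loss_fn(torch.randn(8, 256), torch.randn(8, 4096))
```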
- Evaluating LLM -- Generated Multimodal Diagnosis from Medical Images and Symptom Analysis [2.4554686192257424]
Large language models (LLMs) constitute a breakthrough state-of-the-art Artificial Intelligence technology.
We evaluate the correctness and accuracy of LLM-generated medical diagnosis with publicly available multimodal multiple-choice questions.
We explored a wide range of diseases, conditions, chemical compounds, and related entity types that are included in the vast knowledge domain of Pathology.
arXiv Detail & Related papers (2024-01-28T09:25:12Z)
- Surpassing GPT-4 Medical Coding with a Two-Stage Approach [1.7014913888753238]
GPT-4 predicts an excessive number of ICD codes for medical coding tasks, leading to high recall but low precision.
We introduce LLM-codex, a two-stage approach to predict ICD codes that first generates evidence proposals and then employs an LSTM-based verification stage.
Our model is the only approach that simultaneously achieves state-of-the-art results in medical coding accuracy, accuracy on rare codes, and sentence-level evidence identification.
arXiv Detail & Related papers (2023-11-22T23:35:13Z)
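The summary describes LLM-codex only as "LLM proposes, LSTM verifies". The sketch below shows the shape such a verification stage could take: a small LSTM scoring each (code, evidence) proposal so over-generated candidates can be filtered, trading some recall for precision. Embeddings, dimensions, and the keep threshold are assumptions.

```python
import torch
import torch.nn as nn

class EvidenceVerifier(nn.Module):
    """Second stage: score each (ICD code, evidence) proposal emitted by the
    first-stage LLM so low-confidence candidates can be dropped."""

    def __init__(self, emb_dim=128, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, evidence_embeddings):
        # evidence_embeddings: (batch, seq_len, emb_dim) token embeddings
        _, (h, _) = self.lstm(evidence_embeddings)
        return torch.sigmoid(self.head(h[-1]))  # keep-probability per proposal

def filter_codes(proposals, verifier, threshold=0.5):
    """proposals: list of (icd_code, token_embeddings) pairs from the LLM."""
    return [code for code, emb in proposals
            if verifier(emb.unsqueeze(0)).item() >= threshold]

verifier = EvidenceVerifier()
proposals = [("I10", torch.randn(12, 128)), ("E11.9", torch.randn(9, 128))]
print(filter_codes(proposals, verifier))  # untrained: roughly chance-level keeps
```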
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)