Combining Insights From Multiple Large Language Models Improves
Diagnostic Accuracy
- URL: http://arxiv.org/abs/2402.08806v1
- Date: Tue, 13 Feb 2024 21:24:21 GMT
- Title: Combining Insights From Multiple Large Language Models Improves
Diagnostic Accuracy
- Authors: Gioele Barabucci, Victor Shia, Eugene Chu, Benjamin Harack, Nathan Fu
- Abstract summary: Large language models (LLMs) are proposed as viable diagnostic support tools or even spoken of as replacements for "curbside consults".
We assessed and compared the accuracy of differential diagnoses obtained by asking individual commercial LLMs against the accuracy of differential diagnoses synthesized by aggregating responses from combinations of the same LLMs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Large language models (LLMs) such as OpenAI's GPT-4 or Google's
PaLM 2 are proposed as viable diagnostic support tools or even spoken of as
replacements for "curbside consults". However, even LLMs specifically trained
on medical topics may lack sufficient diagnostic accuracy for real-life
applications.
Methods: Using collective intelligence methods and a dataset of 200 clinical
vignettes of real-life cases, we assessed and compared the accuracy of
differential diagnoses obtained by asking individual commercial LLMs (OpenAI
GPT-4, Google PaLM 2, Cohere Command, Meta Llama 2) against the accuracy of
differential diagnoses synthesized by aggregating responses from combinations
of the same LLMs.
Results: We find that aggregating responses from multiple, diverse LLMs leads
to more accurate differential diagnoses (average accuracy for 3 LLMs:
$75.3\%\pm 1.6pp$) than the differential diagnoses produced by single
LLMs (average accuracy for single LLMs: $59.0\%\pm 6.1pp$).
Discussion: The use of collective intelligence methods to synthesize
differential diagnoses combining the responses of different LLMs achieves two
of the necessary steps towards advancing acceptance of LLMs as a diagnostic
support tool: (1) demonstrate high diagnostic accuracy and (2) eliminate
dependence on a single commercial vendor.
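The abstract does not specify how the responses were synthesized, so as a minimal illustration only: one common collective-intelligence approach to combining ranked lists is a Borda-style rank aggregation, sketched below. The function name and the example diagnoses are hypothetical, not taken from the paper.

```python
from collections import defaultdict

def aggregate_differentials(model_rankings, top_n=5):
    """Combine ranked differential-diagnosis lists from several LLMs.

    Each entry in model_rankings is one model's ordered list of candidate
    diagnoses (most likely first). A diagnosis ranked r-th (0-based) in a
    list of length k earns k - r points; totals across models give the
    aggregated ranking.
    """
    scores = defaultdict(float)
    for ranking in model_rankings:
        k = len(ranking)
        for r, dx in enumerate(ranking):
            scores[dx.strip().lower()] += k - r
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [dx for dx, _ in ranked][:top_n]

# Hypothetical outputs from three models for the same clinical vignette
gpt4 = ["pulmonary embolism", "pneumonia", "pericarditis"]
palm2 = ["pneumonia", "pulmonary embolism", "pleuritis"]
llama2 = ["pulmonary embolism", "pleuritis", "pneumonia"]

print(aggregate_differentials([gpt4, palm2, llama2]))
# → ['pulmonary embolism', 'pneumonia', 'pleuritis', 'pericarditis']
```

Diagnoses proposed by several models accumulate points from each list, so the aggregate favors consensus candidates over any single model's idiosyncratic picks.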
Related papers
- Anomaly Detection of Tabular Data Using LLMs [54.470648484612866]
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors.
We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z)
- Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints [8.547853819087043]
We evaluate the capability of general LLMs to identify and correct medical errors with multiple prompting strategies.
We propose incorporating error-span predictions from a smaller, fine-tuned model in two ways.
Our best-performing solution with 8-shot + CoT + hints ranked sixth in the shared task leaderboard.
arXiv Detail & Related papers (2024-05-28T10:20:29Z)
- XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare [16.79952669254101]
We develop a novel method for zero-shot/few-shot in-context learning (ICL) using a multi-layered structured prompt.
We also explore the efficacy of two communication styles between the user and Large Language Models (LLMs).
Our study systematically evaluates the diagnostic accuracy and risk factors, including gender bias and false negative rates.
arXiv Detail & Related papers (2024-05-10T06:52:44Z)
- Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches [7.3384872719063114]
We develop and refine a series of medical Large Language Models (LLMs) based on the Llama-2 architecture.
Our experiments systematically evaluate the effectiveness of these tuning strategies across various well-known medical benchmarks.
arXiv Detail & Related papers (2024-04-23T06:36:21Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- Large Language Model Distilling Medication Recommendation Model [61.89754499292561]
We harness the powerful semantic comprehension and input-agnostic characteristics of Large Language Models (LLMs).
Our research aims to transform existing medication recommendation methodologies using LLMs.
To mitigate this, we have developed a feature-level knowledge distillation technique, which transfers the LLM's proficiency to a more compact model.
arXiv Detail & Related papers (2024-02-05T08:25:22Z)
- Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis [2.4554686192257424]
Large language models (LLMs) constitute a breakthrough state-of-the-art Artificial Intelligence technology.
We evaluate the correctness and accuracy of LLM-generated medical diagnosis with publicly available multimodal multiple-choice questions.
We explored a wide range of diseases, conditions, chemical compounds, and related entity types that are included in the vast knowledge domain of Pathology.
arXiv Detail & Related papers (2024-01-28T09:25:12Z)
- Surpassing GPT-4 Medical Coding with a Two-Stage Approach [1.7014913888753238]
GPT-4 LLM predicts an excessive number of ICD codes for medical coding tasks, leading to high recall but low precision.
We introduce LLM-codex, a two-stage approach to predict ICD codes that first generates evidence proposals and then employs an LSTM-based verification stage.
Our model is the only approach that simultaneously achieves state-of-the-art results in medical coding accuracy, accuracy on rare codes, and sentence-level evidence identification.
arXiv Detail & Related papers (2023-11-22T23:35:13Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
- Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning.
This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation.
We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency.
arXiv Detail & Related papers (2023-06-22T17:31:44Z)
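The last entry above describes a black-box framework of prompting, sampling, and aggregation. As a minimal sketch of the aggregation component only (the consistency idea, not the paper's exact method): confidence can be estimated as the fraction of sampled responses that agree with the majority answer. The function name and example are illustrative assumptions.

```python
from collections import Counter

def consistency_confidence(responses):
    """Black-box confidence estimate from multiple sampled responses.

    Returns the majority answer (after simple normalization) together with
    the fraction of samples that agree with it; higher agreement is read
    as higher confidence, with no access to model internals.
    """
    if not responses:
        raise ValueError("need at least one sampled response")
    normalized = [r.strip().lower() for r in responses]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)

# Four hypothetical samples of the same question at nonzero temperature
samples = ["Paris", "paris", "Paris", "Lyon"]
print(consistency_confidence(samples))
# → ('paris', 0.75)
```

This relies only on repeated sampling, so it applies to API-only models where logits and fine-tuning are unavailable.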
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.