Combining Insights From Multiple Large Language Models Improves
Diagnostic Accuracy
- URL: http://arxiv.org/abs/2402.08806v1
- Date: Tue, 13 Feb 2024 21:24:21 GMT
- Title: Combining Insights From Multiple Large Language Models Improves
Diagnostic Accuracy
- Authors: Gioele Barabucci, Victor Shia, Eugene Chu, Benjamin Harack, Nathan Fu
- Abstract summary: Large language models (LLMs) are proposed as viable diagnostic support tools or even spoken of as replacements for "curbside consults".
We assessed and compared the accuracy of differential diagnoses obtained by asking individual commercial LLMs against the accuracy of differential diagnoses synthesized by aggregating responses from combinations of the same LLMs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Background: Large language models (LLMs) such as OpenAI's GPT-4 or Google's
PaLM 2 are proposed as viable diagnostic support tools or even spoken of as
replacements for "curbside consults". However, even LLMs specifically trained
on medical topics may lack sufficient diagnostic accuracy for real-life
applications.
Methods: Using collective intelligence methods and a dataset of 200 clinical
vignettes of real-life cases, we assessed and compared the accuracy of
differential diagnoses obtained by asking individual commercial LLMs (OpenAI
GPT-4, Google PaLM 2, Cohere Command, Meta Llama 2) against the accuracy of
differential diagnoses synthesized by aggregating responses from combinations
of the same LLMs.
Results: We find that aggregating responses from multiple, diverse LLMs leads
to more accurate differential diagnoses (average accuracy for 3 LLMs:
$75.3\%\pm 1.6pp$) than the differential diagnoses produced by single
LLMs (average accuracy for single LLMs: $59.0\%\pm 6.1pp$).
Discussion: The use of collective intelligence methods to synthesize
differential diagnoses combining the responses of different LLMs achieves two
of the necessary steps towards advancing acceptance of LLMs as a diagnostic
support tool: (1) demonstrate high diagnostic accuracy and (2) eliminate
dependence on a single commercial vendor.
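The abstract does not specify how the responses were synthesized, so as a minimal illustration only: one common collective-intelligence approach to combining ranked lists is a Borda-style rank aggregation, sketched below. The function name and the example diagnoses are hypothetical, not taken from the paper.

```python
from collections import defaultdict

def aggregate_differentials(model_rankings, top_n=5):
    """Combine ranked differential-diagnosis lists from several LLMs.

    Each entry in model_rankings is one model's ordered list of candidate
    diagnoses (most likely first). A diagnosis ranked r-th (0-based) in a
    list of length k earns k - r points; totals across models give the
    aggregated ranking.
    """
    scores = defaultdict(float)
    for ranking in model_rankings:
        k = len(ranking)
        for r, dx in enumerate(ranking):
            scores[dx.strip().lower()] += k - r
    ranked = sorted(scores.items(), key=lambda kv: -kv[1])
    return [dx for dx, _ in ranked][:top_n]

# Hypothetical outputs from three models for the same clinical vignette
gpt4 = ["pulmonary embolism", "pneumonia", "pericarditis"]
palm2 = ["pneumonia", "pulmonary embolism", "pleuritis"]
llama2 = ["pulmonary embolism", "pleuritis", "pneumonia"]

print(aggregate_differentials([gpt4, palm2, llama2]))
# → ['pulmonary embolism', 'pneumonia', 'pleuritis', 'pericarditis']
```

Diagnoses proposed by several models accumulate points from each list, so the aggregate favors consensus candidates over any single model's idiosyncratic picks.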
Related papers
- Anomaly Detection of Tabular Data Using LLMs [54.470648484612866]
We show that pre-trained large language models (LLMs) are zero-shot batch-level anomaly detectors.
We propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies.
arXiv Detail & Related papers (2024-06-24T04:17:03Z)
- Edinburgh Clinical NLP at MEDIQA-CORR 2024: Guiding Large Language Models with Hints [8.547853819087043]
We evaluate the capability of general LLMs to identify and correct medical errors with multiple prompting strategies.
We propose incorporating error-span predictions from a smaller, fine-tuned model in two ways.
Our best-performing solution with 8-shot + CoT + hints ranked sixth in the shared task leaderboard.
arXiv Detail & Related papers (2024-05-28T10:20:29Z)
- XAI4LLM. Let Machine Learning Models and LLMs Collaborate for Enhanced In-Context Learning in Healthcare [16.79952669254101]
We develop a novel method for zero-shot/few-shot in-context learning (ICL) using a multi-layered structured prompt.
We also explore the efficacy of two communication styles between the user and Large Language Models (LLMs).
Our study systematically evaluates the diagnostic accuracy and risk factors, including gender bias and false negative rates.
arXiv Detail & Related papers (2024-05-10T06:52:44Z)
- Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches [7.3384872719063114]
We develop and refine a series of medical Large Language Models (LLMs) based on the Llama-2 architecture.
Our experiments systematically evaluate the effectiveness of these tuning strategies across various well-known medical benchmarks.
arXiv Detail & Related papers (2024-04-23T06:36:21Z)
- AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between a Doctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
- Large Language Model Distilling Medication Recommendation Model [61.89754499292561]
We harness the powerful semantic comprehension and input-agnostic characteristics of Large Language Models (LLMs).
Our research aims to transform existing medication recommendation methodologies using LLMs.
To mitigate this, we have developed a feature-level knowledge distillation technique, which transfers the LLM's proficiency to a more compact model.
arXiv Detail & Related papers (2024-02-05T08:25:22Z)
- Evaluating LLM-Generated Multimodal Diagnosis from Medical Images and Symptom Analysis [2.4554686192257424]
Large language models (LLMs) constitute a breakthrough state-of-the-art Artificial Intelligence technology.
We evaluate the correctness and accuracy of LLM-generated medical diagnosis with publicly available multimodal multiple-choice questions.
We explored a wide range of diseases, conditions, chemical compounds, and related entity types that are included in the vast knowledge domain of Pathology.
arXiv Detail & Related papers (2024-01-28T09:25:12Z)
- Surpassing GPT-4 Medical Coding with a Two-Stage Approach [1.7014913888753238]
GPT-4 LLM predicts an excessive number of ICD codes for medical coding tasks, leading to high recall but low precision.
We introduce LLM-codex, a two-stage approach to predict ICD codes that first generates evidence proposals and then employs an LSTM-based verification stage.
Our model is the only approach that simultaneously achieves state-of-the-art results in medical coding accuracy, accuracy on rare codes, and sentence-level evidence identification.
arXiv Detail & Related papers (2023-11-22T23:35:13Z)
- Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, emerges as a pivotal metric to identify discrepancies between a model's expected responses and its intrinsic generation capability.
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
- Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning.
This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation.
We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency.
arXiv Detail & Related papers (2023-06-22T17:31:44Z)
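The last entry above describes a black-box framework of prompting, sampling, and aggregation. As a minimal sketch of the aggregation component only (the consistency idea, not the paper's exact method): confidence can be estimated as the fraction of sampled responses that agree with the majority answer. The function name and example are illustrative assumptions.

```python
from collections import Counter

def consistency_confidence(responses):
    """Black-box confidence estimate from multiple sampled responses.

    Returns the majority answer (after simple normalization) together with
    the fraction of samples that agree with it; higher agreement is read
    as higher confidence, with no access to model internals.
    """
    if not responses:
        raise ValueError("need at least one sampled response")
    normalized = [r.strip().lower() for r in responses]
    answer, count = Counter(normalized).most_common(1)[0]
    return answer, count / len(normalized)

# Four hypothetical samples of the same question at nonzero temperature
samples = ["Paris", "paris", "Paris", "Lyon"]
print(consistency_confidence(samples))
# → ('paris', 0.75)
```

This relies only on repeated sampling, so it applies to API-only models where logits and fine-tuning are unavailable.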
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.