Related papers: A Super-Learner with Large Language Models for Medical Emergency Advising

URL: http://arxiv.org/abs/2511.08614v1
Date: Thu, 13 Nov 2025 01:00:42 GMT
Title: A Super-Learner with Large Language Models for Medical Emergency Advising
Authors: Sergey K. Aityan, Abdolreza Mosaddegh, Rolando Herrero, Haitham Tayyar, Jiang Han, Vikram Sawant, Qi Chen, Rishabh Jain, Aruna Senthamaraikannan, Stephen Wood, Manuel Mersini, Rita Lazzaro, Mario Balzaneli, Nicola Iacovazzo, Ciro Gargiulo Isacco
Abstract summary: Large Language Models (LLMs) have been employed in various fields of medical decision-support systems. We built a super-learner MEDAS (Medical Emergency Diagnostic Advising System) of five major LLMs. The super-learner produces higher diagnostic accuracy, 70%, even with a quite basic meta-learner.
Score: 6.918114949279224
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Medical decision-support and advising systems are critical for emergency physicians to quickly and accurately assess patients' conditions and make diagnoses. Artificial Intelligence (AI) has emerged as a transformative force in healthcare in recent years, and Large Language Models (LLMs) have been employed in various fields of medical decision-support systems. We studied the responses of a group of different LLMs to real cases in emergency medicine. The results of our study on the five most renowned LLMs showed significant differences in the capabilities of Large Language Models for diagnosing acute diseases in medical emergencies, with accuracy ranging between 58% and 65%. This accuracy significantly exceeds the reported accuracy of human doctors. We built a super-learner, MEDAS (Medical Emergency Diagnostic Advising System), of five major LLMs (Gemini, Llama, Grok, GPT, and Claude). The super-learner produces higher diagnostic accuracy, 70%, even with a quite basic meta-learner. However, at least one of the integrated LLMs in the same super-learner produces 85% correct diagnoses. The super-learner integrates a cluster of LLMs using a meta-learner capable of learning the different capabilities of each LLM, in order to raise the diagnostic accuracy of the model through the collective capabilities of all LLMs in the cluster. The results of our study showed that the aggregated diagnostic accuracy provided by a meta-learning approach exceeds that of any individual LLM, suggesting that the super-learner can take advantage of the combined knowledge of the medical datasets used to train the group of LLMs.
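Illustration: the abstract does not say how the MEDAS meta-learner works, so the following is only a minimal sketch of one basic way to combine several models' answers. It assumes an accuracy-weighted vote, where each model's weight is its accuracy on a set of labelled cases. The function names, the model labels A, B, and C, and the toy diagnoses are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def fit_weights(history, truths):
    """Estimate each model's accuracy on labelled cases.

    history: list of dicts mapping model name -> diagnosis for one case.
    truths:  list of correct diagnoses, aligned with history.
    This stands in for a very basic meta-learner (an assumption).
    """
    correct = defaultdict(int)
    for answers, truth in zip(history, truths):
        for model, diagnosis in answers.items():
            correct[model] += diagnosis == truth  # True counts as 1
    n = len(truths)
    return {model: hits / n for model, hits in correct.items()}


def predict(answers, weights):
    """Return the diagnosis with the largest accuracy-weighted vote."""
    votes = defaultdict(float)
    for model, diagnosis in answers.items():
        votes[diagnosis] += weights.get(model, 0.0)
    return max(votes, key=votes.get)


# Toy labelled cases (hypothetical data, for illustration only).
history = [
    {"A": "stroke", "B": "migraine", "C": "stroke"},
    {"A": "sepsis", "B": "sepsis", "C": "flu"},
    {"A": "MI", "B": "MI", "C": "MI"},
]
truths = ["stroke", "sepsis", "MI"]

weights = fit_weights(history, truths)
new_case = {"A": "appendicitis", "B": "gastritis", "C": "gastritis"}
print(predict(new_case, weights))
```

In this toy run the two less accurate models agree and together outweigh the single most accurate one, which is the kind of collective effect an ensemble of this sort relies on. A real meta-learner would likely condition the weights on case features rather than use one global accuracy per model.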

Related papers

A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems. We introduce the model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z)
MAM: Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis via Role-Specialized Collaboration [57.98393950821579]
We introduce the Modular Multi-Agent Framework for Multi-Modal Medical Diagnosis (MAM). Inspired by our empirical findings, MAM decomposes the medical diagnostic process into specialized roles: a General Practitioner, Specialist Team, Radiologist, Medical Assistant, and Director. This modular and collaborative framework enables efficient knowledge updates and leverages existing medical LLMs and knowledge bases.
arXiv Detail & Related papers (2025-06-24T17:52:43Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
Improving Interactive Diagnostic Ability of a Large Language Model Agent Through Clinical Experience Learning [17.647875658030006]
This study investigates the underlying mechanisms behind the performance degradation phenomenon. We developed a plug-and-play method enhanced (PPME) LLM agent, leveraging over 3.5 million electronic medical records from Chinese and American healthcare facilities. Our approach integrates specialized models for initial disease diagnosis and inquiry into the history of the present illness, trained through supervised and reinforcement learning techniques.
arXiv Detail & Related papers (2025-02-24T06:24:20Z)
Fact or Guesswork? Evaluating Large Language Models' Medical Knowledge with Structured One-Hop Judgments [108.55277188617035]
Large language models (LLMs) have been widely adopted in various downstream task domains, but their abilities to directly recall and apply factual medical knowledge remain under-explored. We introduce the Medical Knowledge Judgment dataset (MKJ), a dataset derived from the Unified Medical Language System (UMLS), a comprehensive repository of standardized vocabularies and knowledge graphs. Through a binary classification framework, MKJ evaluates LLMs' grasp of fundamental medical facts by having them assess the validity of concise, one-hop statements.
arXiv Detail & Related papers (2025-02-20T05:27:51Z)
Human-AI collectives produce the most accurate differential diagnoses [0.0]
We show that hybrid collectives of physicians and large language models (LLMs) outperform both single physicians and physician collectives. Our approach highlights the potential for collective human and machine intelligence to improve accuracy in complex, open-ended domains like medical diagnostics.
arXiv Detail & Related papers (2024-06-21T08:46:30Z)
A Survey on Large Language Models from General Purpose to Medical Applications: Datasets, Methodologies, and Evaluations [5.265452667976959]
This survey systematically summarizes how to train medical LLMs based on open-source general LLMs. It covers (a) how to acquire training corpus and construct customized medical training sets, (b) how to choose an appropriate training paradigm, and (d) existing challenges and promising research directions.
arXiv Detail & Related papers (2024-06-14T02:42:20Z)
Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses [0.2995925627097048]
This study evaluates each model's diagnostic abilities by interpreting a user's symptoms and determining diagnoses that fit well with common illnesses. GPT-4 demonstrates higher diagnostic accuracy from its deep and complete history of training on medical data. Gemini performs with high precision as a critical tool in disease triage, demonstrating its potential to be a reliable model.
arXiv Detail & Related papers (2024-05-09T15:12:24Z)
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
Combining Insights From Multiple Large Language Models Improves Diagnostic Accuracy [0.0]
Large language models (LLMs) are proposed as viable diagnostic support tools or even spoken of as replacements for "curbside consults." We assessed and compared the accuracy of differential diagnoses obtained by asking individual commercial LLMs against the accuracy of differential diagnoses synthesized by aggregating responses from combinations of the same LLMs.
arXiv Detail & Related papers (2024-02-13T21:24:21Z)
A Survey of Large Language Models in Medicine: Progress, Application, and Challenge [85.09998659355038]
Large language models (LLMs) have received substantial attention due to their capabilities for understanding and generating human language. This review aims to provide a detailed overview of the development and deployment of LLMs in medicine.
arXiv Detail & Related papers (2023-11-09T02:55:58Z)
Quantifying Self-diagnostic Atomic Knowledge in Chinese Medical Foundation Model: A Computational Analysis [55.742339781494046]
Foundation Models (FMs) have the potential to revolutionize the way users self-diagnose through search engines by offering direct and efficient suggestions. Recent studies primarily focused on the quality of FMs evaluated by GPT-4 or their ability to pass medical exams. No studies have quantified the extent of self-diagnostic atomic knowledge stored in FMs' memory.
arXiv Detail & Related papers (2023-10-18T05:42:22Z)

This list is automatically generated from the titles and abstracts of the papers on this site.