Performance of Large Language Models in Answering Critical Care Medicine Questions
- URL: http://arxiv.org/abs/2509.19344v1
- Date: Tue, 16 Sep 2025 14:46:34 GMT
- Title: Performance of Large Language Models in Answering Critical Care Medicine Questions
- Authors: Mahmoud Alwakeel, Aditya Nagori, An-Kwok Ian Wong, Neal Chaisson, Vijay Krishnamoorthy, Rishikesan Kamaleswaran
- Abstract summary: Large Language Models were tested on 871 Critical Care Medicine questions.
Llama3.1:70B outperformed 8B by 30%, with 60% average accuracy.
- Score: 1.825224193230824
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models have been tested on medical student-level questions, but their performance in specialized fields like Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama3.1:70B outperformed 8B by 30%, with 60% average accuracy. Performance varied across domains, highest in Research (68.4%) and lowest in Renal (47.9%), highlighting the need for broader future work to improve models across various subspecialty domains.
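The paper does not include evaluation code; the following is a minimal sketch of how a multiple-choice evaluation of this kind is typically run against locally served Llama models, here via the `ollama` Python client. The model tag, prompt wording, and question fields (`stem`, `options`, `answer`, `domain`) are illustrative assumptions, not artifacts of the study.

```python
# Minimal multiple-choice evaluation loop (sketch; field names and
# model tag are assumptions, not from the paper).
from collections import defaultdict

import ollama

def evaluate(questions: list[dict], model: str = "llama3.1:70b") -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for q in questions:
        options = "\n".join(f"{k}. {v}" for k, v in q["options"].items())
        prompt = (f"{q['stem']}\n{options}\n"
                  "Answer with the single letter of the best option.")
        reply = ollama.chat(model=model,
                            messages=[{"role": "user", "content": prompt}])
        pred = reply["message"]["content"].strip()[:1].upper()
        total[q["domain"]] += 1
        correct[q["domain"]] += int(pred == q["answer"])
    # Per-domain accuracy, e.g. {"Research": 0.684, "Renal": 0.479, ...}
    return {d: correct[d] / total[d] for d in total}
```

Taking only the first character of the reply is brittle; careful evaluations constrain decoding to the option letters or parse the answer with a regex.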
Related papers
- PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.
We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions.
We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
arXiv Detail & Related papers (2026-03-02T00:50:39Z)
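PanCanBench scores free-text answers against expert rubrics with an LLM-as-a-judge. Below is a minimal sketch of that general pattern, assuming an `ollama`-served judge; the rubric items and prompt wording are invented placeholders, not the benchmark's actual rubrics.

```python
# Rubric-based LLM-as-a-judge scoring (sketch; rubric items, prompt,
# and judge model are assumptions, not PanCanBench's actual rubrics).
import json

import ollama

RUBRIC = [
    "Mentions that the finding requires specialist follow-up.",
    "States an appropriate diagnostic next step.",
    "Contains no factually incorrect clinical claims.",
]

def judge(question: str, answer: str, model: str = "llama3.1:70b") -> list[bool]:
    items = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(RUBRIC))
    prompt = (f"Patient question:\n{question}\n\nCandidate answer:\n{answer}\n\n"
              "For each rubric item below, judge the candidate answer and "
              f"reply with a JSON list of booleans only:\n{items}")
    reply = ollama.chat(model=model,
                        messages=[{"role": "user", "content": prompt}],
                        format="json")  # constrain the reply to valid JSON
    return json.loads(reply["message"]["content"])  # e.g. [true, false, true]
```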
- Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation [0.0]
This paper compares five Large Language Models (LLMs) deployed between April 2024 and August 2025 for medical QA.
Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini.
Results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks.
arXiv Detail & Related papers (2026-02-16T08:53:23Z)
- A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice [83.11942224668127]
Janus-Pro-CXR (1B) is a chest X-ray interpretation system based on the DeepSeek Janus-Pro model.
Our system outperforms state-of-the-art X-ray report generation models in automated report generation.
arXiv Detail & Related papers (2025-12-23T13:26:13Z)
- Generalist Foundation Models Are Not Clinical Enough for Hospital Operations [29.539795338917983]
We introduce Lang1, a family of models pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet.
To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes.
Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively.
arXiv Detail & Related papers (2025-11-17T18:52:22Z)
- 47B Mixture-of-Experts Beats 671B Dense Models on Chinese Medical Examinations [10.072653135781207]
This paper presents a benchmark evaluation of 27 large language models (LLMs) on Chinese medical examination questions.
Our analysis reveals substantial performance variations among models, with Mixtral-8x7B achieving the highest overall accuracy of 74.25%.
The evaluation demonstrates significant performance gaps between medical specialties, with models generally performing better on cardiovascular and neurology questions.
arXiv Detail & Related papers (2025-11-16T06:08:41Z)
- Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models [57.73472878679636]
We introduce Med-RewardBench, the first benchmark specifically designed to evaluate medical reward models and judges.
Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases.
A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions.
arXiv Detail & Related papers (2025-08-29T08:58:39Z)
- Agentic large language models improve retrieval-based radiology question answering [4.208637377704778]
We propose an agentic RAG framework enabling large language models (LLMs) to autonomously decompose radiology questions.
LLMs iteratively retrieve targeted clinical evidence from Radiopaedia.org and dynamically synthesize evidence-based responses.
Agentic retrieval significantly improved mean diagnostic accuracy over zero-shot prompting and conventional online RAG.
arXiv Detail & Related papers (2025-08-01T16:18:52Z)
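The agentic RAG paper above describes models that decompose a question, iteratively retrieve evidence, and synthesize an answer. Here is a sketch of that control flow under stated assumptions: `search_radiopaedia` is a hypothetical stand-in for the retrieval backend, and the stopping rule and prompts are not from the paper.

```python
# Agentic retrieve-then-answer loop (sketch; retrieval backend,
# stopping rule, and prompts are assumptions, not the paper's design).
import ollama

def search_radiopaedia(query: str) -> str:
    """Hypothetical retriever returning snippet text for a query."""
    raise NotImplementedError  # stand-in; the paper's backend is not public

def agentic_answer(question: str, model: str = "llama3.1:70b", max_steps: int = 3) -> str:
    evidence: list[str] = []
    for _ in range(max_steps):
        # Ask the model to plan its next retrieval, or to stop.
        plan = ollama.chat(model=model, messages=[{
            "role": "user",
            "content": (f"Question: {question}\nEvidence so far: {evidence}\n"
                        "Reply with one search query, or DONE if the evidence suffices."),
        }])["message"]["content"].strip()
        if plan.upper().startswith("DONE"):
            break
        evidence.append(search_radiopaedia(plan))
    # Synthesize a final answer from the accumulated evidence.
    final = ollama.chat(model=model, messages=[{
        "role": "user",
        "content": f"Using this evidence:\n{evidence}\n\nAnswer the question: {question}",
    }])
    return final["message"]["content"]
```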
- ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning [54.30630356786752]
ReasonMed is the largest medical reasoning dataset to date, with 370k high-quality examples.
It is built through a multi-agent generation, verification, and refinement process.
Using ReasonMed, we find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results.
arXiv Detail & Related papers (2025-06-11T08:36:55Z)
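ReasonMed is described as being built through multi-agent generation, verification, and refinement. One way such a generate-verify-refine loop can look, with assumed roles, prompts, and acceptance test rather than the paper's actual agent design:

```python
# Generate-verify-refine loop for reasoning data (sketch; roles, prompts,
# and the PASS test are assumptions, not ReasonMed's actual agents).
import ollama

def ask(system: str, user: str, model: str = "llama3.1:70b") -> str:
    reply = ollama.chat(model=model, messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ])
    return reply["message"]["content"]

def build_example(question: str, gold: str, max_rounds: int = 2) -> dict:
    # Generator agent drafts a chain-of-thought.
    cot = ask("You are a medical tutor. Reason step by step.", question)
    for _ in range(max_rounds):
        # Verifier agent accepts the draft or returns a one-line critique.
        verdict = ask("You are a strict verifier. Reply PASS or a one-line critique.",
                      f"Question: {question}\nGold answer: {gold}\nReasoning: {cot}")
        if verdict.strip().upper().startswith("PASS"):
            break
        # Refiner agent revises the draft to address the critique.
        cot = ask("Revise the reasoning to address the critique.",
                  f"Question: {question}\nCritique: {verdict}\nDraft: {cot}")
    return {"question": question, "reasoning": cot, "answer": gold}
```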
- MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks [47.486705282473984]
Large language models (LLMs) achieve near-perfect scores on medical exams.
These evaluations inadequately reflect the complexity and diversity of real-world clinical practice.
We introduce MedHELM, an evaluation framework for assessing LLM performance on medical tasks.
arXiv Detail & Related papers (2025-05-26T22:55:49Z)
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
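For the structured-outputs entry above, a minimal sketch of constraining a general-purpose model to a fixed answer structure using `ollama`'s JSON output mode; the schema keys are illustrative, not the paper's published reasoning format.

```python
# Structured medical answer via JSON-constrained decoding (sketch; the
# schema keys are illustrative, not the paper's published format).
import json

import ollama

def structured_answer(question: str, model: str = "llama3.1:70b") -> dict:
    prompt = ("Answer the medical question as a JSON object with keys "
              '"findings", "differential", "reasoning", and "final_answer".\n'
              f"Question: {question}")
    reply = ollama.chat(model=model,
                        messages=[{"role": "user", "content": prompt}],
                        format="json")  # forces syntactically valid JSON
    return json.loads(reply["message"]["content"])
```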
- The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models [42.13371892174481]
We compare medical large language models (LLMs) and vision-language models (VLMs) against their corresponding base models.
Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities.
arXiv Detail & Related papers (2024-11-13T18:50:13Z)
- Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data [3.469567586411153]
Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data.
This study evaluates the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on a variety of clinical tasks.
arXiv Detail & Related papers (2024-08-25T13:36:22Z)
- MGH Radiology Llama: A Llama 3 70B Model for Radiology [50.42811030970618]
This paper presents an advanced radiology-focused large language model: MGH Radiology Llama.
It is developed using the Llama 3 70B model, building upon previous domain-specific models like Radiology-GPT and Radiology-Llama2.
Our evaluation, incorporating both traditional metrics and a GPT-4-based assessment, highlights the enhanced performance of this work over general-purpose LLMs.
arXiv Detail & Related papers (2024-08-13T01:30:03Z)
- Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks [17.40940406100025]
We introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters.
Our systems achieved remarkable accuracy across six medical benchmarks.
Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, outperforming the human average of 13.8.
arXiv Detail & Related papers (2024-03-30T14:09:00Z)