Performance of Large Language Models in Answering Critical Care Medicine Questions
- URL: http://arxiv.org/abs/2509.19344v1
- Date: Tue, 16 Sep 2025 14:46:34 GMT
- Title: Performance of Large Language Models in Answering Critical Care Medicine Questions
- Authors: Mahmoud Alwakeel, Aditya Nagori, An-Kwok Ian Wong, Neal Chaisson, Vijay Krishnamoorthy, Rishikesan Kamaleswaran
- Abstract summary: Large Language Models were tested on 871 Critical Care Medicine questions.
Llama3.1:70B outperformed 8B by 30%, with 60% average accuracy.
- Score: 1.825224193230824
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models have been tested on medical student-level questions, but their performance in specialized fields like Critical Care Medicine (CCM) is less explored. This study evaluated Meta-Llama 3.1 models (8B and 70B parameters) on 871 CCM questions. Llama3.1:70B outperformed 8B by 30%, with 60% average accuracy. Performance varied across domains, highest in Research (68.4%) and lowest in Renal (47.9%), highlighting the need for broader future work to improve models across various subspecialty domains.
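The paper does not include evaluation code; the following is a minimal sketch of how a multiple-choice evaluation of this kind is typically run against locally served Llama models, here via the `ollama` Python client. The model tag, prompt wording, and question fields (`stem`, `options`, `answer`, `domain`) are illustrative assumptions, not artifacts of the study.

```python
# Minimal multiple-choice evaluation loop (sketch; field names and
# model tag are assumptions, not from the paper).
from collections import defaultdict

import ollama

def evaluate(questions: list[dict], model: str = "llama3.1:70b") -> dict[str, float]:
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for q in questions:
        options = "\n".join(f"{k}. {v}" for k, v in q["options"].items())
        prompt = (f"{q['stem']}\n{options}\n"
                  "Answer with the single letter of the best option.")
        reply = ollama.chat(model=model,
                            messages=[{"role": "user", "content": prompt}])
        pred = reply["message"]["content"].strip()[:1].upper()
        total[q["domain"]] += 1
        correct[q["domain"]] += int(pred == q["answer"])
    # Per-domain accuracy, e.g. {"Research": 0.684, "Renal": 0.479, ...}
    return {d: correct[d] / total[d] for d in total}
```

Taking only the first character of the reply is brittle; careful evaluations constrain decoding to the option letters or parse the answer with a regex.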
Related papers
- PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety.
We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions.
We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
arXiv Detail & Related papers (2026-03-02T00:50:39Z)
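PanCanBench scores free-text answers against expert rubrics with an LLM-as-a-judge. Below is a minimal sketch of that general pattern, assuming an `ollama`-served judge; the rubric items and prompt wording are invented placeholders, not the benchmark's actual rubrics.

```python
# Rubric-based LLM-as-a-judge scoring (sketch; rubric items, prompt,
# and judge model are assumptions, not PanCanBench's actual rubrics).
import json

import ollama

RUBRIC = [
    "Mentions that the finding requires specialist follow-up.",
    "States an appropriate diagnostic next step.",
    "Contains no factually incorrect clinical claims.",
]

def judge(question: str, answer: str, model: str = "llama3.1:70b") -> list[bool]:
    items = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(RUBRIC))
    prompt = (f"Patient question:\n{question}\n\nCandidate answer:\n{answer}\n\n"
              "For each rubric item below, judge the candidate answer and "
              f"reply with a JSON list of booleans only:\n{items}")
    reply = ollama.chat(model=model,
                        messages=[{"role": "user", "content": prompt}],
                        format="json")  # constrain the reply to valid JSON
    return json.loads(reply["message"]["content"])  # e.g. [true, false, true]
```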
- Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation [0.0]
This paper compares five Large Language Models (LLMs) deployed between April 2024 and August 2025 for medical QA.
Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini.
Results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks.
arXiv Detail & Related papers (2026-02-16T08:53:23Z)
- A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice [83.11942224668127]
Janus-Pro-CXR (1B) is a chest X-ray interpretation system based on the DeepSeek Janus-Pro model.
Our system outperforms state-of-the-art X-ray report generation models in automated report generation.
arXiv Detail & Related papers (2025-12-23T13:26:13Z)
- Generalist Foundation Models Are Not Clinical Enough for Hospital Operations [29.539795338917983]
We introduce Lang1, a family of models pretrained on a specialized corpus blending 80B clinical tokens from NYU Langone Health's EHRs and 627B tokens from the internet.
To rigorously evaluate Lang1 in real-world settings, we developed the REalistic Medical Evaluation (ReMedE), a benchmark derived from 668,331 EHR notes.
Lang1-1B outperforms finetuned generalist models up to 70x larger and zero-shot models up to 671x larger, improving AUROC by 3.64%-6.75% and 1.66%-23.66% respectively.
arXiv Detail & Related papers (2025-11-17T18:52:22Z)
- 47B Mixture-of-Experts Beats 671B Dense Models on Chinese Medical Examinations [10.072653135781207]
This paper presents a benchmark evaluation of 27 large language models (LLMs) on Chinese medical examination questions.
Our analysis reveals substantial performance variations among models, with Mixtral-8x7B achieving the highest overall accuracy of 74.25%.
The evaluation demonstrates significant performance gaps between medical specialties, with models generally performing better on cardiovascular and neurology questions.
arXiv Detail & Related papers (2025-11-16T06:08:41Z)
- Med-RewardBench: Benchmarking Reward Models and Judges for Medical Multimodal Large Language Models [57.73472878679636]
We introduce Med-RewardBench, the first benchmark specifically designed to evaluate medical reward models and judges.
Med-RewardBench features a multimodal dataset spanning 13 organ systems and 8 clinical departments, with 1,026 expert-annotated cases.
A rigorous three-step process ensures high-quality evaluation data across six clinically critical dimensions.
arXiv Detail & Related papers (2025-08-29T08:58:39Z)
- Agentic large language models improve retrieval-based radiology question answering [4.208637377704778]
We propose an agentic RAG framework enabling large language models (LLMs) to autonomously decompose radiology questions.
LLMs iteratively retrieve targeted clinical evidence from Radiopaedia.org and dynamically synthesize evidence-based responses.
Agentic retrieval significantly improved mean diagnostic accuracy over zero-shot prompting and conventional online RAG.
arXiv Detail & Related papers (2025-08-01T16:18:52Z)
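The agentic RAG paper above describes models that decompose a question, iteratively retrieve evidence, and synthesize an answer. Here is a sketch of that control flow under stated assumptions: `search_radiopaedia` is a hypothetical stand-in for the retrieval backend, and the stopping rule and prompts are not from the paper.

```python
# Agentic retrieve-then-answer loop (sketch; retrieval backend,
# stopping rule, and prompts are assumptions, not the paper's design).
import ollama

def search_radiopaedia(query: str) -> str:
    """Hypothetical retriever returning snippet text for a query."""
    raise NotImplementedError  # stand-in; the paper's backend is not public

def agentic_answer(question: str, model: str = "llama3.1:70b", max_steps: int = 3) -> str:
    evidence: list[str] = []
    for _ in range(max_steps):
        # Ask the model to plan its next retrieval, or to stop.
        plan = ollama.chat(model=model, messages=[{
            "role": "user",
            "content": (f"Question: {question}\nEvidence so far: {evidence}\n"
                        "Reply with one search query, or DONE if the evidence suffices."),
        }])["message"]["content"].strip()
        if plan.upper().startswith("DONE"):
            break
        evidence.append(search_radiopaedia(plan))
    # Synthesize a final answer from the accumulated evidence.
    final = ollama.chat(model=model, messages=[{
        "role": "user",
        "content": f"Using this evidence:\n{evidence}\n\nAnswer the question: {question}",
    }])
    return final["message"]["content"]
```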
- ReasonMed: A 370K Multi-Agent Generated Dataset for Advancing Medical Reasoning [54.30630356786752]
ReasonMed is the largest medical reasoning dataset to date, with 370k high-quality examples.
It is built through a multi-agent generation, verification, and refinement process.
Using ReasonMed, we find that integrating detailed CoT reasoning with concise answer summaries yields the most robust fine-tuning results.
arXiv Detail & Related papers (2025-06-11T08:36:55Z)
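ReasonMed is described as being built through multi-agent generation, verification, and refinement. One way such a generate-verify-refine loop can look, with assumed roles, prompts, and acceptance test rather than the paper's actual agent design:

```python
# Generate-verify-refine loop for reasoning data (sketch; roles, prompts,
# and the PASS test are assumptions, not ReasonMed's actual agents).
import ollama

def ask(system: str, user: str, model: str = "llama3.1:70b") -> str:
    reply = ollama.chat(model=model, messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ])
    return reply["message"]["content"]

def build_example(question: str, gold: str, max_rounds: int = 2) -> dict:
    # Generator agent drafts a chain-of-thought.
    cot = ask("You are a medical tutor. Reason step by step.", question)
    for _ in range(max_rounds):
        # Verifier agent accepts the draft or returns a one-line critique.
        verdict = ask("You are a strict verifier. Reply PASS or a one-line critique.",
                      f"Question: {question}\nGold answer: {gold}\nReasoning: {cot}")
        if verdict.strip().upper().startswith("PASS"):
            break
        # Refiner agent revises the draft to address the critique.
        cot = ask("Revise the reasoning to address the critique.",
                  f"Question: {question}\nCritique: {verdict}\nDraft: {cot}")
    return {"question": question, "reasoning": cot, "answer": gold}
```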
- MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks [47.486705282473984]
Large language models (LLMs) achieve near-perfect scores on medical exams.
These evaluations inadequately reflect the complexity and diversity of real-world clinical practice.
We introduce MedHELM, an evaluation framework for assessing LLM performance on medical tasks.
arXiv Detail & Related papers (2025-05-26T22:55:49Z)
- Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.
We propose a novel approach utilizing structured medical reasoning.
Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
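For the structured-outputs entry above, a minimal sketch of constraining a general-purpose model to a fixed answer structure using `ollama`'s JSON output mode; the schema keys are illustrative, not the paper's published reasoning format.

```python
# Structured medical answer via JSON-constrained decoding (sketch; the
# schema keys are illustrative, not the paper's published format).
import json

import ollama

def structured_answer(question: str, model: str = "llama3.1:70b") -> dict:
    prompt = ("Answer the medical question as a JSON object with keys "
              '"findings", "differential", "reasoning", and "final_answer".\n'
              f"Question: {question}")
    reply = ollama.chat(model=model,
                        messages=[{"role": "user", "content": prompt}],
                        format="json")  # forces syntactically valid JSON
    return json.loads(reply["message"]["content"])
```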
- The Limited Impact of Medical Adaptation of Large Language and Vision-Language Models [42.13371892174481]
We compare medical large language models (LLMs) and vision-language models (VLMs) against their corresponding base models.
Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities.
arXiv Detail & Related papers (2024-11-13T18:50:13Z)
- Biomedical Large Languages Models Seem not to be Superior to Generalist Models on Unseen Medical Data [3.469567586411153]
Large language models (LLMs) have shown potential in biomedical applications, leading to efforts to fine-tune them on domain-specific data.
This study evaluates the performance of biomedically fine-tuned LLMs against their general-purpose counterparts on a variety of clinical tasks.
arXiv Detail & Related papers (2024-08-25T13:36:22Z)
- MGH Radiology Llama: A Llama 3 70B Model for Radiology [50.42811030970618]
This paper presents an advanced radiology-focused large language model: MGH Radiology Llama.
It is developed using the Llama 3 70B model, building upon previous domain-specific models like Radiology-GPT and Radiology-Llama2.
Our evaluation, incorporating both traditional metrics and a GPT-4-based assessment, highlights the enhanced performance of this work over general-purpose LLMs.
arXiv Detail & Related papers (2024-08-13T01:30:03Z)
- Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks [17.40940406100025]
We introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters.
Our systems achieved remarkable accuracy across six medical benchmarks.
Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, outperforming the human average of 13.8.
arXiv Detail & Related papers (2024-03-30T14:09:00Z)