Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support
- URL: http://arxiv.org/abs/2601.03266v1
- Date: Thu, 18 Dec 2025 22:29:45 GMT
- Title: Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support
- Authors: Alif Munim, Jun Ma, Omar Ibrahim, Alhusain Abdalla, Shuolin Yin, Leo Chen, Bo Wang,
- Abstract summary: Large language models (LLMs) have rapidly advanced in clinical decision-making.<n>Yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure.
- Score: 3.165122193962168
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have rapidly advanced in clinical decision-making, yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure. Open-source alternatives allow local inference but often require large model sizes that limit their use in resource-constrained clinical settings. Here, we benchmark two on-device LLMs, gpt-oss-20b and gpt-oss-120b, across three representative clinical tasks: general disease diagnosis, specialty-specific (ophthalmology) diagnosis and management, and simulation of human expert grading and evaluation. We compare their performance with state-of-the-art proprietary models (GPT-5 and o4-mini) and a leading open-source model (DeepSeek-R1), and we further evaluate the adaptability of on-device systems by fine-tuning gpt-oss-20b on general diagnostic data. Across tasks, gpt-oss models achieve performance comparable to or exceeding DeepSeek-R1 and o4-mini despite being substantially smaller. In addition, fine-tuning remarkably improves the diagnostic accuracy of gpt-oss-20b, enabling it to approach the performance of GPT-5. These findings highlight the potential of on-device LLMs to deliver accurate, adaptable, and privacy-preserving clinical decision support, offering a practical pathway for broader integration of LLMs into routine clinical practice.
Related papers
- A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis.<n>Most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems.<n>We introduce the model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z) - MMedExpert-R1: Strengthening Multimodal Medical Reasoning via Domain-Specific Adaptation and Clinical Guideline Reinforcement [63.82954136824963]
Medical Vision-Language Models excel at perception tasks with complex clinical reasoning required in real-world scenarios.<n>We propose a novel reasoning MedVLM that addresses these challenges through domain-specific adaptation and guideline reinforcement.
arXiv Detail & Related papers (2026-01-16T02:32:07Z) - Timely Clinical Diagnosis through Active Test Selection [49.091903570068155]
We propose ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design) to better emulate real-world diagnostic reasoning.<n>LLMs act as flexible simulators, generating plausible patient state distributions and supporting belief updates without requiring structured, task-specific training data.<n>We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use.
arXiv Detail & Related papers (2025-10-21T18:10:45Z) - EndoBench: A Comprehensive Evaluation of Multi-Modal Large Language Models for Endoscopy Analysis [62.00431604976949]
EndoBench is the first comprehensive benchmark specifically designed to assess MLLMs across the full spectrum of endoscopic practice.<n>We benchmark 23 state-of-the-art models, including general-purpose, medical-specialized, and proprietary MLLMs.<n>Our experiments reveal: proprietary MLLMs outperform open-source and medical-specialized models overall, but still trail human experts.
arXiv Detail & Related papers (2025-05-29T16:14:34Z) - A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment [46.776978552161395]
Small language models (SLMs) offer a cost-effective alternative to large language models such as GPT-4.<n>SLMs offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation.<n>We propose a novel framework for adapting SLMs into high-performing clinical models.
arXiv Detail & Related papers (2025-05-15T21:40:21Z) - In-Context Learning for Label-Efficient Cancer Image Classification in Oncology [1.741659712094955]
In-context learning (ICL) is a pragmatic alternative to model retraining for domain-specific diagnostic tasks.<n>We evaluated the performance of four vision-language models (VLMs)-Paligemma, CLIP, ALIGN and GPT-4o.<n>ICL demonstrated competitive gains despite their smaller size, suggesting feasibility for deployment in computing constrained clinical environments.
arXiv Detail & Related papers (2025-05-08T20:49:01Z) - ClinicRealm: Re-evaluating Large Language Models with Conventional Machine Learning for Non-Generative Clinical Prediction Tasks [37.544994716002016]
Large Language Models (LLMs) are increasingly deployed in medicine.<n>However, their utility in non-generative clinical prediction remains under-evaluated.<n>Our ClinicRealm study addresses this by benchmarking 15 GPT-style LLMs, 5 BERT-style models, and 11 traditional methods.
arXiv Detail & Related papers (2024-07-26T06:09:10Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
Training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
The inference of LlaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.