Related papers: A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment

A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment

URL: http://arxiv.org/abs/2505.10717v2
Date: Wed, 21 May 2025 17:36:21 GMT
Title: A Modular Approach for Clinical SLMs Driven by Synthetic Data with Pre-Instruction Tuning, Model Merging, and Clinical-Tasks Alignment
Authors: Jean-Philippe Corbeil, Amin Dada, Jean-Michel Attendu, Asma Ben Abacha, Alessandro Sordoni, Lucas Caccia, François Beaulieu, Thomas Lin, Jens Kleesiek, Paul Vozila,
Abstract summary: Small language models (SLMs) offer a cost-effective alternative to large language models such as GPT-4.<n>SLMs offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation.<n>We propose a novel framework for adapting SLMs into high-performing clinical models.
Score: 46.776978552161395
License: http://creativecommons.org/licenses/by/4.0/
Abstract: High computation costs and latency of large language models such as GPT-4 have limited their deployment in clinical settings. Small language models (SLMs) offer a cost-effective alternative, but their limited capacity requires biomedical domain adaptation, which remains challenging. An additional bottleneck is the unavailability and high sensitivity of clinical data. To address these challenges, we propose a novel framework for adapting SLMs into high-performing clinical models. We introduce the MediPhi collection of 3.8B-parameter SLMs developed with our novel framework: pre-instruction tuning of experts on relevant medical and clinical corpora (PMC, Medical Guideline, MedWiki, etc.), model merging, and clinical-tasks alignment. To cover most clinical tasks, we extended the CLUE benchmark to CLUE+, doubling its size. Our expert models deliver relative improvements on this benchmark over the base model without any task-specific fine-tuning: 64.3% on medical entities, 49.5% on radiology reports, and 44% on ICD-10 coding (outperforming GPT-4-0125 by 14%). We unify the expert models into MediPhi via model merging, preserving gains across benchmarks. Furthermore, we built the MediFlow collection, a synthetic dataset of 2.5 million high-quality instructions on 14 medical NLP tasks, 98 fine-grained document types, and JSON format support. Alignment of MediPhi using supervised fine-tuning and direct preference optimization achieves further gains of 18.9% on average.

Related papers

Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation [0.0]
This paper compares five Large Language Models (LLMs) deployed between April 2024 and August 2025 for medical QA.<n>Our models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini.<n>Results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks.
arXiv Detail & Related papers (2026-02-16T08:53:23Z)
A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis.<n>Most medical LLMs are trained on data from a single institution, which faces limitations in generalizability and safety in heterogeneous systems.<n>We introduce the model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z)
Benchmarking and Adapting On-Device Large Language Models for Clinical Decision Support [3.165122193962168]
Large language models (LLMs) have rapidly advanced in clinical decision-making.<n>Yet the deployment of proprietary systems is hindered by privacy concerns and reliance on cloud-based infrastructure.
arXiv Detail & Related papers (2025-12-18T22:29:45Z)
A systematic review of trial-matching pipelines using large language models [0.9176056742068814]
Matching patients to clinical trial options is critical for identifying novel treatments, especially in oncology.<n>Large language models (LLMs) offer a promising solution to this problem.<n>This review synthesizes progress in applying LLMs to clinical trial matching, highlighting promising directions and key limitations.
arXiv Detail & Related papers (2025-09-13T21:21:05Z)
MAST-Pro: Dynamic Mixture-of-Experts for Adaptive Segmentation of Pan-Tumors with Knowledge-Driven Prompts [54.915060471994686]
We propose MAST-Pro, a novel framework that integrates dynamic Mixture-of-Experts (D-MoE) and knowledge-driven prompts for pan-tumor segmentation.<n>Specifically, text and anatomical prompts provide domain-specific priors guiding tumor representation learning, while D-MoE dynamically selects experts to balance generic and tumor-specific feature learning.<n>Experiments on multi-anatomical tumor datasets demonstrate that MAST-Pro outperforms state-of-the-art approaches, achieving up to a 5.20% improvement in average improvement while reducing trainable parameters by 91.04%, without compromising accuracy.
arXiv Detail & Related papers (2025-03-18T15:39:44Z)
Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts. MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation. MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z)
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals. GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
Small Language Models Learn Enhanced Reasoning Skills from Medical Textbooks [17.40940406100025]
We introduce Meerkat, a new family of medical AI systems ranging from 7 to 70 billion parameters. Our systems achieved remarkable accuracy across six medical benchmarks. Meerkat-70B correctly diagnosed 21 out of 38 complex clinical cases, outperforming humans' 13.8.
arXiv Detail & Related papers (2024-03-30T14:09:00Z)
Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
Training open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology. For training, we assemble a large dataset of over 697 thousand radiology image-text pairs. For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation. The inference of LlaVA-Rad is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
arXiv Detail & Related papers (2024-03-12T18:12:02Z)
SoftTiger: A Clinical Foundation Model for Healthcare Workflows [5.181665205189493]
We introduce SoftTiger, a clinical large language model (CLaM) designed as a foundation model for healthcare. We collect and annotate data for three subtasks, namely, international patient summary, clinical impression and medical encounter. We supervised fine-tuned a state-of-the-art LLM using public and credentialed clinical data.
arXiv Detail & Related papers (2024-03-01T04:39:16Z)
Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM) Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z)
Do We Still Need Clinical Language Models? [15.023633270864675]
We show that relatively small specialized clinical models substantially outperform all in-context learning approaches. We release the code and the models used under the PhysioNet Credentialed Health Data license and data use agreement.
arXiv Detail & Related papers (2023-02-16T05:08:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.