Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian
- URL: http://arxiv.org/abs/2602.17475v1
- Date: Thu, 19 Feb 2026 15:38:46 GMT
- Title: Small LLMs for Medical NLP: a Systematic Analysis of Few-Shot, Constraint Decoding, Fine-Tuning and Continual Pre-Training in Italian
- Authors: Pietro Ferrazzi, Mattia Franzin, Alberto Lavelli, Bernardo Magnini,
- Abstract summary: Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks. In this work, we investigate whether "small" LLMs can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks.
- Score: 2.415128123637063
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) consistently excel in diverse medical Natural Language Processing (NLP) tasks, yet their substantial computational requirements often limit deployment in real-world healthcare settings. In this work, we investigate whether "small" LLMs (around one billion parameters) can effectively perform medical tasks while maintaining competitive accuracy. We evaluate models from three major families (Llama-3, Gemma-3, and Qwen3) across 20 clinical NLP tasks spanning Named Entity Recognition, Relation Extraction, Case Report Form Filling, Question Answering, and Argument Mining. We systematically compare a range of adaptation strategies, both at inference time (few-shot prompting, constraint decoding) and at training time (supervised fine-tuning, continual pre-training). Fine-tuning emerges as the most effective approach, while the combination of few-shot prompting and constraint decoding offers a strong lower-resource alternative. Our results show that small LLMs can match or even surpass larger baselines, with our best configuration, based on Qwen3-1.7B, achieving an average score 9.2 points higher than Qwen3-32B. We release a comprehensive collection of all the publicly available Italian medical datasets for NLP tasks, together with our top-performing models. Furthermore, we release an Italian dataset of 126M words from the Emergency Department of an Italian hospital, and 175M words from various sources that we used for continual pre-training.
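The constraint decoding strategy compared in the abstract can be illustrated with a minimal sketch: at each decoding step, candidate scores are masked so that only tokens keeping the output a valid prefix of some allowed label can be emitted. The toy scorer, label set, and character-level "tokens" below are illustrative stand-ins, not the paper's actual implementation:

```python
# Minimal sketch of constraint decoding for a label-prediction task.
# Tokens here are single characters; `score_fn` stands in for model logits.

def constrained_decode(score_fn, labels, max_len=16):
    """Greedily build an output guaranteed to be one of `labels`."""
    out = ""
    for _ in range(max_len):
        # Tokens that keep `out` a prefix of at least one allowed label.
        allowed = {lab[len(out)] for lab in labels
                   if lab.startswith(out) and len(lab) > len(out)}
        if not allowed:  # `out` is a complete label: stop.
            break
        # Emit the allowed token the model scores highest.
        out += max(allowed, key=lambda tok: score_fn(out, tok))
    return out

# Toy "model": scores 1.0 for the character matching a gold string.
gold = "fever"
score = lambda prefix, tok: 1.0 if len(prefix) < len(gold) and gold[len(prefix)] == tok else 0.0
labels = ["fever", "fracture", "none"]
print(constrained_decode(score, labels))  # prints: fever
```

Even with an unreliable scorer, the mask guarantees the output is always a member of the label set, which is why constraint decoding pairs well with few-shot prompting on small models.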
Related papers
- Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation [0.0]
This paper compares five Large Language Models (LLMs) deployed between April 2024 and August 2025 for medical QA. The models include Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. Results show that larger models like Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks.
arXiv Detail & Related papers (2026-02-16T08:53:23Z) - A Federated and Parameter-Efficient Framework for Large Language Model Training in Medicine [59.78991974851707]
Large language models (LLMs) have demonstrated strong performance on medical benchmarks, including question answering and diagnosis. Most medical LLMs are trained on data from a single institution, which limits generalizability and safety in heterogeneous systems. We introduce a model-agnostic and parameter-efficient federated learning framework for adapting LLMs to medical applications.
arXiv Detail & Related papers (2026-01-29T18:48:21Z) - Enhancing the Medical Context-Awareness Ability of LLMs via Multifaceted Self-Refinement Learning [49.559151128219725]
Large language models (LLMs) have shown great promise in the medical domain, achieving strong performance on several benchmarks. However, they continue to underperform in real-world medical scenarios, which often demand stronger context-awareness. We propose Multifaceted Self-Refinement (MuSeR), a data-driven approach that enhances LLMs' context-awareness along three key facets.
arXiv Detail & Related papers (2025-11-13T08:13:23Z) - Evaluating Open-Weight Large Language Models for Structured Data Extraction from Narrative Medical Reports Across Multiple Use Cases and Languages [2.3429123017483016]
Large language models (LLMs) are increasingly used to extract structured information from free-text clinical records. We evaluated 15 open-weight LLMs on pathology and radiology reports across six use cases.
arXiv Detail & Related papers (2025-11-03T12:32:01Z) - Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA [0.6015898117103068]
Medical Visual Question Answering (MedVQA) enables natural language queries over medical images to support clinical decision-making and patient care. We present the MasonNLP system, which employs a general-domain, instruction-tuned large language model with a retrieval-augmented generation (RAG) framework. The best-performing system ranked 3rd among 19 teams and 51 submissions, with an average score of 41.37%.
arXiv Detail & Related papers (2025-10-12T07:03:58Z) - MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian. We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z) - Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks [2.7729041396205014]
This study evaluates the classification performance of five open-source large language models (LLMs). We report precision, recall, and F1 scores with 95% confidence intervals for all model-task combinations.
arXiv Detail & Related papers (2025-03-19T12:51:52Z) - LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment. We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews. Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z) - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv Detail & Related papers (2024-08-22T17:01:34Z) - Qibo: A Large Language Model for Traditional Chinese Medicine [10.394665777883064]
In traditional Chinese medicine, there are challenges such as the essential differences between theory and modern medicine.
We propose a two-stage training approach that combines continuous pre-training and supervised fine-tuning.
A notable contribution of our study is the processing of a 2GB corpus dedicated to TCM.
arXiv Detail & Related papers (2024-03-24T07:48:05Z) - Large Language Model Distilling Medication Recommendation Model [58.94186280631342]
We harness the powerful semantic comprehension and input-agnostic characteristics of Large Language Models (LLMs). Our research aims to transform existing medication recommendation methodologies using LLMs. To mitigate their computational cost, we have developed a feature-level knowledge distillation technique, which transfers the LLM's proficiency to a more compact model.
arXiv Detail & Related papers (2024-02-05T08:25:22Z)
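The distillation entry above transfers a large model's behavior into a compact one. The paper's feature-level method is not reproduced here, but the standard logit-distillation objective such approaches build on can be sketched as follows (temperature, logits, and class count are illustrative):

```python
# Sketch of temperature-scaled knowledge distillation:
# the student is trained to match the teacher's softened output distribution.
import math

def softmax(logits, T=1.0):
    """Softmax over temperature-scaled logits."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients keep the same magnitude across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * T * T

# Toy per-class logits for one example (three classes, made up).
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]
print(distill_loss(teacher, student))
```

The loss is zero when the student exactly matches the teacher and strictly positive otherwise; minimizing it over a training set is what compresses the large model's "dark knowledge" into the smaller one.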
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.