Related papers: Aligning Language Models with Clinical Expertise: DPO for Heart Failure Nursing Documentation in Critical Care

URL: http://arxiv.org/abs/2510.05410v1
Date: Mon, 06 Oct 2025 22:04:37 GMT
Title: Aligning Language Models with Clinical Expertise: DPO for Heart Failure Nursing Documentation in Critical Care
Authors: Junyi Fan, Li Sun, Negin Ashrafi, Kamiar Alaei, Maryam Pishgar
Abstract summary: This study applies Direct Preference Optimization to adapt Mistral-7B, a locally deployable language model, using 8,838 heart failure nursing notes. Evaluation across BLEU, ROUGE, BERTScore, Perplexity, and expert qualitative assessments demonstrates that DPO markedly enhances documentation quality.
Score: 4.108872110731109
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Nursing documentation in intensive care units (ICUs) provides essential clinical intelligence but often suffers from inconsistent terminology, informal styles, and lack of standardization, challenges that are particularly critical in heart failure care. This study applies Direct Preference Optimization (DPO) to adapt Mistral-7B, a locally deployable language model, using 8,838 heart failure nursing notes from the MIMIC-III database and 21,210 preference pairs derived from expert-verified GPT outputs, model generations, and original notes. Evaluation across BLEU, ROUGE, BERTScore, Perplexity, and expert qualitative assessments demonstrates that DPO markedly enhances documentation quality. Specifically, BLEU increased by 84% (0.173 to 0.318), BERTScore improved by 7.6% (0.828 to 0.891), and expert ratings rose across accuracy (+14.4 points), completeness (+14.5 points), logical consistency (+14.1 points), readability (+11.1 points), and structural clarity (+6.0 points). These results indicate that DPO can align lightweight clinical language models with expert standards, supporting privacy-preserving, AI-assisted documentation within electronic health record systems to reduce administrative burden and improve ICU patient safety.
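The relative gains quoted in the abstract can be checked against the raw scores it reports. The following is a minimal sketch of that arithmetic only; the helper name `relative_change_pct` is hypothetical and is not taken from the paper.

```python
def relative_change_pct(before: float, after: float) -> float:
    """Relative change from `before` to `after`, expressed in percent."""
    return (after - before) / before * 100.0


# Scores as reported in the abstract (baseline -> after DPO).
bleu_gain = relative_change_pct(0.173, 0.318)
bertscore_gain = relative_change_pct(0.828, 0.891)

print(f"BLEU: +{bleu_gain:.1f}%")            # BLEU: +83.8%
print(f"BERTScore: +{bertscore_gain:.1f}%")  # BERTScore: +7.6%
```

Both values agree with the abstract's figures after rounding (84% and 7.6%). The expert ratings are reported as absolute point differences, so they are not relative changes of this kind.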

Related papers

PanCanBench: A Comprehensive Benchmark for Evaluating Large Language Models in Pancreatic Oncology [48.732366302949515]
Large language models (LLMs) have achieved expert-level performance on standardized examinations, yet multiple-choice accuracy poorly reflects real-world clinical utility and safety. We developed a human-in-the-loop pipeline to create expert rubrics for de-identified patient questions. We evaluated 22 proprietary and open-source LLMs using an LLM-as-a-judge framework, measuring clinical completeness, factual accuracy, and web-search integration.
arXiv Detail & Related papers (2026-03-02T00:50:39Z)
Automated Rubrics for Reliable Evaluation of Medical Dialogue Systems [19.880569341968023]
Large Language Models (LLMs) are increasingly used for clinical decision support, where hallucinations and unsafe suggestions may pose direct risks to patient safety. We propose a retrieval-augmented multi-agent framework designed to automate the generation of instance-specific evaluation rubrics.
arXiv Detail & Related papers (2026-01-21T16:40:41Z)
A DeepSeek-Powered AI System for Automated Chest Radiograph Interpretation in Clinical Practice [83.11942224668127]
Janus-Pro-CXR (1B) is a chest X-ray interpretation system based on the DeepSeek Janus-Pro model. Our system outperforms state-of-the-art X-ray report generation models in automated report generation.
arXiv Detail & Related papers (2025-12-23T13:26:13Z)
Cancer Diagnosis Categorization in Electronic Health Records Using Large Language Models and BioBERT: Model Performance Evaluation Study [0.1625256372381793]
We analyzed 762 unique diagnoses (326 International Classification of Diseases (ICD) code descriptions, 436 free-text entries) from 3456 records of patients with cancer. Models were tested on their ability to categorize diagnoses into 14 predefined categories. GPT-3.5, Gemini, and Llama showed lower overall performance on both formats.
arXiv Detail & Related papers (2025-10-08T16:50:40Z)
Grounding Large Language Models in Clinical Evidence: A Retrieval-Augmented Generation System for Querying UK NICE Clinical Guidelines [1.9615061725959186]
This paper presents the development and evaluation of a Retrieval-Augmented Generation system for querying the United Kingdom's National Institute for Health and Care Excellence (NICE) clinical guidelines using Large Language Models (LLMs). The system's retrieval architecture, composed of a hybrid embedding mechanism, was evaluated against a database of 10,195 text chunks derived from three hundred guidelines. It demonstrates high performance, with a Mean Reciprocal Rank (MRR) of 0.814, a Recall of 81% at the first chunk and of 99.1% within the top ten retrieved chunks, when evaluated on 7901 queries.
arXiv Detail & Related papers (2025-10-03T12:57:13Z)
EchoBench: Benchmarking Sycophancy in Medical Large Vision-Language Models [82.43729208063468]
Recent benchmarks for medical Large Vision-Language Models (LVLMs) emphasize leaderboard accuracy, overlooking reliability and safety. We study sycophancy -- models' tendency to uncritically echo user-provided information. We introduce EchoBench, a benchmark to systematically evaluate sycophancy in medical LVLMs.
arXiv Detail & Related papers (2025-09-24T14:09:55Z)
Evaluating Large Language Models for Evidence-Based Clinical Question Answering [4.101088122511548]
Large Language Models (LLMs) have demonstrated substantial progress in biomedical and clinical applications. We curate a benchmark drawing from Cochrane systematic reviews and clinical guidelines. We observe consistent performance patterns across sources and clinical domains.
arXiv Detail & Related papers (2025-09-13T15:03:34Z)
Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
Preserving Privacy, Increasing Accessibility, and Reducing Cost: An On-Device Artificial Intelligence Model for Medical Transcription and Note Generation [0.0]
We develop and evaluate a privacy-preserving, on-device medical transcription system using a fine-tuned Llama 3.2 1B model. The model is capable of generating structured medical notes from medical transcriptions while maintaining complete data sovereignty entirely in the browser.
arXiv Detail & Related papers (2025-07-03T01:51:49Z)
MedHELM: Holistic Evaluation of Large Language Models for Medical Tasks [47.486705282473984]
Large language models (LLMs) achieve near-perfect scores on medical exams. These evaluations inadequately reflect the complexity and diversity of real-world clinical practice. We introduce MedHELM, an evaluation framework for assessing LLM performance for medical tasks.
arXiv Detail & Related papers (2025-05-26T22:55:49Z)
Assessing the Quality of AI-Generated Clinical Notes: A Validated Evaluation of a Large Language Model Scribe [0.0]
We developed a blinded study comparing the relative performance of large language model (LLM) generated clinical notes with those from field experts based on audio-recorded clinical encounters. Quantitative metrics from the Physician Documentation Quality Instrument (PDQI9) provided a framework to measure note quality. We found a modest yet significant difference in the overall note quality, wherein Gold notes achieved a score of 4.25 out of 5 and Ambient notes scored 4.20 out of 5.
arXiv Detail & Related papers (2025-05-15T16:14:53Z)
A Multi-Phase Analysis of Blood Culture Stewardship: Machine Learning Prediction, Expert Recommendation Assessment, and LLM Automation [2.25639842999394]
Blood cultures are often over-ordered without clear justification. In a study of 135,483 emergency department (ED) blood culture orders, we developed machine learning (ML) models to predict the risk of bacteremia.
arXiv Detail & Related papers (2025-04-09T21:12:29Z)
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records [60.35217378132709]
Large language models (LLMs) can follow natural language instructions with human-level fluency. Evaluating LLMs on realistic text generation tasks for healthcare remains challenging. We introduce MedAlign, a benchmark dataset of 983 natural language instructions for EHR data.
arXiv Detail & Related papers (2023-08-27T12:24:39Z)

This list is automatically generated from the titles and abstracts of the papers on this site.