Related papers: Stabilizing Reasoning in Medical LLMs with Continued Pretraining and Reasoning Preference Optimization

URL: http://arxiv.org/abs/2504.18080v1
Date: Fri, 25 Apr 2025 05:15:31 GMT
Title: Stabilizing Reasoning in Medical LLMs with Continued Pretraining and Reasoning Preference Optimization
Authors: Wataru Kawakami, Keita Suzuki, Junichiro Iwasawa,
Abstract summary: This paper introduces Preferred-MedLLM-Qwen-72B, a 72B-parameter model optimized for the Japanese medical domain. We employ a two-stage fine-tuning process on the Qwen2.5-72B base model to achieve both high accuracy and stable reasoning.
Score: 0.06554326244334867
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) show potential in medicine, yet clinical adoption is hindered by concerns over factual accuracy, language-specific limitations (e.g., Japanese), and, critically, their reliability when required to generate reasoning explanations, a prerequisite for trust. This paper introduces Preferred-MedLLM-Qwen-72B, a 72B-parameter model optimized for the Japanese medical domain to achieve both high accuracy and stable reasoning. We employ a two-stage fine-tuning process on the Qwen2.5-72B base model: first, Continued Pretraining (CPT) on a comprehensive Japanese medical corpus instills deep domain knowledge. Second, Reasoning Preference Optimization (RPO), a preference-based method, enhances the generation of reliable reasoning pathways while preserving high answer accuracy. Evaluations on the Japanese Medical Licensing Exam benchmark (IgakuQA) show Preferred-MedLLM-Qwen-72B achieves state-of-the-art performance (0.868 accuracy), surpassing strong proprietary models like GPT-4o (0.866). Crucially, unlike baseline or CPT-only models, which exhibit significant accuracy degradation (up to 11.5% and 3.8% respectively on IgakuQA) when prompted for explanations, our model maintains its high accuracy (0.868) under such conditions. This highlights RPO's effectiveness in stabilizing reasoning generation. This work underscores the importance of optimizing for reliable explanations alongside accuracy. We release the Preferred-MedLLM-Qwen-72B model weights to foster research into trustworthy LLMs for specialized, high-stakes applications.

Related papers

Accurate and Interpretable Postmenstrual Age Prediction via Multimodal Large Language Model [4.176819645111134]
We introduce a parameter-efficient fine-tuning (PEFT) strategy using instruction tuning and Low-Rank Adaptation (LoRA). By employing distinct prompts for training and inference, our approach enables the MLLM to handle a regression task during training and generate clinically relevant explanations during inference. The fine-tuned model achieves a low prediction error with a 95 percent confidence interval of 0.78 to 1.52 weeks, while producing interpretable outputs grounded in developmental features.
arXiv Detail & Related papers (2025-08-04T15:35:36Z)
Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning. We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z)
Med-PRM: Medical Reasoning Models with Stepwise, Guideline-verified Process Rewards [21.831262938278915]
We introduce Med-PRM, a process reward modeling framework to verify each reasoning step against established medical knowledge bases. Med-PRM achieves state-of-the-art performance, improving the performance of base models by up to 13.50%. We demonstrate the generality of Med-PRM by integrating it in a plug-and-play fashion with strong policy models such as Meerkat.
arXiv Detail & Related papers (2025-06-13T05:36:30Z)
Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv Detail & Related papers (2025-03-06T18:35:39Z)
Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model [1.7064514726335305]
We analyzed 9,683 Hebrew radiology reports from Crohn's disease patients. We incorporated uncertainty-aware prompt ensembles and an agent-based decision model.
arXiv Detail & Related papers (2025-02-02T16:57:03Z)
Leveraging Large Language Models to Enhance Machine Learning Interpretability and Predictive Performance: A Case Study on Emergency Department Returns for Mental Health Patients [2.3769374446083735]
Emergency department (ED) returns for mental health conditions pose a major healthcare burden, with 24-27% of patients returning within 30 days. This study assesses whether integrating large language models (LLMs) with machine learning improves predictive accuracy and clinical interpretability of ED mental health return risk models.
arXiv Detail & Related papers (2025-01-21T15:41:20Z)
Detecting Bias and Enhancing Diagnostic Accuracy in Large Language Models for Healthcare [0.2302001830524133]
Biased AI-generated medical advice and misdiagnoses can jeopardize patient safety. This study introduces new resources designed to promote ethical and precise AI in healthcare.
arXiv Detail & Related papers (2024-10-09T06:00:05Z)
Adapting LLMs for the Medical Domain in Portuguese: A Study on Fine-Tuning and Model Evaluation [1.922611370494431]
This study evaluates the performance of large language models (LLMs) as medical agents in Portuguese. The InternLM2 model, with initial training on medical data, presented the best overall performance. DrBode models, derived from ChatBode, exhibited a phenomenon of catastrophic forgetting of acquired medical knowledge.
arXiv Detail & Related papers (2024-09-30T19:10:03Z)
Eir: Thai Medical Large Language Models [0.0]
Eir-8B is a large language model with 8 billion parameters designed to enhance the accuracy of handling medical tasks in the Thai language. Human evaluation was conducted to ensure that the model adheres to care standards and provides unbiased answers. The model is deployed within the hospital's internal network, ensuring both high security and faster processing speeds.
arXiv Detail & Related papers (2024-09-13T04:06:00Z)
Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs [54.05511925104712]
We propose a simple, effective, and data-efficient method called Step-DPO. Step-DPO treats individual reasoning steps as units for preference optimization rather than evaluating answers holistically. Our findings demonstrate that as few as 10K preference data pairs and fewer than 500 Step-DPO training steps can yield a nearly 3% gain in accuracy on MATH for models with over 70B parameters.
arXiv Detail & Related papers (2024-06-26T17:43:06Z)
Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences. We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game. Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z)
Exploiting prompt learning with pre-trained language models for Alzheimer's Disease detection [70.86672569101536]
Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care and delaying further progression. This paper investigates the use of prompt-based fine-tuning of PLMs that consistently uses AD classification errors as the training objective function.
arXiv Detail & Related papers (2022-10-29T09:18:41Z)
UNITE: Uncertainty-based Health Risk Prediction Leveraging Multi-sourced Data [81.00385374948125]
We present the UNcertaInTy-based hEalth risk prediction (UNITE) model. UNITE provides accurate disease risk prediction and uncertainty estimation leveraging multi-sourced health data. We evaluate UNITE on real-world disease risk prediction tasks: nonalcoholic fatty liver disease (NASH) and Alzheimer's disease (AD). UNITE achieves up to 0.841 in F1 score for AD detection and up to 0.609 in PR-AUC for NASH detection, outperforming various state-of-the-art baselines, with a gain of up to 19% over the best baseline.
arXiv Detail & Related papers (2020-10-22T02:28:11Z)
Providing reliability in Recommender Systems through Bernoulli Matrix Factorization [63.732639864601914]
This paper proposes Bernoulli Matrix Factorization (BeMF) to provide both prediction values and reliability values. BeMF acts on model-based collaborative filtering rather than on memory-based filtering. The more reliable a prediction is, the less liable it is to be wrong.
arXiv Detail & Related papers (2020-06-05T14:24:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.