Related papers: FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training

URL: http://arxiv.org/abs/2501.09213v3
Date: Wed, 30 Jul 2025 08:05:40 GMT
Title: FineMedLM-o1: Enhancing Medical Knowledge Reasoning Ability of LLM from Supervised Fine-Tuning to Test-Time Training
Authors: Hongzhou Yu, Tianhao Cheng, Yingwen Wang, Wen He, Qing Wang, Ying Cheng, Yuejie Zhang, Rui Feng, Xiaobo Zhang
Abstract summary: FineMedLM-o1 is a medical large language model with deep reasoning capabilities. We introduce Test-Time Training (TTT) in the medical domain for the first time, facilitating domain adaptation and ensuring reliable, accurate reasoning. The project and data will be released on GitHub.
Score: 20.259483872569987
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advancements in large language models (LLMs) have shown promise in medical applications such as disease diagnosis and treatment planning. However, most existing medical LLMs struggle with the deep reasoning required for complex medical problems, such as differential diagnosis and medication recommendations. We propose FineMedLM-o1, which leverages high-quality medical synthetic data and long-form reasoning data for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), enabling advanced dialogue and deep reasoning capabilities. Additionally, we introduce Test-Time Training (TTT) in the medical domain for the first time, facilitating domain adaptation and ensuring reliable, accurate reasoning. Experimental results demonstrate that FineMedLM-o1 achieves a 23% average performance improvement over prior models on key medical benchmarks. Furthermore, the introduction of TTT provides an additional 14% performance boost, highlighting its effectiveness in enhancing medical reasoning capabilities. To support this process, we also propose a novel method for synthesizing medical dialogue. Compared to other open-source datasets, our dataset stands out as superior in both quality and complexity. The project and data will be released on GitHub.

Related papers

Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning [57.873833577058]
We build a multimodal dataset enriched with extensive medical knowledge. We then introduce our medical-specialized MLLM: Lingshu. Lingshu undergoes multi-stage training to embed medical expertise and enhance its task-solving capabilities.
arXiv Detail & Related papers (2025-06-08T08:47:30Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
IIMedGPT: Promoting Large Language Model Capabilities of Medical Tasks by Efficient Human Preference Alignment [6.022433954095106]
We introduce a medical instruction dataset, CMedINS, containing six medical instructions derived from actual medical tasks. We then launch our medical model, IIMedGPT, employing an efficient preference alignment method. The results show that our final model outperforms existing medical models in medical dialogue.
arXiv Detail & Related papers (2025-01-06T09:22:36Z)
CareBot: A Pioneering Full-Process Open-Source Medical Language Model [8.868481107848185]
CareBot is a bilingual medical LLM that integrates continuous pre-training (CPT), supervised fine-tuning (SFT), and reinforcement learning with human feedback (RLHF). DataRater is a model designed to assess data quality during CPT, ensuring that the training data is both accurate and relevant. Our rigorous evaluations on Chinese and English benchmarks confirm CareBot's effectiveness in medical consultation and education.
arXiv Detail & Related papers (2024-12-12T05:27:43Z)
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization [25.937453082034448]
We propose MMedPO, a novel multimodal medical preference optimization approach. MMedPO considers the clinical relevance of preference samples to enhance Med-LVLM alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs.
arXiv Detail & Related papers (2024-12-09T01:50:39Z)
STLLaVA-Med: Self-Training Large Language and Vision Assistant for Medical Question-Answering [58.79671189792399]
STLLaVA-Med is designed to train a policy model capable of auto-generating medical visual instruction data. We validate the efficacy and data efficiency of STLLaVA-Med across three major medical Visual Question Answering (VQA) benchmarks.
arXiv Detail & Related papers (2024-06-28T15:01:23Z)
Med42 -- Evaluating Fine-Tuning Strategies for Medical LLMs: Full-Parameter vs. Parameter-Efficient Approaches [7.3384872719063114]
We developed and refined a series of medical Large Language Models (LLMs) based on the Llama-2 architecture. Our experiments systematically evaluate the effectiveness of these tuning strategies across various well-known medical benchmarks.
arXiv Detail & Related papers (2024-04-23T06:36:21Z)
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce AI Hospital, a framework simulating dynamic medical interactions between Doctor as player and NPCs. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
Large Language Model Distilling Medication Recommendation Model [61.89754499292561]
We harness the powerful semantic comprehension and input-agnostic characteristics of Large Language Models (LLMs). Our research aims to transform existing medication recommendation methodologies using LLMs. To mitigate this, we have developed a feature-level knowledge distillation technique, which transfers the LLM's proficiency to a more compact model.
arXiv Detail & Related papers (2024-02-05T08:25:22Z)
Integrating Physician Diagnostic Logic into Large Language Models: Preference Learning from Process Feedback [19.564416963801268]
We propose an approach called preference learning from process feedback. PLPF integrates the doctor's diagnostic logic into LLMs. We show that PLPF enhances the diagnostic accuracy of the baseline model in medical conversations by 17.6%.
arXiv Detail & Related papers (2024-01-11T06:42:45Z)
ChiMed-GPT: A Chinese Medical Large Language Model with Full Training Regime and Better Alignment to Human Preferences [51.66185471742271]
We propose ChiMed-GPT, a benchmark LLM designed explicitly for the Chinese medical domain. ChiMed-GPT undergoes a comprehensive training regime with pre-training, SFT, and RLHF. We analyze possible biases through prompting ChiMed-GPT to perform attitude scales regarding discrimination of patients.
arXiv Detail & Related papers (2023-11-10T12:25:32Z)
Large Language Models for Healthcare Data Augmentation: An Example on Patient-Trial Matching [49.78442796596806]
We propose an innovative privacy-aware data augmentation approach for patient-trial matching (LLM-PTM). Our experiments demonstrate a 7.32% average improvement in performance using the proposed LLM-PTM method, and the generalizability to new data is improved by 12.12%.
arXiv Detail & Related papers (2023-03-24T03:14:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.