Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
- URL: http://arxiv.org/abs/2508.01450v1
- Date: Sat, 02 Aug 2025 17:50:35 GMT
- Title: Towards Efficient Medical Reasoning with Minimal Fine-Tuning Data
- Authors: Xinlin Zhuang, Feilong Tang, Haolin Yang, Ming Hu, Huifa Li, Haochen Xue, Yichen Li, Junjun He, Zongyuan Ge, Ying Qian, Imran Razzak
- Abstract summary: Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning. Existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity. We propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the high-difficulty-high-influence quadrant.
- Score: 30.407699113696076
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Supervised Fine-Tuning (SFT) plays a pivotal role in adapting Large Language Models (LLMs) to specialized domains such as medical reasoning. However, existing SFT practices often rely on unfiltered datasets that contain redundant and low-quality samples, leading to substantial computational costs and suboptimal performance. Although existing methods attempt to alleviate this problem by selecting data based on sample difficulty, defined by knowledge and reasoning complexity, they overlook each sample's optimization utility as reflected in its gradient. Interestingly, we find that gradient-based influence alone favors easy-to-optimize samples that cause large parameter shifts but lack deep reasoning chains, while difficulty alone selects noisy or overly complex cases that fail to guide stable optimization. Based on this observation, we propose a data selection strategy, Difficulty-Influence Quadrant (DIQ), which prioritizes samples in the high-difficulty-high-influence quadrant to balance complex clinical reasoning with substantial gradient influence, enabling efficient medical reasoning with minimal fine-tuning data. Furthermore, human and LLM-as-a-judge evaluations show that DIQ-selected subsets exhibit higher data quality and generate clinical reasoning that is more aligned with expert practice in differential diagnosis, safety checks, and evidence citation, as DIQ emphasizes samples that foster expert-like reasoning patterns. Extensive experiments on medical reasoning benchmarks demonstrate that DIQ enables models fine-tuned on only 1% of selected data to match full-dataset performance, while using 10% consistently outperforms the baseline, highlighting the superiority of principled data selection over brute-force scaling. The code and data are available at https://github.com/mihara-bot/DIQ.
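The quadrant idea lends itself to a compact illustration. Below is a minimal sketch of quadrant-style selection consistent with the abstract's description; the percentile-rank scoring and the min-of-ranks criterion are illustrative assumptions, not the paper's exact implementation, and in practice the difficulty and influence scores would come from reasoning-complexity and gradient-influence estimates respectively.

```python
import numpy as np

def diq_style_select(difficulty, influence, budget):
    """Toy quadrant-style selection in the spirit of DIQ: keep samples that
    rank highly on BOTH difficulty and gradient influence.

    difficulty, influence: 1-D arrays of per-sample scores (any scale).
    budget: number of samples to keep.
    """
    # Convert raw scores to percentile ranks so the two axes are comparable.
    d_rank = np.argsort(np.argsort(difficulty)) / (len(difficulty) - 1)
    i_rank = np.argsort(np.argsort(influence)) / (len(influence) - 1)
    # Samples in the high-difficulty-high-influence quadrant score highest
    # under the minimum of the two ranks (both axes must be high at once).
    joint = np.minimum(d_rank, i_rank)
    return np.argsort(-joint)[:budget]

# Usage: select 1% of a 10k-sample corpus with placeholder scores.
rng = np.random.default_rng(0)
difficulty = rng.random(10_000)  # e.g., reasoning-chain length or QA error rate
influence = rng.random(10_000)   # e.g., gradient-similarity influence estimate
subset = diq_style_select(difficulty, influence, budget=100)
```

Taking the minimum of the two percentile ranks is one simple way to require a sample to be high on both axes; an alternative would be to threshold each axis at its median to carve out the literal quadrant.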
Related papers
- InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities [27.09178257629886]
InfiAlign is a scalable and sample-efficient post-training framework for large language models (LLMs).
At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning datasets.
Our results highlight the effectiveness of combining principled data selection with full-stage post-training.
arXiv Detail & Related papers (2025-08-07T15:34:06Z)
- TACOS: Open Tagging and Comparative Scoring for Instruction Fine-Tuning Data Selection [9.020110377060153]
We present TACOS, an innovative method that integrates Open Tagging and Comparative Scoring for IFT data selection.
To capture data diversity, we leverage LLMs to assign open-domain tags to human queries.
We suggest a comparative scoring method that allows the relative quality evaluation of samples within a cluster, avoiding inconsistent criteria seen in singleton-based evaluations.
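As a rough illustration of the tag-then-compare idea, the sketch below groups queries by their LLM-assigned tag and keeps the relatively best samples per cluster; the dict schema and the scalar score field (standing in for an LLM's within-cluster comparative judgment) are assumptions for illustration.

```python
from collections import defaultdict

def select_by_comparative_scoring(samples, top_k_per_cluster=2):
    """Group samples by their (LLM-assigned) open tag, then keep the
    relatively best samples within each cluster.

    Each sample is a dict: {"query": str, "tag": str, "score": float},
    where "score" stands in for an LLM's comparative judgment.
    """
    clusters = defaultdict(list)
    for s in samples:
        clusters[s["tag"]].append(s)
    selected = []
    for tag, members in clusters.items():
        # Comparative scoring: rank only against peers in the same cluster,
        # so quality criteria stay consistent within a topic.
        members.sort(key=lambda s: s["score"], reverse=True)
        selected.extend(members[:top_k_per_cluster])
    return selected

samples = [
    {"query": "Explain QT prolongation", "tag": "cardiology", "score": 0.9},
    {"query": "List beta blockers", "tag": "cardiology", "score": 0.4},
    {"query": "Interpret an ABG", "tag": "pulmonology", "score": 0.7},
]
print([s["query"] for s in select_by_comparative_scoring(samples, 1)])
```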
arXiv Detail & Related papers (2025-07-04T15:46:07Z)
- A Novel Double Pruning method for Imbalanced Data using Information Entropy and Roulette Wheel Selection for Breast Cancer Diagnosis [2.8661021832561757]
The SMOTEBoost method generates synthetic data to balance the dataset, but it may overlook crucial overlapping regions near the decision boundary.
This paper proposes RE-SMOTEBoost, an enhanced version of SMOTEBoost, designed to overcome these limitations.
It incorporates a filtering mechanism based on information entropy to reduce noise and borderline cases, and to improve the quality of generated data.
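Both named ingredients are standard and easy to sketch. The toy version below filters synthetic candidates by local label entropy and then draws survivors by roulette-wheel (fitness-proportional) selection; the 0.8 entropy threshold and the use of k-nearest-neighbor labels are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def entropy_filter_and_roulette(candidates, neighbor_labels, n_keep, rng=None):
    """Filter noisy synthetic minority samples by local label entropy, then
    sample survivors with roulette-wheel selection.

    candidates: (n, d) synthetic minority samples.
    neighbor_labels: (n, k) 0/1 labels of each candidate's k nearest real
        neighbors; a mixed neighborhood signals an overlapping region.
    """
    rng = rng or np.random.default_rng(0)
    # Binary label entropy of each candidate's neighborhood: high entropy
    # means the candidate sits near the decision boundary.
    p = neighbor_labels.mean(axis=1).clip(1e-9, 1 - 1e-9)
    entropy = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))
    keep = entropy < 0.8                      # entropy threshold (assumed)
    pool, fitness = candidates[keep], 1.0 - entropy[keep]
    # Roulette wheel: sampling probability proportional to fitness.
    probs = fitness / fitness.sum()
    idx = rng.choice(len(pool), size=min(n_keep, len(pool)),
                     replace=False, p=probs)
    return pool[idx]
```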
arXiv Detail & Related papers (2025-03-15T19:34:15Z)
- FastMCTS: A Simple Sampling Strategy for Data Synthesis [67.60823802317141]
We introduce FastMCTS, an innovative data synthesis strategy inspired by Monte Carlo Tree Search.
FastMCTS provides a more efficient sampling method for multi-step reasoning data, offering step-level evaluation signals.
Experiments on both English and Chinese reasoning datasets demonstrate that FastMCTS generates over 30% more correct reasoning paths.
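To make the "step-level signals steer sampling" idea concrete, here is a minimal best-first tree search over reasoning steps. It is only loosely in the MCTS-inspired spirit of the summary, not the paper's algorithm; expand and score are assumed callables, e.g. an LLM that proposes next steps and a verifier that values partial chains.

```python
import heapq

def best_first_reasoning_search(expand, score, root, max_nodes=50, max_depth=4):
    """Minimal best-first search over reasoning paths.

    expand(path) -> list of candidate next steps (strings).
    score(path)  -> step-level value estimate in [0, 1] for a partial path.
    Returns complete paths ordered by their final score.
    """
    frontier = [(-score([root]), [root])]  # max-heap via negated scores
    finished, popped = [], 0
    while frontier and popped < max_nodes:
        neg, path = heapq.heappop(frontier)
        popped += 1
        if len(path) >= max_depth:
            finished.append((-neg, path))
            continue
        for step in expand(path):
            new_path = path + [step]
            # Step-level signal: promising partial chains are expanded
            # first, so compute is not wasted on low-value branches.
            heapq.heappush(frontier, (-score(new_path), new_path))
    return sorted(finished, reverse=True)
```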
arXiv Detail & Related papers (2025-02-17T06:27:57Z)
- A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, imposing a negative impact on training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
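One simple way multimodal information can flag noisy samples is image-text alignment. The sketch below scores each image against the text prompt of its assigned class using an off-the-shelf CLIP from Hugging Face transformers; the prompt template, softmax scoring, and any pruning threshold are assumptions for illustration, not the paper's full framework.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def alignment_scores(images, class_names, labels):
    """Score how well each image matches the text prompt of its own label.

    images: list of PIL images; class_names: list of str; labels: list of
    int class indices. Samples whose image barely matches their label's
    prompt are likely mislabeled or noisy and can be pruned first.
    """
    inputs = processor(text=[f"a photo of a {c}" for c in class_names],
                       images=images, return_tensors="pt", padding=True)
    out = model(**inputs)
    sims = out.logits_per_image.softmax(dim=-1)   # (n_images, n_classes)
    # Alignment of each image with the prompt of its assigned label.
    return sims[torch.arange(len(labels)), torch.tensor(labels)]

# Keep only samples above an (assumed) alignment threshold, e.g. 0.5.
```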
arXiv Detail & Related papers (2024-10-15T03:00:58Z)
- Reward-Augmented Data Enhances Direct Preference Alignment of LLMs [63.32585910975191]
We introduce reward-conditioned Large Language Models (LLMs) that learn from the entire spectrum of response quality within the dataset.
We show that our approach consistently boosts DPO by a considerable margin.
Our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion.
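Reward conditioning of this general kind can be illustrated with a small data transform: tag each prompt with a discretized quality level so the model sees the whole quality spectrum rather than only the chosen responses. The tag format, the binning, and the reward_fn placeholder below are illustrative assumptions, not the paper's exact construction.

```python
def reward_condition(dataset, reward_fn, n_bins=5):
    """Toy reward-conditioning of fine-tuning data: prepend a discretized
    quality tag to each prompt so the model learns to associate responses
    with their quality level instead of learning only from 'chosen' ones.

    dataset: iterable of (prompt, response) pairs.
    reward_fn(prompt, response) -> scalar reward in [0, 1].
    """
    conditioned = []
    for prompt, response in dataset:
        r = reward_fn(prompt, response)
        bin_id = min(int(r * n_bins), n_bins - 1)
        # Quality tag becomes part of the conditioning context.
        tagged = f"<quality:{bin_id + 1}/{n_bins}> {prompt}"
        conditioned.append((tagged, response))
    return conditioned
```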
arXiv Detail & Related papers (2024-10-10T16:01:51Z)
- Investigating the Impact of Hard Samples on Accuracy Reveals In-class Data Imbalance [4.291589126905706]
In the AutoML domain, test accuracy is heralded as the quintessential metric for evaluating model efficacy.
However, the reliability of test accuracy as the primary performance metric has been called into question.
The distribution of hard samples between training and test sets affects the difficulty levels of those sets.
We propose a benchmarking procedure for comparing hard sample identification methods.
arXiv Detail & Related papers (2024-09-22T11:38:14Z)
- Not All Samples Should Be Utilized Equally: Towards Understanding and Improving Dataset Distillation [57.6797306341115]
We take an initial step towards understanding various matching-based DD methods from the perspective of sample difficulty.
We then extend the neural scaling laws of data pruning to DD to theoretically explain these matching-based methods.
We introduce the Sample Difficulty Correction (SDC) approach, designed to predominantly generate easier samples to achieve higher dataset quality.
arXiv Detail & Related papers (2024-08-22T15:20:32Z)
- LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection.
We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks.
Our method goes beyond surface form cues to identify data that builds the necessary reasoning skills for the intended downstream application.
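The mechanism named in the title is easy to sketch: rank training samples by the cosine similarity between their low-rank (e.g., LoRA) gradient features and the average gradient feature of a few target-task examples. This is a sketch of the idea as stated in the summary; the pre-computed gradient features, the 5% budget, and the mean-gradient target are simplifying assumptions.

```python
import torch

def less_style_selection(train_grads, val_grads, frac=0.05):
    """Influence-by-gradient-similarity selection in the spirit of LESS.

    train_grads: (N, d) per-sample low-rank gradient features for the
        training pool (e.g., projected LoRA gradients).
    val_grads:   (M, d) gradient features for a few target-task examples.
    Returns the indices of the top frac fraction of the pool.
    """
    target = val_grads.mean(dim=0, keepdim=True)   # (1, d) task direction
    # Cosine similarity of each training gradient with the task direction.
    sims = torch.nn.functional.cosine_similarity(train_grads, target)
    k = max(1, int(frac * train_grads.shape[0]))
    return sims.topk(k).indices
```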
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
- Towards Accelerated Model Training via Bayesian Data Selection [45.62338106716745]
Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss, but its practical adoption relies on less principled approximations and additional holdout data.
This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models.
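A simplified per-sample score in this spirit follows the reducible-loss idea: prioritize examples the current model still gets wrong but that an off-the-shelf zero-shot reference finds learnable. This is a hedged simplification of the selection principle described above, not the paper's Bayesian treatment; model, zero_shot_model, and loss_fn are placeholders.

```python
import torch

@torch.no_grad()
def generalization_impact_scores(model, zero_shot_model, batch, loss_fn):
    """Per-sample selection scores: high current training loss combined
    with low loss under a zero-shot reference (built from a large
    pre-trained model) marks a sample as both unlearned and learnable.

    loss_fn must return per-sample losses (reduction="none").
    """
    x, y = batch
    cur = loss_fn(model(x), y)            # what the model still gets wrong
    ref = loss_fn(zero_shot_model(x), y)  # what a reference finds learnable
    return cur - ref                      # higher = more worth training on
```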
arXiv Detail & Related papers (2023-08-21T07:58:15Z)
- Delving into Identify-Emphasize Paradigm for Combating Unknown Bias [52.76758938921129]
We propose an effective bias-conflicting scoring method (ECS) to boost the identification accuracy.
We also propose gradient alignment (GA) to balance the contributions of the mined bias-aligned and bias-conflicting samples.
Experiments are conducted on multiple datasets in various settings, demonstrating that the proposed solution can mitigate the impact of unknown biases.
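Gradient alignment of this flavor can be sketched as a loss reweighting: scale the mined bias-conflicting group so it contributes as much to the update as the larger bias-aligned group. The single scalar weight below is an illustrative simplification, not the paper's exact scheme.

```python
import torch

def gradient_aligned_loss(per_sample_loss, is_conflicting):
    """Reweight losses so bias-aligned and bias-conflicting groups
    contribute comparably to the gradient, instead of letting the
    majority bias-aligned samples dominate the update.

    per_sample_loss: (B,) tensor; is_conflicting: (B,) bool tensor from
    an upstream identification step (e.g., an ECS-style score).
    """
    aligned = per_sample_loss[~is_conflicting]
    conflicting = per_sample_loss[is_conflicting]
    if len(conflicting) == 0 or len(aligned) == 0:
        return per_sample_loss.mean()
    # Scalar weight that matches the two groups' total contribution.
    w = aligned.sum().detach() / conflicting.sum().detach().clamp_min(1e-8)
    return (aligned.sum() + w * conflicting.sum()) / len(per_sample_loss)
```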
arXiv Detail & Related papers (2023-02-22T14:50:24Z)
- Split-PU: Hardness-aware Training Strategy for Positive-Unlabeled Learning [42.26185670834855]
Positive-Unlabeled (PU) learning aims to learn a model with rare positive samples and abundant unlabeled samples.
This paper focuses on improving the commonly-used nnPU with a novel training pipeline.
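For context, the nnPU estimator the paper builds on rewrites the negative-class risk using only positive and unlabeled data, and clamps it at zero to keep the estimate from going negative. The sketch below uses the common softplus surrogate loss; the surrogate choice and the known class prior are assumptions.

```python
import torch

def nnpu_loss(pos_out, unl_out, prior, loss=torch.nn.functional.softplus):
    """Non-negative PU (nnPU) risk estimator.

    pos_out / unl_out: raw model scores for positive and unlabeled samples.
    prior: class prior pi = P(y = +1), assumed known or estimated.
    Uses the logistic-style surrogate l(z) = softplus(-z) by default.
    """
    r_p_pos = loss(-pos_out).mean()   # positives scored as positive
    r_p_neg = loss(pos_out).mean()    # positives scored as negative
    r_u_neg = loss(unl_out).mean()    # unlabeled scored as negative
    neg_risk = r_u_neg - prior * r_p_neg
    # Clamping at zero is the nnPU correction that prevents the empirical
    # negative risk from going negative and causing overfitting.
    return prior * r_p_pos + torch.clamp(neg_risk, min=0.0)
```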
arXiv Detail & Related papers (2022-11-30T05:48:31Z)