Related papers: DONOD: Robust and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning

URL: http://arxiv.org/abs/2504.14810v1
Date: Mon, 21 Apr 2025 02:25:03 GMT
Title: DONOD: Robust and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning
Authors: Jucheng Hu, Surong Yang, Dongzhan Zhou, Lijun Wu,
Abstract summary: Ad-hoc instruction fine-tuning of large language models (LLMs) is widely adopted for domain-specific adaptation. We propose DONOD, a lightweight model-intrinsic data pruning method. By filtering out 70% of the full dataset, we improve target-domain accuracy by 14.90% and cross-domain accuracy by 5.67%.
Score: 22.704995231753397
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Ad-hoc instruction fine-tuning of large language models (LLMs) is widely adopted for domain-specific adaptation. While domain-specific supervised fine-tuning (SFT) is effective and efficient, it often weakens cross-domain generalization and struggles with noisy training data. To address these challenges, we propose DONOD, a lightweight model-intrinsic data pruning method. Our approach evaluates data using two model-parameter-based metrics: Delta of Norm (DON), which captures the cumulative influence on model weights, and Norm of Delta (NOD), which quantifies weight instability. Moreover, by employing the Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) algorithm, we effectively filter noisy, unlearnable, and generalization-harming samples without relying on auxiliary models during the SFT process. Experiments on mathematical tasks demonstrate that data selected by DONOD achieve superior fine-tuning efficiency and improved robustness against noisy data. By filtering out 70% of the full dataset, we improve target-domain accuracy by 14.90% and cross-domain accuracy by 5.67%. Meanwhile, our selected data present superior cross-architecture generalization. Data pruned by smaller models (e.g., Llama 3.1-8B) generalize effectively on larger models (e.g., Llama 2-13B). Compared to existing related methodologies, DONOD demonstrates comparable or superior performance while remaining dataset-agnostic, enabling broader applicability.
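
A minimal sketch of how the two metrics and the TOPSIS ranking described in the abstract might fit together. The abstract does not give exact formulas, so the definitions below are assumptions read off the metric names: DON is taken as the change in the weight norm, and NOD as the norm of the weight change, both over a flat list of weights. The function names (`don`, `nod`, `topsis`, `select_top`), the equal criterion weighting, and the choice of which direction counts as "better" for each metric are all hypothetical and are not taken from the paper.

```python
import math


def l2_norm(weights):
    """Euclidean norm of a flat list of weights."""
    return math.sqrt(sum(w * w for w in weights))


def don(before, after):
    """Delta of Norm (assumed form): norm after the update minus norm before."""
    return l2_norm(after) - l2_norm(before)


def nod(before, after):
    """Norm of Delta (assumed form): norm of the element-wise weight change."""
    return l2_norm([a - b for a, b in zip(after, before)])


def topsis(rows, benefit):
    """Standard TOPSIS with equal criterion weights.

    rows    : one tuple of criterion values per sample, e.g. (DON, NOD)
    benefit : one bool per criterion; True if larger is better, False if smaller
    Returns the closeness of each row to the ideal solution, in [0, 1].
    """
    n_cols = len(benefit)
    # Vector-normalise each column; fall back to 1.0 for an all-zero column.
    norms = [math.sqrt(sum(r[j] ** 2 for r in rows)) or 1.0 for j in range(n_cols)]
    scaled = [[r[j] / norms[j] for j in range(n_cols)] for r in rows]
    columns = list(zip(*scaled))
    best = [max(c) if b else min(c) for c, b in zip(columns, benefit)]
    worst = [min(c) if b else max(c) for c, b in zip(columns, benefit)]
    scores = []
    for r in scaled:
        d_best = math.dist(r, best)
        d_worst = math.dist(r, worst)
        total = d_best + d_worst
        # If every row is identical, both distances are zero; return a neutral score.
        scores.append(d_worst / total if total else 0.5)
    return scores


def select_top(rows, benefit, keep_fraction):
    """Indices of the highest-scoring rows, keeping at least one sample."""
    scores = topsis(rows, benefit)
    k = max(1, round(len(rows) * keep_fraction))
    ranked = sorted(range(len(rows)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Hypothetical (DON, NOD) pairs for four samples; smaller is assumed better for both.
    metrics = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.2), (0.8, 0.8)]
    print(select_top(metrics, benefit=(False, False), keep_fraction=0.5))
```

Pruning 70% of the data, as in the reported experiments, would correspond to `keep_fraction=0.3` in this sketch. The `benefit` flags are exposed as a parameter because the abstract does not say which direction of each metric marks a sample as worth keeping.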

Related papers

InfiAlign: A Scalable and Sample-Efficient Framework for Aligning LLMs to Enhance Reasoning Capabilities [27.09178257629886]
InfiAlign is a scalable and sample-efficient post-training framework for large language models (LLMs). At the core of InfiAlign is a robust data selection pipeline that automatically curates high-quality alignment data from open-source reasoning. Our results highlight the effectiveness of combining principled data selection with full-stage post-training.
arXiv Detail & Related papers (2025-08-07T15:34:06Z)
Efficient Data Selection at Scale via Influence Distillation [53.03573620682107]
This paper introduces Influence Distillation, a mathematically justified framework for data selection. By distilling each sample's influence on a target distribution, our method assigns model-specific weights that are used to select training data. Experiments show that Influence Distillation matches or outperforms state-of-the-art performance while achieving up to $3.5\times$ faster selection.
arXiv Detail & Related papers (2025-05-25T09:08:00Z)
Towards Robust Universal Information Extraction: Benchmark, Evaluation, and Solution [66.11004226578771]
Existing robust benchmark datasets have two key limitations. They generate only a limited range of perturbations for a single Information Extraction (IE) task. Considering the powerful generation capabilities of Large Language Models (LLMs), we introduce a new benchmark dataset for Robust UIE, called RUIE-Bench. We show that training with only 15% of the data leads to an average 7.5% relative performance improvement across three IE tasks.
arXiv Detail & Related papers (2025-03-05T05:39:29Z)
A Scalable Approach to Covariate and Concept Drift Management via Adaptive Data Segmentation [0.562479170374811]
In many real-world applications, continuous machine learning (ML) systems are crucial but prone to data drift. Traditional drift adaptation methods typically update models using ensemble techniques, often discarding drifted historical data. We contend that explicitly incorporating drifted data into the model training process significantly enhances model accuracy and robustness.
arXiv Detail & Related papers (2024-11-23T17:35:23Z)
When to Trust Your Data: Enhancing Dyna-Style Model-Based Reinforcement Learning With Data Filter [7.886307329450978]
Dyna-style algorithms combine two approaches by using simulated data from an estimated environmental model to accelerate model-free training. Previous works address this issue by using model ensembles or pretraining the estimated model with data collected from the real environment. We introduce an out-of-distribution data filter that removes simulated data from the estimated model that significantly diverges from data collected in the real environment.
arXiv Detail & Related papers (2024-10-16T01:49:03Z)
Adaptive Data Optimization: Dynamic Sample Selection with Scaling Laws [59.03420759554073]
We introduce Adaptive Data Optimization (ADO), an algorithm that optimizes data distributions in an online fashion, concurrent with model training. ADO does not require external knowledge, proxy models, or modifications to the model update. ADO uses per-domain scaling laws to estimate the learning potential of each domain during training and adjusts the data mixture accordingly.
arXiv Detail & Related papers (2024-10-15T17:47:44Z)
LESS: Selecting Influential Data for Targeted Instruction Tuning [64.78894228923619]
We propose LESS, an efficient algorithm to estimate data influences and perform Low-rank gradiEnt Similarity Search for instruction data selection. We show that training on a LESS-selected 5% of the data can often outperform training on the full dataset across diverse downstream tasks. Our method goes beyond surface form cues to identify data that exemplifies the necessary reasoning skills for the intended downstream application.
arXiv Detail & Related papers (2024-02-06T19:18:04Z)
AutoFT: Learning an Objective for Robust Fine-Tuning [60.641186718253735]
Foundation models encode rich representations that can be adapted to downstream tasks by fine-tuning. Current approaches to robust fine-tuning use hand-crafted regularization techniques. We propose AutoFT, a data-driven approach for robust fine-tuning.
arXiv Detail & Related papers (2024-01-18T18:58:49Z)
Functional Graphical Models: Structure Enables Offline Data-Driven Optimization [111.28605744661638]
We show how structure can enable sample-efficient data-driven optimization. We also present a data-driven optimization algorithm that infers the FGM structure itself.
arXiv Detail & Related papers (2024-01-08T22:33:14Z)
Breaking Boundaries: Balancing Performance and Robustness in Deep Wireless Traffic Forecasting [11.029214459961114]
Balancing the trade-off between accuracy and robustness is a long-standing challenge in time series forecasting. We study a wide array of perturbation scenarios and propose novel defense mechanisms against adversarial attacks using real-world telecom data.
arXiv Detail & Related papers (2023-11-16T11:10:38Z)
Fairer and More Accurate Tabular Models Through NAS [14.147928131445852]
We propose using multi-objective Neural Architecture Search (NAS) and Hyperparameter Optimization (HPO) in the first application to the very challenging domain of tabular data. We show that models optimized solely for accuracy with NAS often fail to inherently address fairness concerns. We produce architectures that consistently dominate state-of-the-art bias mitigation methods either in fairness, accuracy or both.
arXiv Detail & Related papers (2023-10-18T17:56:24Z)
Hyperparameter-free Continuous Learning for Domain Classification in Natural Language Understanding [60.226644697970116]
Domain classification is the fundamental task in natural language understanding (NLU). Most existing continual learning approaches suffer from low accuracy and performance fluctuation. We propose a hyperparameter-free continual learning model for text data that can stably produce high performance under various environments.
arXiv Detail & Related papers (2022-01-05T02:46:16Z)
Robust Optimal Transport with Applications in Generative Modeling and Domain Adaptation [120.69747175899421]
Optimal Transport (OT) distances such as Wasserstein have been used in several areas such as GANs and domain adaptation. We propose a computationally-efficient dual form of the robust OT optimization that is amenable to modern deep learning applications. Our approach can train state-of-the-art GAN models on noisy datasets corrupted with outlier distributions.
arXiv Detail & Related papers (2020-10-12T17:13:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.