DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization
- URL: http://arxiv.org/abs/2411.14055v2
- Date: Tue, 27 May 2025 02:52:03 GMT
- Title: DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization
- Authors: Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Jing Li, Min Zhang, Zhaopeng Tu,
- Abstract summary: Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs.<n>We propose DRPruning, a method that dynamically adjusts the data distribution during training to restore balanced performance across heterogeneous and multi-tasking data.<n> Experiments in monolingual and multilingual settings show that DRPruning surpasses similarly sized models in both pruning and continued pretraining over perplexity, downstream tasks, and instruction tuning.
- Score: 59.96455188197593
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs. Structured pruning reduces model size and speeds up inference but often causes uneven degradation across domains, leading to biased performance. To address this, we propose DRPruning, a method that dynamically adjusts the data distribution during training to restore balanced performance across heterogeneous and multi-tasking data. Experiments in monolingual and multilingual settings show that DRPruning surpasses similarly sized models in both pruning and continued pretraining over perplexity, downstream tasks, and instruction tuning. Further analysis demonstrates the robustness of DRPruning towards various domains and distribution shifts. Furthermore, DRPruning can determine optimal reference losses and data ratios automatically, suggesting potential for broader applications. Code and scripts are available at https://github.com/hexuandeng/DRPruning.
Related papers
- SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL)<n>We propose textbfSPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z) - Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining [74.83412846804977]
Reinforcement learning (RL)-based fine-tuning has become a crucial step in post-training language models.<n>We present a systematic end-to-end study of RL fine-tuning for mathematical reasoning by training models entirely from scratch.
arXiv Detail & Related papers (2025-04-10T17:15:53Z) - Sample-aware Adaptive Structured Pruning for Large Language Models [14.605017410864583]
This study introduces AdaPruner, a sample-aware adaptive structured pruning framework for large language models (LLMs)
Specifically, AdaPruner effectively removes redundant parameters from LLMs by constructing a structured pruning solution space.
At a 20% pruning ratio, the model pruned with AdaPruner maintains 97% of the performance of the unpruned model.
arXiv Detail & Related papers (2025-03-08T12:00:21Z) - OPTISHEAR: Towards Efficient and Adaptive Pruning of Large Language Models via Evolutionary Optimization [18.57876883968734]
We introduce textbftextscOptiShear, an efficient evolutionary optimization framework for adaptive LLM pruning.
Our framework features two key innovations: an effective search space built on our Meta pruning metric, and a model-wise reconstruction error for rapid evaluation.
arXiv Detail & Related papers (2025-02-15T09:17:38Z) - DReSS: Data-driven Regularized Structured Streamlining for Large Language Models [30.47317140878219]
Large language models (LLMs) have achieved significant progress across various domains, but their increasing scale results in high computational and memory costs.<n>We propose a novel paradigm that first applies regularization, then prunes, and finally finetunes.<n>By leveraging a small amount of data to regularize the components to be pruned, DReSS explicitly transfers the important information to the remaining parts of the model in advance.
arXiv Detail & Related papers (2025-01-29T14:28:11Z) - Instruction-Following Pruning for Large Language Models [58.329978053711024]
We move beyond the traditional static pruning approach of determining a fixed pruning mask for a model.
In our method, the pruning mask is input-dependent and adapts dynamically based on the information described in a user instruction.
Our approach, termed "instruction-following pruning", introduces a sparse mask predictor that takes the user instruction as input and dynamically selects the most relevant model parameters for the given task.
arXiv Detail & Related papers (2025-01-03T20:19:14Z) - Numerical Pruning for Efficient Autoregressive Models [87.56342118369123]
This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning.
Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and modules, respectively.
To verify the effectiveness of our method, we provide both theoretical support and extensive experiments.
arXiv Detail & Related papers (2024-12-17T01:09:23Z) - Task-level Distributionally Robust Optimization for Large Language Model-based Dense Retrieval [32.104911827710936]
We propose a new task-level Distributionally Robust Optimization (tDRO) algorithm for Large Language Model-based Dense Retrieval fine-tuning.
The tDRO parameterizes the domain weights and updates them with scaled domain gradients.
Experiments show optimal improvements in large-scale retrieval benchmarks and reduce up to 30% dataset usage.
arXiv Detail & Related papers (2024-08-20T07:48:19Z) - Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z) - ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned gradient conjugate-based post-processing step.
Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency.
On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
arXiv Detail & Related papers (2024-06-12T02:57:41Z) - Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better in various preference data, regardless of data scarcity or abundance.
arXiv Detail & Related papers (2024-05-26T00:29:04Z) - Not All Data Matters: An End-to-End Adaptive Dataset Pruning Framework
for Enhancing Model Performance and Efficiency [9.460023981858319]
We propose an end-to-end Adaptive DAtaset PRUNing framework called AdaPruner.
AdaPruner iteratively prunes redundant samples to an expected pruning ratio.
It can still significantly enhance model performance even after pruning up to 10-30% of the training data.
arXiv Detail & Related papers (2023-12-09T16:01:21Z) - A Machine Learning Approach to Improving Timing Consistency between
Global Route and Detailed Route [3.202646674984817]
Inaccurate timing prediction wastes design effort, hurts circuit performance, and may lead to design failure.
This work focuses on timing prediction after clock tree synthesis and placement legalization, which is the earliest opportunity to time and optimize a "complete" netlist.
To bridge the gap between GR-based parasitic and timing estimation and post-DR results during post-GR optimization, machine learning (ML)-based models are proposed.
arXiv Detail & Related papers (2023-05-11T16:01:23Z) - Leveraging Synthetic Targets for Machine Translation [5.302421715411791]
We show that training models on synthetic targets outperforms training on the actual ground-truth data.
We provide preliminary analysis into whether this boost in performance is linked to ease of optimization or more deterministic nature of the predictions.
arXiv Detail & Related papers (2023-05-07T07:42:22Z) - Tailoring Language Generation Models under Total Variation Distance [55.89964205594829]
The standard paradigm of neural language generation adopts maximum likelihood estimation (MLE) as the optimizing method.
We develop practical bounds to apply it to language generation.
We introduce the TaiLr objective that balances the tradeoff of estimating TVD.
arXiv Detail & Related papers (2023-02-26T16:32:52Z) - Fine-Tuning Pre-Trained Language Models Effectively by Optimizing
Subnetworks Adaptively [32.001304911395756]
We propose a Dynamic Selection (DPS) algorithm for the large-scale pre-trained models during fine-tuning.
Experiments on the GLUE benchmark show that DPS outperforms previous fine-tuning methods in terms of overall performance and stability.
arXiv Detail & Related papers (2022-11-03T08:32:12Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - Scaling Structured Inference with Randomization [64.18063627155128]
We propose a family of dynamic programming (RDP) randomized for scaling structured models to tens of thousands of latent states.
Our method is widely applicable to classical DP-based inference.
It is also compatible with automatic differentiation so can be integrated with neural networks seamlessly.
arXiv Detail & Related papers (2021-12-07T11:26:41Z) - Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for Multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z) - Modeling the Second Player in Distributionally Robust Optimization [90.25995710696425]
We argue for the use of neural generative models to characterize the worst-case distribution.
This approach poses a number of implementation and optimization challenges.
We find that the proposed approach yields models that are more robust than comparable baselines.
arXiv Detail & Related papers (2021-03-18T14:26:26Z) - Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.