DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization
- URL: http://arxiv.org/abs/2411.14055v1
- Date: Thu, 21 Nov 2024 12:02:39 GMT
- Title: DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization
- Authors: Hexuan Deng, Wenxiang Jiao, Xuebo Liu, Min Zhang, Zhaopeng Tu
- Abstract summary: Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs.
We propose DRPruning, which incorporates distributionally robust optimization to restore balanced performance across domains.
- Score: 61.492590008258986
- Abstract: Large language models (LLMs) deliver impressive results but face challenges from increasing model sizes and computational costs. Structured pruning reduces model size and speeds up inference but often causes uneven degradation across domains, leading to biased performance. To address this, we propose DRPruning, which incorporates distributionally robust optimization to restore balanced performance across domains, along with further improvements to enhance robustness. Experiments in monolingual and multilingual settings show that our method surpasses similarly sized models in pruning and continued pretraining over perplexity, downstream tasks, and instruction tuning. We further provide analysis demonstrating the robustness of our method towards various domains and distribution shifts. Furthermore, our method automatically determines optimal reference losses and data ratios, suggesting potential for broader applications. Our code is available at https://github.com/hexuandeng/DRPruning.
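For intuition, the core reweighting idea can be sketched in a few lines: each domain's sampling weight is scaled up multiplicatively whenever its loss exceeds a reference loss, so subsequent training draws more data from lagging domains. This is a minimal sketch under an assumed exponentiated-gradient update and made-up names, not the authors' implementation; see the linked repository for that.

```python
import numpy as np

def dro_domain_weights(domain_losses, reference_losses, weights, step_size=0.1):
    """One exponentiated-gradient update of domain sampling weights:
    domains whose current loss exceeds their reference loss are
    upweighted, so later training draws more data from them."""
    excess = np.asarray(domain_losses) - np.asarray(reference_losses)
    new_weights = weights * np.exp(step_size * excess)  # multiplicative update
    return new_weights / new_weights.sum()              # renormalize onto the simplex

# Toy usage: domain 1 lags its reference the most, so its sampling ratio grows.
weights = np.full(3, 1.0 / 3.0)
weights = dro_domain_weights(
    domain_losses=[2.1, 3.4, 2.0],
    reference_losses=[2.0, 2.5, 2.2],
    weights=weights,
)
print(weights)  # domain 1 now gets the largest share
```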
Related papers
- Task-level Distributionally Robust Optimization for Large Language Model-based Dense Retrieval [32.104911827710936]
We propose a new task-level Distributionally Robust Optimization (tDRO) algorithm for Large Language Model-based Dense Retrieval fine-tuning.
The tDRO parameterizes the domain weights and updates them with scaled domain gradients.
Experiments show improvements on large-scale retrieval benchmarks while reducing dataset usage by up to 30%.
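A rough sketch of what "parameterizes the domain weights and updates them with scaled domain gradients" can look like: the weights are a softmax over learnable logits, and the logits take ascent steps along the exact gradient of the weighted, baseline-scaled loss. The baseline scaling, step size, and values below are assumptions, not tDRO's actual formulation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

theta = np.zeros(4)                         # logits parameterizing domain weights
losses = np.array([0.9, 1.6, 1.1, 0.7])     # per-domain training losses
scaled = losses / np.array([1.0, 1.2, 1.0, 0.9])  # scale by an assumed baseline

for _ in range(50):
    w = softmax(theta)
    theta += 0.1 * w * (scaled - w @ scaled)  # gradient of sum_i w_i * scaled_i w.r.t. theta
print(softmax(theta))  # weight shifts toward the hardest relative domain
```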
arXiv Detail & Related papers (2024-08-20T07:48:19Z)
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT (Divergent Chain of Thought) datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step.
Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency.
On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
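The operator-splitting idea can be illustrated on a single linear layer: ADMM alternates a least-squares reconstruction step with a hard-threshold projection onto the sparse set. Everything below (hyperparameters, the projection, shapes) is an illustrative assumption; ALPS's actual formulation, including the preconditioned conjugate gradient post-processing, is more involved.

```python
import numpy as np

def prune_layer_admm(X, W0, sparsity=0.7, rho=1.0, iters=50):
    """Keep X @ W close to X @ W0 while zeroing most entries of W."""
    d = W0.shape[0]
    H_inv = np.linalg.inv(X.T @ X + rho * np.eye(d))  # fixed system, invert once
    G = X.T @ X @ W0
    Z, U = W0.copy(), np.zeros_like(W0)
    k = int((1 - sparsity) * W0.size)                 # number of weights to keep
    for _ in range(iters):
        W = H_inv @ (G + rho * (Z - U))               # least-squares sub-problem
        V = W + U
        Z = np.zeros_like(V)                          # projection: keep k largest |entries|
        idx = np.unravel_index(np.argsort(np.abs(V), axis=None)[-k:], V.shape)
        Z[idx] = V[idx]
        U += W - Z                                    # dual update
    return Z

rng = np.random.default_rng(0)
X, W0 = rng.standard_normal((256, 64)), rng.standard_normal((64, 32))
print((prune_layer_admm(X, W0) == 0).mean())  # ~0.7 of the weights are zero
```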
arXiv Detail & Related papers (2024-06-12T02:57:41Z)
- Multi-Reference Preference Optimization for Large Language Models [56.84730239046117]
We introduce a novel closed-form formulation for direct preference optimization using multiple reference models.
The resulting algorithm, Multi-Reference Preference Optimization (MRPO), leverages broader prior knowledge from diverse reference models.
Our experiments demonstrate that LLMs finetuned with MRPO generalize better across various preference data, regardless of data scarcity or abundance.
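For orientation, a DPO-style preference loss extended to several reference models might look like the sketch below, with reference log-probabilities aggregated by a simple average; that aggregation is an illustrative assumption, not MRPO's closed-form combination.

```python
import torch
import torch.nn.functional as F

def multi_ref_dpo_loss(policy_w, policy_l, refs_w, refs_l, beta=0.1):
    """DPO-style loss where refs_* hold log-probs from several reference
    models, shape (num_refs, batch); averaging them is an assumption."""
    logits = (policy_w - refs_w.mean(dim=0)) - (policy_l - refs_l.mean(dim=0))
    return -F.logsigmoid(beta * logits).mean()

# Toy sequence log-probabilities for chosen (w) and rejected (l) responses.
pw, pl = torch.tensor([-4.0, -5.0]), torch.tensor([-6.0, -5.5])
rw = torch.tensor([[-4.5, -5.2], [-4.8, -5.1]])  # two reference models
rl = torch.tensor([[-5.8, -5.4], [-6.1, -5.6]])
print(multi_ref_dpo_loss(pw, pl, rw, rl))
```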
arXiv Detail & Related papers (2024-05-26T00:29:04Z)
- Leveraging Synthetic Targets for Machine Translation [5.302421715411791]
We show that training models on synthetic targets outperforms training on the actual ground-truth data.
We provide a preliminary analysis of whether this boost in performance is linked to ease of optimization or to the more deterministic nature of the predictions.
arXiv Detail & Related papers (2023-05-07T07:42:22Z)
- Fine-Tuning Pre-Trained Language Models Effectively by Optimizing Subnetworks Adaptively [32.001304911395756]
We propose a Dynamic Parameter Selection (DPS) algorithm that adaptively selects promising subnetworks of large-scale pre-trained models to update during fine-tuning.
Experiments on the GLUE benchmark show that DPS outperforms previous fine-tuning methods in terms of overall performance and stability.
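As a rough illustration of optimizing subnetworks adaptively, the sketch below zeroes all but the largest-magnitude gradients after each backward pass, so every optimizer step updates only a dynamically chosen subset of parameters. The magnitude criterion and keep ratio are assumptions, not DPS's actual selection rule.

```python
import torch

def mask_to_subnetwork(model, keep_ratio=0.3):
    """After backward(), keep only the largest-magnitude gradients per
    tensor and zero the rest, so the update touches a subnetwork."""
    for p in model.parameters():
        if p.grad is None:
            continue
        flat = p.grad.abs().flatten()
        k = max(1, int(keep_ratio * flat.numel()))
        threshold = flat.topk(k).values.min()
        p.grad.mul_((p.grad.abs() >= threshold).float())

model = torch.nn.Linear(16, 4)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss = model(torch.randn(8, 16)).pow(2).mean()
loss.backward()
mask_to_subnetwork(model)  # restrict this step to a subnetwork
opt.step()
```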
arXiv Detail & Related papers (2022-11-03T08:32:12Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
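The adversary's best response has a convenient closed form when its deviation from the original data distribution is KL-regularized: the worst-case language weights are a softmax of the per-language losses over the prior proportions. A tiny sketch follows (the KL regularizer and temperature are assumptions, not necessarily the paper's exact uncertainty set).

```python
import numpy as np

def adversary_best_response(losses, prior, tau=1.0):
    """Closed-form maximizer of sum_i w_i * l_i - tau * KL(w || prior)."""
    w = np.asarray(prior) * np.exp(np.asarray(losses) / tau)
    return w / w.sum()

losses = [1.2, 2.5, 1.8]          # per-language dev losses
prior = np.full(3, 1.0 / 3.0)     # original data proportions
print(adversary_best_response(losses, prior))  # hardest language upweighted
```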
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- Modeling the Second Player in Distributionally Robust Optimization [90.25995710696425]
We argue for the use of neural generative models to characterize the worst-case distribution.
This approach poses a number of implementation and optimization challenges.
We find that the proposed approach yields models that are more robust than comparable baselines.
arXiv Detail & Related papers (2021-03-18T14:26:26Z)
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of such variations can be covered by a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.