EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs
- URL: http://arxiv.org/abs/2402.12419v1
- Date: Mon, 19 Feb 2024 09:55:32 GMT
- Title: EBFT: Effective and Block-Wise Fine-Tuning for Sparse LLMs
- Authors: Song Guo, Fan Wu, Lei Zhang, Xiawu Zheng, Shengchuan Zhang, Fei Chao,
Yiyu Shi, Rongrong Ji
- Abstract summary: Existing methods for fine-tuning sparse LLMs often suffer from resource-intensive requirements and high retraining costs.
We propose an efficient and fast framework for fine-tuning sparse LLMs based on minimizing reconstruction error.
Our approach involves sampling a small dataset for calibration and utilizing backpropagation to iteratively optimize block-wise reconstruction error.
- Score: 68.41135269685576
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods for fine-tuning sparse LLMs often suffer from
resource-intensive requirements and high retraining costs. Additionally, many
fine-tuning methods often rely on approximations or heuristic optimization
strategies, which may lead to suboptimal solutions. To address these issues, we
propose an efficient and fast framework for fine-tuning sparse LLMs based on
minimizing reconstruction error. Our approach involves sampling a small dataset
for calibration and utilizing backpropagation to iteratively optimize the
block-wise reconstruction error, aiming for optimal solutions. Extensive
experiments on various benchmarks consistently demonstrate
the superiority of our method over other baselines. For instance, on the
Wikitext2 dataset with LlamaV1-7B at 70% sparsity, our proposed EBFT achieves a
perplexity of 16.88, surpassing the state-of-the-art DSnoT with a perplexity of
75.14. Moreover, with a structured sparsity ratio of 26\%, EBFT achieves a
perplexity of 16.27, outperforming LoRA (perplexity 16.44). Furthermore, the
fine-tuning process of EBFT for LlamaV1-7B only takes approximately 30 minutes,
and the entire framework can be executed on a single 16GB GPU. The source code
is available at https://github.com/sunggo/EBFT.
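To make the block-wise procedure concrete, here is a minimal PyTorch sketch of fine-tuning each pruned block to reconstruct its dense counterpart's outputs on a small calibration set. The interface (`dense_blocks`, `sparse_blocks`, `calib_hidden`) and the single-tensor block signature are simplifying assumptions for illustration; this is not the released EBFT implementation, which is linked above.

```python
# Minimal sketch of block-wise reconstruction fine-tuning for a sparse model
# (a simplified illustration; not the released EBFT code).
import torch
import torch.nn.functional as F


def blockwise_finetune(dense_blocks, sparse_blocks, calib_hidden, steps=100, lr=1e-4):
    """Fine-tune each sparse block to reproduce the dense block's outputs.

    dense_blocks, sparse_blocks: aligned lists of modules, each assumed to map a
        hidden-state tensor to a hidden-state tensor.
    calib_hidden: hidden states of the calibration samples entering the first block.
    """
    hidden_dense = calib_hidden.clone()
    hidden_sparse = calib_hidden.clone()
    for dense_blk, sparse_blk in zip(dense_blocks, sparse_blocks):
        with torch.no_grad():
            target = dense_blk(hidden_dense)          # teacher output for this block
        # Record the sparsity pattern so pruned weights stay zero while tuning.
        masks = {n: (p != 0).float() for n, p in sparse_blk.named_parameters()}
        opt = torch.optim.Adam(sparse_blk.parameters(), lr=lr)
        for _ in range(steps):
            loss = F.mse_loss(sparse_blk(hidden_sparse), target)  # block-wise reconstruction error
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                for n, p in sparse_blk.named_parameters():
                    p.mul_(masks[n])                  # re-apply the pruning mask
        with torch.no_grad():                         # propagate activations to the next block
            hidden_dense = target
            hidden_sparse = sparse_blk(hidden_sparse)
    return sparse_blocks
```

Only one block's parameters are optimized at a time, which is what keeps the memory footprint small enough for a single 16GB GPU in the paper's setting.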
Related papers
- Theoretically Grounded Pruning of Large Ground Sets for Constrained, Discrete Optimization [12.016449555335976]
We develop light-weight pruning algorithms to discard elements that are unlikely to be part of an optimal solution.
Under mild assumptions, we prove theoretical guarantees on the fraction of the optimal value retained and the size of the resulting pruned ground set.
Our algorithm, QuickPrune, efficiently prunes over 90% of the ground set and outperforms state-of-the-art classical and machine learning methods for pruning.
arXiv Detail & Related papers (2024-10-23T15:18:07Z) - Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z) - A Convex-optimization-based Layer-wise Post-training Pruner for Large Language Models [24.185245582500876]
We introduce FISTAPruner, the first post-training pruner based on convex optimization models and algorithms.
FISTAPruner incorporates an intra-layer cumulative error correction mechanism and supports parallel pruning.
We evaluate FISTAPruner on models such as OPT, LLaMA, LLaMA-2, and LLaMA-3 with 125M to 70B parameters under unstructured and 2:4 semi-structured sparsity.
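As a rough illustration of the convex-optimization view of layer-wise pruning (not FISTAPruner's actual formulation, cumulative error correction, or parallel scheme), the following NumPy sketch solves an L1-regularized reconstruction problem for one layer with FISTA-style proximal gradient updates; all names are placeholders.

```python
# Proximal-gradient (FISTA-style) sketch of L1-regularized layer-wise
# reconstruction; a generic illustration, not the FISTAPruner algorithm.
import numpy as np


def soft_threshold(W, tau):
    return np.sign(W) * np.maximum(np.abs(W) - tau, 0.0)


def fista_layer(X, W0, lam=1e-2, iters=200):
    """Approximately minimize (1/2) * ||X W - X W0||_F^2 + lam * ||W||_1.

    X:  calibration activations (samples x in_features)
    W0: the dense layer's weights (in_features x out_features)
    """
    Y = X @ W0                              # dense outputs to reconstruct
    L = np.linalg.norm(X, 2) ** 2           # Lipschitz constant of the gradient
    W, Z, t = W0.copy(), W0.copy(), 1.0
    for _ in range(iters):
        grad = X.T @ (X @ Z - Y)
        W_new = soft_threshold(Z - grad / L, lam / L)   # proximal step induces sparsity
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        Z = W_new + ((t - 1.0) / t_new) * (W_new - W)   # momentum extrapolation
        W, t = W_new, t_new
    return W
```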
arXiv Detail & Related papers (2024-08-07T12:33:46Z) - Sparsity-Constraint Optimization via Splicing Iteration [1.3622424109977902]
We develop an algorithm named Sparsity-Constraint Optimization via sPlicing itEration (SCOPE).
SCOPE converges effectively without tuning parameters.
We apply SCOPE to solve quadratic optimization, learn sparse classifiers, and recover sparse Markov networks for binary variables.
Our open-source Python package skscope based on C++ implementation is publicly available on GitHub.
arXiv Detail & Related papers (2024-06-17T18:34:51Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs for about 2.7 hours and uses around 35GB of memory for the 13B models on a single A100 GPU.
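The mechanism described here, learning masks in a probabilistic space with only forward passes of the pruned model, can be illustrated with a simple REINFORCE-style update over Bernoulli mask probabilities. The function below is a simplified sketch under that assumption (single step, no variance-reduction baseline), not the paper's algorithm.

```python
# REINFORCE-style sketch of learning Bernoulli pruning masks without
# back-propagating through the model weights (a simplified illustration).
import torch


def policy_gradient_mask_step(logits, loss_fn, opt, n_samples=4):
    """One update of the mask logits using the score-function estimator.

    logits:  learnable tensor with one logit per prunable unit
    loss_fn: evaluates the pruned model's loss for a binary mask (forward pass only)
    opt:     optimizer over [logits]
    """
    with torch.no_grad():
        probs = torch.sigmoid(logits)
        grad_est = torch.zeros_like(logits)
        for _ in range(n_samples):
            mask = torch.bernoulli(probs)            # sample a pruning mask
            loss = loss_fn(mask)                     # no backward pass through the LLM
            # Score function of the Bernoulli distribution w.r.t. the logits
            grad_est += loss * (mask - probs) / n_samples
    opt.zero_grad()
    logits.grad = grad_est                           # estimated gradient of the expected loss
    opt.step()
    return probs
```

In practice one would subtract a baseline from the loss to reduce the variance of the estimator.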
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - ALPS: Improved Optimization for Highly Sparse One-Shot Pruning for Large Language Models [14.310720048047136]
ALPS is an optimization-based framework that tackles the pruning problem using the operator splitting technique and a preconditioned conjugate gradient-based post-processing step.
Our approach incorporates novel techniques to accelerate and theoretically guarantee convergence while leveraging vectorization and GPU parallelism for efficiency.
On the OPT-30B model with 70% sparsity, ALPS achieves a 13% reduction in test perplexity on the WikiText dataset and a 19% improvement in zero-shot benchmark performance compared to existing methods.
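The pruning problem behind these numbers is roughly layer-wise sparse reconstruction. Below is a generic operator-splitting (ADMM-style) sketch for a cardinality-constrained least-squares problem, included only to illustrate the splitting idea; it is not the ALPS algorithm and omits its preconditioned conjugate gradient post-processing and convergence machinery.

```python
# Generic ADMM-style operator-splitting sketch for sparse layer reconstruction
# (illustrates the splitting idea only; not the ALPS algorithm).
import numpy as np


def hard_threshold(W, k):
    """Keep the k largest-magnitude entries of W and zero out the rest."""
    flat = np.abs(W).ravel()
    if k >= flat.size:
        return W.copy()
    cutoff = np.partition(flat, -k)[-k]
    return np.where(np.abs(W) >= cutoff, W, 0.0)


def admm_sparse_layer(X, W0, k, rho=1.0, iters=100):
    """Approximately minimize (1/2) * ||X W - X W0||_F^2  s.t.  ||W||_0 <= k."""
    Y = X @ W0
    d = X.shape[1]
    A_inv = np.linalg.inv(X.T @ X + rho * np.eye(d))   # factor once, reuse each iteration
    Z = W0.copy()
    U = np.zeros_like(W0)
    for _ in range(iters):
        W = A_inv @ (X.T @ Y + rho * (Z - U))          # smooth quadratic subproblem
        Z = hard_threshold(W + U, k)                   # projection onto the sparsity constraint
        U = U + W - Z                                  # scaled dual update
    return Z
```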
arXiv Detail & Related papers (2024-06-12T02:57:41Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
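The combined objective can be illustrated schematically as a DPO-style preference loss plus a standard supervised fine-tuning (SFT) term. This is only a sketch of the "preference loss + supervised loss" combination described above; the weighting and the specific preference loss are assumptions, not the paper's exact theoretical algorithm.

```python
# Schematic of a preference-optimization loss regularized by an SFT loss
# (a sketch of the combined objective; not the paper's exact algorithm).
import torch.nn.functional as F


def combined_loss(policy_chosen_logps, policy_rejected_logps,
                  ref_chosen_logps, ref_rejected_logps,
                  sft_nll, beta=0.1, sft_weight=1.0):
    """Preference loss (DPO-style) plus a supervised fine-tuning regularizer.

    *_logps: summed log-probabilities of the chosen/rejected responses under the
             trained policy and a frozen reference model.
    sft_nll: negative log-likelihood of the chosen responses (the SFT loss).
    """
    # Implicit reward margin between chosen and rejected responses
    margin = (policy_chosen_logps - ref_chosen_logps) \
           - (policy_rejected_logps - ref_rejected_logps)
    pref_loss = -F.logsigmoid(beta * margin).mean()
    # The SFT term acts as a regularizer against over-optimizing the learned reward
    return pref_loss + sft_weight * sft_nll
```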
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Federated Learning of Large Language Models with Parameter-Efficient
Prompt Tuning and Adaptive Optimization [71.87335804334616]
Federated learning (FL) is a promising paradigm to enable collaborative model training with decentralized data.
The training process of Large Language Models (LLMs) generally involves updating a significant number of parameters.
This paper proposes an efficient partial prompt tuning approach to improve performance and efficiency simultaneously.
arXiv Detail & Related papers (2023-10-23T16:37:59Z) - Learning Optimal Solutions via an LSTM-Optimization Framework [0.0]
We present a deep learning-optimization framework to tackle dynamic mixed-integer programs.
We develop a bidirectional Long Short-Term Memory (LSTM) framework that can process information forward and backward in time.
We demonstrate our approach in predicting the optimal decisions for the single-item capacitated lot-sizing problem.
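As an illustration of the bidirectional-LSTM idea, the sketch below maps a sequence of per-period problem features to per-period decision probabilities; the input features, sizes, and the binary decision head are assumptions, not the paper's exact architecture.

```python
# Minimal bidirectional LSTM that predicts a per-period decision from a sequence
# of per-period features (an assumed setup, not the paper's exact model).
import torch
import torch.nn as nn


class BiLSTMDecisionPredictor(nn.Module):
    def __init__(self, n_features, hidden_size=64, n_layers=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, num_layers=n_layers,
                            batch_first=True, bidirectional=True)
        # Forward and backward hidden states are concatenated: 2 * hidden_size
        self.head = nn.Linear(2 * hidden_size, 1)

    def forward(self, x):                     # x: (batch, periods, n_features)
        out, _ = self.lstm(x)                 # out: (batch, periods, 2 * hidden_size)
        return torch.sigmoid(self.head(out)).squeeze(-1)


# Example: 32 instances, 12 planning periods, 4 per-period features (e.g. demand, costs)
model = BiLSTMDecisionPredictor(n_features=4)
decision_probs = model(torch.randn(32, 12, 4))   # shape: (32, 12)
```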
arXiv Detail & Related papers (2022-07-06T19:38:01Z) - Provable Stochastic Optimization for Global Contrastive Learning: Small
Batch Does Not Harm Performance [53.49803579981569]
We consider a global objective for contrastive learning, which contrasts each positive pair with all negative pairs for an anchor point.
Existing methods such as SimCLR require a large batch size in order to achieve a satisfactory result.
We propose a memory-efficient optimization algorithm for solving the Global Contrastive Learning of Representations, named SogCLR.
arXiv Detail & Related papers (2022-02-24T22:16:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.