BitDelta: Your Fine-Tune May Only Be Worth One Bit
- URL: http://arxiv.org/abs/2402.10193v2
- Date: Wed, 28 Feb 2024 03:42:10 GMT
- Title: BitDelta: Your Fine-Tune May Only Be Worth One Bit
- Authors: James Liu, Guangxuan Xiao, Kai Li, Jason D. Lee, Song Han, Tri Dao,
Tianle Cai
- Abstract summary: Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks.
We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance.
By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta dramatically reduces GPU memory requirements by more than 10x.
- Score: 60.44468282930883
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are typically trained in two phases:
pre-training on large internet-scale datasets, and fine-tuning for downstream
tasks. Given the higher computational demand of pre-training, it's intuitive to
assume that fine-tuning adds less new information to the model, and is thus
more compressible. We explore this assumption by decomposing the weights of
fine-tuned models into their pre-trained components and an additional delta. We
introduce a simple method, BitDelta, which successfully quantizes this delta
down to 1 bit without compromising performance. This interesting finding not
only highlights the potential redundancy of information added during
fine-tuning, but also has significant implications for the multi-tenant serving
and multi-tenant storage of fine-tuned models. By enabling the use of a single
high-precision base model accompanied by multiple 1-bit deltas, BitDelta
dramatically reduces GPU memory requirements by more than 10x, which can also
be translated to enhanced generation latency in multi-tenant settings. We
validate BitDelta through experiments across Llama-2 and Mistral model
families, and on models up to 70B parameters, showcasing minimal performance
degradation over all tested settings.
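To make the idea above concrete, here is a minimal PyTorch sketch of the 1-bit delta scheme described in the abstract: each weight matrix's fine-tuning delta is replaced by a sign mask plus a single per-matrix scale, initialized to the mean absolute delta (the paper additionally calibrates these scales by distillation, which is omitted here). Function and variable names are illustrative assumptions, not the paper's reference implementation.

```python
import torch

def compress_delta(base_w: torch.Tensor, finetuned_w: torch.Tensor):
    """Approximate the fine-tuning delta with one sign bit per weight plus a per-matrix scale."""
    delta = finetuned_w - base_w
    sign = delta >= 0           # 1-bit mask (packed into bits when actually stored)
    scale = delta.abs().mean()  # per-matrix scale; BitDelta further calibrates these by distillation
    return sign, scale

def apply_delta(x: torch.Tensor, base_w: torch.Tensor, sign: torch.Tensor, scale: torch.Tensor):
    """Forward pass using the shared base matmul plus a cheap 1-bit delta correction."""
    signed = sign.to(base_w.dtype) * 2 - 1          # map {0, 1} -> {-1, +1}
    return x @ base_w.T + scale * (x @ signed.T)

# Toy example with a single linear layer's weight matrix.
base = torch.randn(1024, 1024)
finetuned = base + 0.01 * torch.randn_like(base)    # stand-in for a fine-tuned checkpoint
sign, scale = compress_delta(base, finetuned)

x = torch.randn(2, 1024)
ref = x @ finetuned.T
err = (apply_delta(x, base, sign, scale) - ref).norm() / ref.norm()
print(f"relative output error: {err:.3f}")

# Storage intuition: a 16-bit delta costs 16 bits/param, while a sign mask plus one
# scale costs roughly 1 bit/param, so each extra fine-tune adds about 1/16 of its
# original footprint on top of the shared base model.
```

In a multi-tenant setting, the base-weight matmul is shared across all tenants and only the lightweight 1-bit delta corrections differ per tenant, which is where the reported memory and latency gains come from.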
Related papers
- DeltaDQ: Ultra-High Delta Compression for Fine-Tuned LLMs via Group-wise Dropout and Separate Quantization [17.501956455837707]
Large language models achieve exceptional performance on various downstream tasks through supervised fine-tuning.
Current methods that compress the delta weight struggle to achieve ultra-high compression.
We propose DeltaDQ, a distribution-driven delta compression framework that achieves ultra-high compression of the delta weight.
arXiv Detail & Related papers (2024-10-11T09:44:16Z)
- ME-Switch: A Memory-Efficient Expert Switching Framework for Large Language Models [43.29533894162248]
LLM development involves pre-training a foundation model on massive data, followed by fine-tuning on task-specific data to create specialized experts.
Previous approaches decompose the expert weights into the pre-trained weights plus delta weights, then quantize the delta weights to reduce the model size.
We introduce ME-Switch, a memory-efficient expert switching framework tailored for serving multiple LLMs.
arXiv Detail & Related papers (2024-06-13T12:27:55Z)
- OpenDelta: A Plug-and-play Library for Parameter-efficient Adaptation of Pre-trained Models [81.7855202178564]
We present OpenDelta, an open-source library that overcomes limitations by providing a plug-and-play implementation of various delta tuning methods.
Our novel techniques eliminate the need to modify the backbone PTMs' code, making OpenDelta compatible with different, even novel PTMs.
arXiv Detail & Related papers (2023-07-05T16:30:14Z)
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose WTA-CRS, a new family of unbiased estimators for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z)
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models achieve superior performance on most NLP tasks thanks to their large parameter capacity, but this capacity also incurs huge computation costs.
We explore accelerating large-model inference through conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Memory-Efficient Pipeline-Parallel DNN Training [27.83107540482083]
PipeDream-2BW is a system that supports memory-efficient pipeline parallelism.
It can accelerate the training of large GPT and BERT language models by up to 20$\times$ with similar final model accuracy.
arXiv Detail & Related papers (2020-06-16T20:33:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.