Mitigating Outlier Activations in Low-Precision Fine-Tuning of Language
Models
- URL: http://arxiv.org/abs/2312.09211v3
- Date: Sat, 13 Jan 2024 13:52:16 GMT
- Title: Mitigating Outlier Activations in Low-Precision Fine-Tuning of Language
Models
- Authors: Alireza Ghaffari, Justin Yu, Mahsa Ghazvini Nejad, Masoud Asgharian,
Boxing Chen, Vahid Partovi Nia
- Abstract summary: Low-precision fine-tuning of language models is susceptible to outlier values in the activations.
This paper investigates techniques for mitigating outlier activations in low-precision integer fine-tuning of language models.
- Score: 23.233652898606536
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Low-precision fine-tuning of language models has gained prominence as a
cost-effective and energy-efficient approach to deploying large-scale models in
various applications. However, this approach is susceptible to outlier values in
the activations. Outlier activation values hurt fine-tuning performance in the
low-precision regime because they inflate the quantization scaling factor, making
smaller values harder to represent. This paper investigates techniques for
mitigating outlier activations in low-precision integer fine-tuning of language
models. Our proposed approach represents the outlier activation values as 8-bit
integers instead of floating-point (FP16) values. Keeping the outliers in integer
form lets us use operator tiling, avoiding 16-bit integer matrix multiplication
and addressing the problem efficiently. We provide theoretical analysis and
supporting experiments demonstrating that our approach improves the robustness
and performance of low-precision fine-tuned language models.
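To make the tiling idea concrete, here is a minimal NumPy sketch, not the paper's implementation: activation columns are split into a normal tile and an outlier tile, each tile is quantized to int8 with its own scale, and the two int8 partial products are summed. The outlier threshold and per-tensor scaling scheme are illustrative assumptions.

```python
import numpy as np

def quantize_sym_int8(x):
    """Symmetric per-tensor int8 quantization; returns (int8 values, scale)."""
    scale = max(np.abs(x).max(), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def tiled_int8_matmul(x, w, outlier_thresh=6.0):
    """Split activation columns into normal/outlier tiles, quantize each tile
    to int8 with its own scale, and sum the dequantized partial products.
    Outliers never inflate the normal tile's scale, and both tiles stay in
    8-bit integers, so no 16-bit integer matmul is needed."""
    col_max = np.abs(x).max(axis=0)
    outlier = col_max > outlier_thresh
    y = np.zeros((x.shape[0], w.shape[1]))
    for mask in (~outlier, outlier):
        if not mask.any():
            continue
        qx, sx = quantize_sym_int8(x[:, mask])
        qw, sw = quantize_sym_int8(w[mask, :])
        # int8 x int8 matmul with int32 accumulation, then dequantize
        y += (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
    return y

def naive_int8_matmul(x, w):
    """Single-tile baseline: one scale for everything, so outliers dominate."""
    qx, sx = quantize_sym_int8(x)
    qw, sw = quantize_sym_int8(w)
    return (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)
```

On activations containing a few large-magnitude columns, the tiled variant stays much closer to the floating-point reference than the single-scale baseline.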
Related papers
- Forgetting: A New Mechanism Towards Better Large Language Model Fine-tuning [53.398270878295754]
Supervised fine-tuning (SFT) plays a critical role for pretrained large language models (LLMs).
We suggest categorizing the tokens within each corpus into two parts -- positive and negative tokens -- based on whether they are useful for improving model performance.
Experiments on well-established benchmarks show that this forgetting mechanism not only improves overall model performance but also facilitates more diverse model responses.
arXiv Detail & Related papers (2025-08-06T11:22:23Z)
- Internal Value Alignment in Large Language Models through Controlled Value Vector Activation [70.41805604556058]
We introduce a Controlled Value Vector Activation (ConVA) method to align Large Language Models (LLMs) with human values.
To consistently control values without sacrificing model performance, we introduce a gated value vector activation method.
Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency.
arXiv Detail & Related papers (2025-07-15T13:48:35Z)
- Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating a constraint on every token can be prohibitively expensive.
Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information.
We show that our approach is superior to state-of-the-art baselines.
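As a rough illustration of the lazy-evaluation idea (an assumption-laden sketch, not the paper's algorithm): evaluate the constraint only on tokens that are actually sampled, zeroing out rejected tokens and renormalizing, rather than checking every vocabulary entry up front.

```python
import numpy as np

def constrained_sample(probs, allowed, rng):
    """Sample from `probs`, evaluating the constraint `allowed` lazily:
    only tokens actually drawn are checked. Rejected tokens are zeroed
    out and the remaining mass is renormalized before resampling."""
    p = probs.astype(float).copy()
    while p.sum() > 0:
        p = p / p.sum()
        tok = int(rng.choice(len(p), p=p))
        if allowed(tok):
            return tok
        p[tok] = 0.0  # reject; never draw this token again
    raise ValueError("no token satisfies the constraint")
```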
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
- Less is More: Efficient Black-box Attribution via Minimal Interpretable Subset Selection [52.716143424856185]
We propose LiMA (Less input is More faithful for Attribution), which reformulates the attribution of important regions as an optimization problem for submodular subset selection.
LiMA identifies both the most and least important samples while ensuring an optimal attribution boundary that minimizes errors.
Our method also outperforms the greedy search in attribution efficiency, being 1.6 times faster.
arXiv Detail & Related papers (2025-04-01T06:58:15Z)
- Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference [0.0]
We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference.
EAD switches between different-sized models based on prediction uncertainty.
We show remarkable efficiency gains across different model families.
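The switching rule can be sketched in a few lines (a hedged illustration; the threshold value and greedy argmax decoding are assumptions, not details from the paper):

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def adaptive_decode_step(small_model, large_model, context, threshold=1.0):
    """Query the cheap model first; fall back to the large model only when
    the small model's predictive entropy exceeds the threshold."""
    probs = small_model(context)
    if entropy(probs) > threshold:
        probs = large_model(context)  # uncertain -> pay for the big model
    return int(np.argmax(probs))
```

When the small model is confident, the large model is never queried, which is where the efficiency gain comes from.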
arXiv Detail & Related papers (2025-02-05T22:15:21Z)
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of the parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- Activation Scaling for Steering and Interpreting Language Models [55.59689963561315]
We argue that successfully intervening on a model is a prerequisite for interpreting its internal workings.
We establish a three-term objective: a successful intervention should flip the correct with the wrong token and vice versa.
Using gradient-based optimization, this objective lets us learn (and later evaluate) a specific kind of efficient and interpretable intervention.
arXiv Detail & Related papers (2024-10-07T12:01:32Z)
- TernaryLLM: Ternarized Large Language Model [29.29122031050894]
Large language models (LLMs) have achieved remarkable performance on Natural Language Processing (NLP) tasks.
We introduce Dual Learnable Ternarization (DLT), which enables both scales and shifts to be learnable.
We also propose Outlier-Friendly Feature Knowledge Distillation (OFF) to recover the information lost in extremely low-bit quantization.
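A minimal sketch of the dual-learnable idea (the threshold heuristic and the fixed alpha/beta passed as arguments are illustrative; in DLT both scale and shift are learned by gradient descent):

```python
import numpy as np

def dual_ternarize(w, alpha, beta, delta=None):
    """Approximate weights as alpha * t + beta with t in {-1, 0, +1}.
    In DLT both alpha (scale) and beta (shift) would be learnable
    parameters; here they are plain arguments for illustration."""
    shifted = w - beta
    if delta is None:
        delta = 0.75 * np.abs(shifted).mean()  # common ternary threshold heuristic
    t = np.where(shifted > delta, 1, np.where(shifted < -delta, -1, 0))
    return alpha * t + beta, t
```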
arXiv Detail & Related papers (2024-06-11T11:40:12Z)
- Advancing the Robustness of Large Language Models through Self-Denoised Smoothing [50.54276872204319]
Large language models (LLMs) have achieved significant success, but their vulnerability to adversarial perturbations has raised considerable concerns.
We propose to leverage the multitasking nature of LLMs to first denoise the noisy inputs and then to make predictions based on these denoised versions.
Unlike previous denoised smoothing techniques in computer vision, which require training a separate denoising model, our method offers significantly better efficiency and flexibility.
arXiv Detail & Related papers (2024-04-18T15:47:00Z)
- Effective internal language model training and fusion for factorized transducer model [26.371223360905557]
Internal language model (ILM) of the neural transducer has been widely studied.
We propose a novel ILM training and decoding strategy for factorized transducer models.
arXiv Detail & Related papers (2024-04-02T08:01:05Z)
- Dynamic Transformers Provide a False Sense of Efficiency [75.39702559746533]
Multi-exit models make a trade-off between efficiency and accuracy, where the saving of computation comes from an early exit.
We propose a simple yet effective attacking framework, SAME, which is specially tailored to reduce the efficiency of the multi-exit models.
Experiments on the GLUE benchmark show that SAME can effectively diminish the efficiency gain of various multi-exit models by 80% on average.
arXiv Detail & Related papers (2023-05-20T16:41:48Z)
- On the Eigenvalues of Global Covariance Pooling for Fine-grained Visual Recognition [65.67315418971688]
We show that truncating small eigenvalues of Global Covariance Pooling (GCP) yields smoother gradients.
However, on fine-grained datasets, truncating the small eigenvalues makes the model fail to converge.
Inspired by this observation, we propose a network branch dedicated to magnifying the importance of small eigenvalues.
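For intuition, the spectrum manipulation can be sketched with an eigenvalue power of the pooled covariance (a generic illustration, not the paper's dedicated network branch): a power p < 1 flattens the spectrum, relatively magnifying small eigenvalues instead of discarding them.

```python
import numpy as np

def gcp_power(features, p=0.5):
    """Global covariance pooling followed by an eigenvalue power. With
    p < 1 the spectrum is flattened, which relatively magnifies the
    contribution of small eigenvalues instead of truncating them."""
    xc = features - features.mean(axis=0)
    cov = xc.T @ xc / (len(features) - 1)
    vals, vecs = np.linalg.eigh(cov)
    vals = np.clip(vals, 0.0, None) ** p  # clip guards tiny negative eigenvalues
    return vecs @ np.diag(vals) @ vecs.T
```

With p = 0.5 this is the familiar matrix square root of the covariance, a common normalization in GCP pipelines.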
arXiv Detail & Related papers (2022-05-26T11:41:36Z)
- Distributionally Robust Multilingual Machine Translation [94.51866646879337]
We propose a new learning objective for Multilingual neural machine translation (MNMT) based on distributionally robust optimization.
We show how to practically optimize this objective for large translation corpora using an iterated best response scheme.
Our method consistently outperforms strong baseline methods in terms of average and per-language performance under both many-to-one and one-to-many translation settings.
arXiv Detail & Related papers (2021-09-09T03:48:35Z)
- Cold-start Active Learning through Self-supervised Language Modeling [15.551710499866239]
Active learning aims to reduce annotation costs by choosing the most critical examples to label.
With BERT, we develop a simple strategy based on the masked language modeling loss.
Compared to other baselines, our approach reaches higher accuracy in fewer sampling iterations and less time.
arXiv Detail & Related papers (2020-10-19T14:09:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.