Related papers: Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications

Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications

URL: http://arxiv.org/abs/2310.00867v3
Date: Fri, 16 Feb 2024 18:39:45 GMT
Title: Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications
Authors: Duc N.M Hoang, Minsik Cho, Thomas Merth, Mohammad Rastegari, Zhangyang Wang
Abstract summary: Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks. We propose two conjectures on the nature of the damage: one is certain knowledge being forgotten (or erased) after compression. We introduce a variant called Inference-time Dynamic Prompting (IDP) that can effectively increase prompt diversity without incurring any inference overhead.
Score: 63.29358103217275
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Compressing Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks. In this work, we dive into how compression damages LLMs' inherent knowledge and the possible remedies. We start by proposing two conjectures on the nature of the damage: one is certain knowledge being forgotten (or erased) after LLM compression, hence necessitating the compressed model to (re)learn from data with additional parameters; the other presumes that knowledge is internally displaced and hence one requires merely "inference re-direction" with input-side augmentation such as prompting, to recover the knowledge-related performance. Extensive experiments are then designed to (in)validate the two conjectures. We observe the promise of prompting in comparison to model tuning; we further unlock prompting's potential by introducing a variant called Inference-time Dynamic Prompting (IDP), that can effectively increase prompt diversity without incurring any inference overhead. Our experiments consistently suggest that compared to the classical re-training alternatives such as LoRA, prompting with IDP leads to better or comparable post-compression performance recovery, while saving the extra parameter size by 21x and reducing inference latency by 60%. Our experiments hence strongly endorse the conjecture of "knowledge displaced" over "knowledge forgotten", and shed light on a new efficient mechanism to restore compressed LLM performance. We additionally visualize and analyze the different attention and activation patterns between prompted and re-trained models, demonstrating they achieve performance recovery in two different regimes.

Related papers

EARN: Efficient Inference Acceleration for LLM-based Generative Recommendation by Register Tokens [47.60523011706102]
Large Language Model-based generative recommendation (LLMRec) has achieved notable success, but it suffers from high inference latency.<n>We propose EARN, an efficient inference framework that leverages the early layers to compress information into register tokens placed at the input sequence boundaries.
arXiv Detail & Related papers (2025-07-01T12:42:06Z)
From Parameters to Prompts: Understanding and Mitigating the Factuality Gap between Fine-Tuned LLMs [4.447729258258283]
We study the factuality gap that arises when fine-tuning on known versus unknown knowledge.<n>Our results shed light on the interaction between finetuning data and test-time prompt.
arXiv Detail & Related papers (2025-05-29T12:59:30Z)
GRAIT: Gradient-Driven Refusal-Aware Instruction Tuning for Effective Hallucination Mitigation [62.63014905981601]
Refusal-Aware Instruction Tuning (RAIT) aims to enhance Large Language Models (LLMs) by improving their ability to refuse responses to questions beyond their knowledge. Effective RAIT must address two key challenges: firstly, effectively reject unknown questions to minimize hallucinations; secondly, avoid over-refusal to ensure questions that can be correctly answered are not rejected. GraIT employs gradient-driven sample selection to effectively minimize hallucinations and (2) introduces an adaptive weighting mechanism during fine-tuning to reduce the risk of over-refusal.
arXiv Detail & Related papers (2025-02-09T14:11:30Z)
DESIRE: Dynamic Knowledge Consolidation for Rehearsal-Free Continual Learning [23.878495627964146]
Continual learning aims to equip models with the ability to retain previously learned knowledge like a human. Existing methods usually overlook the issue of information leakage caused by the fact that the experiment data have been used in pre-trained models. In this paper, we propose a new LoRA-based rehearsal-free method named DESIRE.
arXiv Detail & Related papers (2024-11-28T13:54:01Z)
Disentangling Memory and Reasoning Ability in Large Language Models [97.26827060106581]
We propose a new inference paradigm that decomposes the complex inference process into two distinct and clear actions. Our experiment results show that this decomposition improves model performance and enhances the interpretability of the inference process.
arXiv Detail & Related papers (2024-11-20T17:55:38Z)
An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking [50.81324768683995]
FIRST is a novel approach that integrates a learning-to-rank objective and leveraging the logits of only the first generated token. We extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains. Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality.
arXiv Detail & Related papers (2024-11-08T12:08:17Z)
Mixture of Experts Meets Prompt-Based Continual Learning [23.376460019465235]
This paper conducts a theoretical analysis to unravel how prompts bestow such advantages in continual learning. We provide a novel view on prefix tuning, reframing it as the addition of new task-specific experts, thereby inspiring the design of a novel gating mechanism. The effectiveness of NoRGa is substantiated both theoretically and empirically across diverse benchmarks and pretraining paradigms.
arXiv Detail & Related papers (2024-05-23T02:49:57Z)
ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent [50.508669199496474]
We develop a ReAct-style LLM agent with the ability to reason and act upon external knowledge. We refine the agent through a ReST-like method that iteratively trains on previous trajectories. Starting from a prompted large model and after just two iterations of the algorithm, we can produce a fine-tuned small model.
arXiv Detail & Related papers (2023-12-15T18:20:15Z)
The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models [11.156816338995503]
Large language models (LLMs) provide faster inference, smaller memory footprints, and enables local deployment. Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits. Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy. More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
arXiv Detail & Related papers (2023-12-01T22:27:12Z)
R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face their challenges. Previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not. We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning) Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning [47.904127007515925]
We study a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction. We prove that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic approximation guarantees as their counterparts. Notably, these are the first finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling.
arXiv Detail & Related papers (2023-01-03T04:09:38Z)
Understanding Self-supervised Learning with Dual Deep Networks [74.92916579635336]
We propose a novel framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks. We prove that in each SGD update of SimCLR with various loss functions, the weights at each layer are updated by a emphcovariance operator. To further study what role the covariance operator plays and which features are learned in such a process, we model data generation and augmentation processes through a emphhierarchical latent tree model (HLTM)
arXiv Detail & Related papers (2020-10-01T17:51:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.