Do Compressed LLMs Forget Knowledge? An Experimental Study with
Practical Implications
- URL: http://arxiv.org/abs/2310.00867v3
- Date: Fri, 16 Feb 2024 18:39:45 GMT
- Title: Do Compressed LLMs Forget Knowledge? An Experimental Study with
Practical Implications
- Authors: Duc N.M Hoang, Minsik Cho, Thomas Merth, Mohammad Rastegari, Zhangyang
Wang
- Abstract summary: Compressing Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks.
We propose two conjectures on the nature of the damage: one is that certain knowledge is forgotten (or erased) after compression; the other is that knowledge is internally displaced.
We introduce a variant called Inference-time Dynamic Prompting (IDP) that can effectively increase prompt diversity without incurring any inference overhead.
- Score: 63.29358103217275
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compressing Large Language Models (LLMs) often leads to reduced performance,
especially for knowledge-intensive tasks. In this work, we dive into how
compression damages LLMs' inherent knowledge and the possible remedies. We
start by proposing two conjectures on the nature of the damage: one is certain
knowledge being forgotten (or erased) after LLM compression, hence
necessitating the compressed model to (re)learn from data with additional
parameters; the other presumes that knowledge is internally displaced and hence
one requires merely "inference re-direction" with input-side augmentation such
as prompting, to recover the knowledge-related performance. Extensive
experiments are then designed to (in)validate the two conjectures. We observe
the promise of prompting in comparison to model tuning; we further unlock
prompting's potential by introducing a variant called Inference-time Dynamic
Prompting (IDP), that can effectively increase prompt diversity without
incurring any inference overhead. Our experiments consistently suggest that
compared to the classical re-training alternatives such as LoRA, prompting with
IDP leads to better or comparable post-compression performance recovery, while
saving the extra parameter size by 21x and reducing inference latency by 60%.
Our experiments hence strongly endorse the conjecture of "knowledge displaced"
over "knowledge forgotten", and shed light on a new efficient mechanism to
restore compressed LLM performance. We additionally visualize and analyze the
different attention and activation patterns between prompted and re-trained
models, demonstrating they achieve performance recovery in two different
regimes.
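To make the "extra parameter size" comparison in the abstract concrete, the following minimal sketch counts the trainable parameters added by LoRA adapters and by a learned soft prompt. The formulas are the standard ones for each technique. The model dimensions, LoRA rank, number of adapted matrices, and prompt length are hypothetical values chosen for illustration; they are not the paper's configuration, and the resulting ratio is not meant to reproduce the reported 21x figure.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int, n_matrices: int) -> int:
    """Trainable parameters added by LoRA.

    Each adapted weight matrix (d_out x d_in) gains two low-rank factors:
    A with shape (rank x d_in) and B with shape (d_out x rank).
    """
    return n_matrices * rank * (d_in + d_out)


def soft_prompt_extra_params(n_tokens: int, d_model: int) -> int:
    """Trainable parameters of a soft prompt added at the input embedding layer."""
    return n_tokens * d_model


# Hypothetical configuration (assumption, not taken from the paper).
d_model = 4096           # hidden size
n_layers = 32            # transformer layers
adapted_per_layer = 2    # e.g. two projection matrices per layer get adapters

lora = lora_extra_params(d_model, d_model, rank=8,
                         n_matrices=adapted_per_layer * n_layers)
prompt = soft_prompt_extra_params(n_tokens=100, d_model=d_model)
ratio = lora / prompt

print(f"LoRA extra parameters:        {lora:,}")
print(f"Soft-prompt extra parameters: {prompt:,}")
print(f"LoRA / prompt ratio:          {ratio:.2f}x")
```

The LoRA count grows with the number of adapted matrices and layers, while an input-side soft prompt grows only with its length, which is why prompting can need far fewer extra parameters. The exact ratio depends entirely on the chosen rank, adapter placement, and prompt length.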
Related papers
- Disentangling Memory and Reasoning Ability in Large Language Models [97.26827060106581]
We propose a new inference paradigm that decomposes the complex inference process into two distinct and clear actions.
Our experiment results show that this decomposition improves model performance and enhances the interpretability of the inference process.
arXiv Detail & Related papers (2024-11-20T17:55:38Z)
- An Early FIRST Reproduction and Improvements to Single-Token Decoding for Fast Listwise Reranking [50.81324768683995]
FIRST is a novel approach that integrates a learning-to-rank objective and leverages the logits of only the first generated token.
We extend the evaluation of FIRST to the TREC Deep Learning datasets (DL19-22), validating its robustness across diverse domains.
Our experiments confirm that fast reranking with single-token logits does not compromise out-of-domain reranking quality.
arXiv Detail & Related papers (2024-11-08T12:08:17Z)
- Mixture of Experts Meets Prompt-Based Continual Learning [23.376460019465235]
This paper conducts a theoretical analysis to unravel how prompts bestow such advantages in continual learning.
We provide a novel view on prefix tuning, reframing it as the addition of new task-specific experts, thereby inspiring the design of a novel gating mechanism.
The effectiveness of this gating mechanism, NoRGa (Non-linear Residual Gates), is substantiated both theoretically and empirically across diverse benchmarks and pretraining paradigms.
arXiv Detail & Related papers (2024-05-23T02:49:57Z)
- ReST meets ReAct: Self-Improvement for Multi-Step Reasoning LLM Agent [50.508669199496474]
We develop a ReAct-style LLM agent with the ability to reason and act upon external knowledge.
We refine the agent through a ReST-like method that iteratively trains on previous trajectories.
Starting from a prompted large model and after just two iterations of the algorithm, we can produce a fine-tuned small model.
arXiv Detail & Related papers (2023-12-15T18:20:15Z)
- The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models [11.156816338995503]
Compressing large language models (LLMs) provides faster inference and smaller memory footprints, and enables local deployment.
Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits.
Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy.
More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
arXiv Detail & Related papers (2023-12-01T22:27:12Z)
- R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face their challenges.
Previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning).
Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
- Temporal Difference Learning with Compressed Updates: Error-Feedback meets Reinforcement Learning [47.904127007515925]
We study a variant of the classical temporal difference (TD) learning algorithm with a perturbed update direction.
We prove that compressed TD algorithms, coupled with an error-feedback mechanism used widely in optimization, exhibit the same non-asymptotic approximation guarantees as their counterparts.
Notably, these are the first finite-time results in RL that account for general compression operators and error-feedback in tandem with linear function approximation and Markovian sampling.
arXiv Detail & Related papers (2023-01-03T04:09:38Z)
- Understanding Self-supervised Learning with Dual Deep Networks [74.92916579635336]
We propose a novel framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks.
We prove that in each SGD update of SimCLR with various loss functions, the weights at each layer are updated by a covariance operator.
To further study what role the covariance operator plays and which features are learned in such a process, we model data generation and augmentation processes through a hierarchical latent tree model (HLTM).
arXiv Detail & Related papers (2020-10-01T17:51:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.