Knowledge Injection via Prompt Distillation
- URL: http://arxiv.org/abs/2412.14964v1
- Date: Thu, 19 Dec 2024 15:44:01 GMT
- Title: Knowledge Injection via Prompt Distillation
- Authors: Kalle Kujanpää, Harri Valpola, Alexander Ilin
- Abstract summary: We propose a new fine-tuning technique for learning new knowledge and show that it can reach the performance of RAG. The proposed method is based on the self-distillation approach, which we call prompt distillation.
- Score: 48.66694777454059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In many practical applications, large language models (LLMs) need to incorporate new knowledge not present in their pre-training data. The primary methods for this are fine-tuning and retrieval-augmented generation (RAG). Although RAG has emerged as the industry standard for knowledge injection, fine-tuning has not yet achieved comparable success. In this paper, we propose a new fine-tuning technique for learning new knowledge and show that it can reach the performance of RAG. The proposed method is based on the self-distillation approach, which we call prompt distillation. First, we generate question-answer pairs about the new knowledge. Then, we fine-tune a student model on the question-answer pairs to imitate the output distributions of a teacher model, which additionally receives the new knowledge in its prompt. The student model is identical to the teacher, except it is equipped with a LoRA adapter. This training procedure facilitates distilling the new knowledge from the teacher's prompt into the student's weights.
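To make the training procedure concrete, below is a minimal sketch of prompt distillation as described in the abstract, assuming HuggingFace `transformers` and `peft`; the model name, document, and question-answer pairs are placeholders, and batching, token alignment, and loss weighting are simplified.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"          # placeholder base model
tok = AutoTokenizer.from_pretrained(base)
teacher = AutoModelForCausalLM.from_pretrained(base).eval()   # frozen teacher
student = get_peft_model(                                     # same model plus a LoRA adapter
    AutoModelForCausalLM.from_pretrained(base),
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

document = "..."                             # the new knowledge, seen only by the teacher
qa_pairs = [("What is ...?", "It is ...")]   # synthetic question-answer pairs about it

for question, answer in qa_pairs:
    t_ids = tok(f"{document}\n{question}\n{answer}", return_tensors="pt").input_ids
    s_ids = tok(f"{question}\n{answer}", return_tensors="pt").input_ids
    n_ans = len(tok(answer, add_special_tokens=False).input_ids)

    with torch.no_grad():   # teacher predicts the answer with the document in its prompt
        t_logits = teacher(input_ids=t_ids).logits[:, -n_ans - 1:-1]
    # student predicts the same answer tokens without the document
    s_logits = student(input_ids=s_ids).logits[:, -n_ans - 1:-1]

    # Match the student's answer-token distributions to the teacher's.
    loss = F.kl_div(F.log_softmax(s_logits, dim=-1),
                    F.softmax(t_logits, dim=-1), reduction="batchmean")
    loss.backward()
    opt.step()
    opt.zero_grad()
```

After training, the adapter-equipped student is expected to answer questions about the document without seeing it in the prompt, which is the stated goal of distilling the knowledge into the weights.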
Related papers
- Honey, I Shrunk the Language Model: Impact of Knowledge Distillation Methods on Performance and Explainability [3.224880576815583]
High computational and storage demands of Large Language Models limit their deployment in resource-constrained environments.
Previous research has introduced several distillation methods for both generating training data and for training the student model.
Despite their relevance, the effects of state-of-the-art distillation methods on model performance and explainability have not been thoroughly investigated.
arXiv Detail & Related papers (2025-04-22T17:32:48Z)
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that only task-relevant representations are distilled.
Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models.
Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start.
Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
- ELAD: Explanation-Guided Large Language Models Active Distillation [16.243249111524403]
The deployment and application of Large Language Models (LLMs) is hindered by their memory inefficiency, computational demands, and the high costs of API inferences.
Traditional distillation methods, which transfer the capabilities of LLMs to smaller models, often fail to determine whether the knowledge has been sufficiently transferred.
We propose an Explanation-Guided LLMs Active Distillation (ELAD) framework that employs an active learning strategy to optimize the balance between annotation costs and model performance.
arXiv Detail & Related papers (2024-02-20T15:47:59Z)
- Knowledge-Augmented Reasoning Distillation for Small Language Models in Knowledge-Intensive Tasks [90.11273439036455]
Large Language Models (LLMs) have shown promising performance in knowledge-intensive reasoning tasks.
We propose Knowledge-Augmented Reasoning Distillation (KARD), a novel method that fine-tunes small LMs to generate rationales from LLMs with augmented knowledge retrieved from an external knowledge base.
We empirically show that KARD significantly improves the performance of small T5 and GPT models on the challenging knowledge-intensive reasoning datasets.
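Based only on the summary above, a rough sketch of how such knowledge-augmented rationale-distillation training data could be assembled follows; `llm_generate`, `retrieve`, and `finetune_seq2seq` are hypothetical helpers standing in for the LLM teacher, the retriever over the external knowledge base, and a standard supervised fine-tuning step.

```python
# Hypothetical helpers: llm_generate (large teacher LM), retrieve (external
# knowledge base lookup), finetune_seq2seq (ordinary supervised fine-tuning).
def build_rationale_distillation_data(questions, answers):
    examples = []
    for q, a in zip(questions, answers):
        # The LLM teacher writes a rationale for the known answer.
        rationale = llm_generate(f"Question: {q}\nExplain step by step why the answer is {a}.")
        # Augment the small LM's input with retrieved knowledge.
        passages = retrieve(q, k=3)
        source = "\n".join(passages) + f"\nQuestion: {q}"
        target = f"{rationale}\nAnswer: {a}"
        examples.append((source, target))
    return examples

# finetune_seq2seq(small_lm, build_rationale_distillation_data(train_qs, train_as))
```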
arXiv Detail & Related papers (2023-05-28T13:00:00Z)
- AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression [26.474962405945316]
We present a novel attribution-driven knowledge distillation approach to compress pre-trained language models.
To enhance the knowledge transfer of model reasoning and generalization, we explore multi-view attribution distillation on all potential decisions of the teacher.
arXiv Detail & Related papers (2023-05-17T07:40:12Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- Improved Knowledge Distillation for Pre-trained Language Models via Knowledge Selection [35.515135913846386]
We propose an actor-critic approach to selecting appropriate knowledge to transfer during the process of knowledge distillation.
Experimental results on the GLUE datasets show that our method outperforms several strong knowledge distillation baselines significantly.
arXiv Detail & Related papers (2023-02-01T13:40:19Z)
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that this standard distillation paradigm incurs a serious bias issue: popular items are more heavily recommended after distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- A Closer Look at Knowledge Distillation with Features, Logits, and Gradients [81.39206923719455]
Knowledge distillation (KD) is a substantial strategy for transferring learned knowledge from one neural network model to another.
This work provides a new perspective to motivate a set of knowledge distillation strategies by approximating the classical KL-divergence criteria with different knowledge sources.
Our analysis indicates that logits are generally a more efficient knowledge source and suggests that having sufficient feature dimensions is crucial for the model design.
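For reference, the logit-based knowledge source that the analysis favors is typically distilled with the standard temperature-scaled KL objective; a generic version (not specific to this paper) looks like this:

```python
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, T: float = 2.0):
    """Standard temperature-scaled KL divergence between teacher and student logits."""
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```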
arXiv Detail & Related papers (2022-03-18T21:26:55Z)
- Multi-Teacher Knowledge Distillation for Incremental Implicitly-Refined Classification [37.14755431285735]
We propose a novel Multi-Teacher Knowledge Distillation (MTKD) strategy for incremental learning.
To preserve the superclass knowledge, we use the initial model as a superclass teacher to distill the superclass knowledge for the student model.
We propose a post-processing mechanism, called Top-k prediction restriction, to reduce redundant predictions.
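A small sketch of one plausible reading of the Top-k prediction restriction (the paper's exact rule may differ): keep only the k highest-scoring classes and mask out the rest before producing predictions.

```python
import torch

def topk_restrict(logits: torch.Tensor, k: int = 5) -> torch.Tensor:
    """Keep the k largest scores per row; set all other classes to -inf."""
    top = logits.topk(k, dim=-1)
    masked = torch.full_like(logits, float("-inf"))
    return masked.scatter(-1, top.indices, top.values)
```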
arXiv Detail & Related papers (2022-02-23T09:51:40Z)
- Dynamic Rectification Knowledge Distillation [0.0]
Dynamic Rectification Knowledge Distillation (DR-KD) is a knowledge distillation framework.
DR-KD transforms the student into its own teacher, and if the self-teacher makes wrong predictions while distilling information, the error is rectified prior to the knowledge being distilled.
Our proposed DR-KD performs remarkably well in the absence of a sophisticated cumbersome teacher model.
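A hedged sketch of the rectification step as the summary describes it: when the self-teacher's top prediction disagrees with the ground-truth label, the scores of the predicted and true classes are swapped before the targets are distilled (the exact DR-KD rule may differ).

```python
import torch

def rectify_teacher_logits(teacher_logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Swap the predicted-class and true-class scores on rows the teacher gets wrong."""
    logits = teacher_logits.clone()
    pred = logits.argmax(dim=-1)
    rows = (pred != labels).nonzero(as_tuple=True)[0]
    pred_scores = logits[rows, pred[rows]].clone()
    true_scores = logits[rows, labels[rows]].clone()
    logits[rows, pred[rows]] = true_scores
    logits[rows, labels[rows]] = pred_scores
    return logits
```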
arXiv Detail & Related papers (2022-01-27T04:38:01Z)
- Annealing Knowledge Distillation [5.396407687999048]
We propose an improved knowledge distillation method (called Annealing-KD) by feeding the rich information provided by the teacher's soft-targets incrementally and more efficiently.
This paper includes theoretical and empirical evidence as well as practical experiments to support the effectiveness of our Annealing-KD method.
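One way to read "feeding the teacher's soft-targets incrementally" is a temperature schedule that starts from a heavily smoothed teacher distribution and sharpens it over training; the sketch below is a simplification and not the exact Annealing-KD objective.

```python
import torch.nn.functional as F

def annealed_kd_loss(student_logits, teacher_logits, step: int, total_steps: int,
                     T_start: float = 8.0, T_end: float = 1.0):
    # Temperature decays linearly from T_start to T_end over training.
    frac = step / max(total_steps - 1, 1)
    T = T_start + (T_end - T_start) * frac
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```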
arXiv Detail & Related papers (2021-04-14T23:45:03Z)
- Self-Feature Regularization: Self-Feature Distillation Without Teacher Models [0.0]
Self-Feature Regularization (SFR) is proposed, which uses features in the deep layers to supervise feature learning in the shallow layers.
We first use a generalization-l2 loss to match local features and a many-to-one approach to distill more intensively in the channel dimension.
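A rough sketch of the self-feature idea (deep layers supervising shallow ones) with a plain L2 match; the channel projection `proj` is a hypothetical 1x1 convolution, and the paper's generalization-l2 loss and many-to-one channel distillation are more elaborate than this.

```python
import torch
import torch.nn.functional as F

def self_feature_loss(shallow_feat: torch.Tensor, deep_feat: torch.Tensor,
                      proj: torch.nn.Module) -> torch.Tensor:
    """L2-match a shallow feature map to a (resized) deep feature map of the same network."""
    # Treat the deep feature as a fixed supervision target in this sketch.
    target = F.interpolate(deep_feat, size=shallow_feat.shape[-2:], mode="nearest").detach()
    return F.mse_loss(proj(shallow_feat), target)
```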
arXiv Detail & Related papers (2021-03-12T15:29:00Z)
- Learning Student-Friendly Teacher Networks for Knowledge Distillation [50.11640959363315]
We propose a novel knowledge distillation approach to facilitate the transfer of dark knowledge from a teacher to a student.
Contrary to most of the existing methods that rely on effective training of student models given pretrained teachers, we aim to learn the teacher models that are friendly to students.
arXiv Detail & Related papers (2021-02-12T07:00:17Z)
- Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation [55.34995029082051]
We propose a method to learn to augment for data-scarce domain BERT knowledge distillation.
We show that the proposed method significantly outperforms state-of-the-art baselines on four different tasks.
arXiv Detail & Related papers (2021-01-20T13:07:39Z)
- DE-RRD: A Knowledge Distillation Framework for Recommender System [16.62204445256007]
We propose a knowledge distillation framework for recommender systems, called DE-RRD.
It enables the student model to learn from the latent knowledge encoded in the teacher model as well as from the teacher's predictions.
Our experiments show that DE-RRD outperforms the state-of-the-art competitors and achieves comparable or even better performance to that of the teacher model with faster inference time.
arXiv Detail & Related papers (2020-12-08T11:09:22Z)
- Learning to Reweight with Deep Interactions [104.68509759134878]
We propose an improved data reweighting algorithm, in which the student model provides its internal states to the teacher model.
Experiments on image classification with clean/noisy labels and neural machine translation empirically demonstrate that our algorithm makes significant improvement over previous methods.
arXiv Detail & Related papers (2020-07-09T09:06:31Z)
- Learning From Multiple Experts: Self-paced Knowledge Distillation for Long-tailed Classification [106.08067870620218]
We propose a self-paced knowledge distillation framework, termed Learning From Multiple Experts (LFME).
We refer to these models as 'Experts', and the proposed LFME framework aggregates the knowledge from multiple 'Experts' to learn a unified student model.
We conduct extensive experiments and demonstrate that our method is able to achieve superior performances compared to state-of-the-art methods.
arXiv Detail & Related papers (2020-01-06T12:57:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.