BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation
- URL: http://arxiv.org/abs/2406.13555v2
- Date: Wed, 11 Sep 2024 12:19:14 GMT
- Title: BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation
- Authors: Minchong Li, Feng Zhou, Xiaohui Song
- Abstract summary: Large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks.
Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model.
In this paper, we explore the task-specific distillation of LLMs at the logit level.
- Score: 4.577173950430005
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks. However, such impressive performance often comes with the trade-off of an increased parameter size, posing significant challenges for widespread deployment. Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model. In this paper, we explore the task-specific distillation of LLMs at the logit level. Our investigation reveals that the logits of fine-tuned LLMs exhibit a more extreme long-tail distribution than those from vision models, with hidden "noise" in the long tail affecting distillation performance. Furthermore, existing logits distillation methods often struggle to effectively utilize the internal ranking information from the logits. To address these, we propose the Bi-directional Logits Difference (BiLD) loss. The BiLD loss filters out the long-tail noise by utilizing only top-$k$ teacher and student logits, and leverages the internal logits ranking information by constructing logits differences. To evaluate BiLD loss, we conduct comprehensive experiments on 13 datasets using two types of LLMs. Our results show that the BiLD loss, with only the top-8 logits, outperforms supervised fine-tuning (SFT), vanilla KL loss, and five other distillation methods from both NLP and CV fields.
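The abstract names the two ingredients of the BiLD loss: keep only the top-$k$ teacher and student logits to filter long-tail noise, and match pairwise logit differences so the internal ranking information is used. The PyTorch sketch below is only an illustrative reading of that description, not the authors' reference implementation; the function names, the softmax-normalised difference distributions, the temperature `tau`, and the equal weighting of the two directions are assumptions.

```python
import torch
import torch.nn.functional as F


def _logits_diff_kl(ref_logits, pred_logits, tau):
    """KL divergence between softmax-normalised pairwise logit differences.

    ref_logits / pred_logits: (batch, k) logits restricted to k vocabulary positions.
    """
    # diff[b, i, j] = z[b, i] - z[b, j]; flatten the k x k pairs per example.
    ref_diff = (ref_logits.unsqueeze(-1) - ref_logits.unsqueeze(-2)).flatten(1)
    pred_diff = (pred_logits.unsqueeze(-1) - pred_logits.unsqueeze(-2)).flatten(1)
    return F.kl_div(
        F.log_softmax(pred_diff / tau, dim=-1),
        F.softmax(ref_diff / tau, dim=-1),
        reduction="batchmean",
    ) * tau * tau


def bild_style_loss(student_logits, teacher_logits, k=8, tau=1.0):
    """Illustrative bi-directional top-k logits-difference loss (assumed form).

    student_logits / teacher_logits: (batch, vocab_size) logits for one position.
    """
    teacher_logits = teacher_logits.detach()  # the teacher is not updated

    # Teacher-led term: restrict both models to the teacher's top-k positions.
    t_topk, t_idx = teacher_logits.topk(k, dim=-1)
    s_at_t = student_logits.gather(-1, t_idx)
    loss_teacher_led = _logits_diff_kl(t_topk, s_at_t, tau)

    # Student-led term: restrict both models to the student's top-k positions.
    s_topk, s_idx = student_logits.topk(k, dim=-1)
    t_at_s = teacher_logits.gather(-1, s_idx)
    loss_student_led = _logits_diff_kl(t_at_s, s_topk, tau)

    return 0.5 * (loss_teacher_led + loss_student_led)


# Toy usage: random logits standing in for one decoding position.
student = torch.randn(4, 32000, requires_grad=True)
teacher = torch.randn(4, 32000)
loss = bild_style_loss(student, teacher, k=8)
loss.backward()
```

The default `k=8` mirrors the top-8 setting reported in the abstract; in a full training loop the per-token losses would additionally be averaged over sequence positions.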
Related papers
- Multi-MLLM Knowledge Distillation for Out-of-Context News Detection [17.41734069411864]
Multimodal out-of-context news is a type of misinformation in which the image is used outside of its original context. We introduce a two-stage knowledge distillation framework to transfer this knowledge to a student MLLM. In Stage 1, we apply LoRA fine-tuning to the student model using all training data. In Stage 2, we further fine-tune the student model using both LoRA fine-tuning and DPO on the data points where teachers' predictions conflict.
arXiv Detail & Related papers (2025-05-28T16:03:41Z) - ABKD: Pursuing a Proper Allocation of the Probability Mass in Knowledge Distillation via $\alpha$-$\beta$-Divergence [89.630486749083]
Knowledge Distillation (KD) transfers knowledge from a large teacher model to a smaller student model. The core challenge in KD lies in balancing two mode-concentration effects. We propose ABKD, a generic framework with $\alpha$-$\beta$-divergence.
arXiv Detail & Related papers (2025-05-07T16:48:49Z) - Mitigating Forgetting in LLM Fine-Tuning via Low-Perplexity Token Learning [61.99353167168545]
We show that fine-tuning with LLM-generated data improves target task performance and reduces non-target task degradation. This is the first work to provide an empirical explanation based on token perplexity reduction to mitigate catastrophic forgetting in LLMs after fine-tuning.
arXiv Detail & Related papers (2025-01-24T08:18:56Z) - Mentor-KD: Making Small Language Models Better Multi-step Reasoners [15.159415340059388]
We propose Mentor-KD, which effectively distills the multi-step reasoning capability of LLMs to smaller LMs.
We exploit a mentor, an intermediate-sized task-specific fine-tuned model, to generate additional CoT annotations.
We conduct extensive experiments and confirm Mentor-KD's effectiveness across various models and complex reasoning tasks.
arXiv Detail & Related papers (2024-10-11T17:53:27Z) - LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation [41.05687297326706]
LLaVA-MoD is a framework designed to enable the efficient training of small-scale Multimodal Language Models.
We optimize the network structure of s-MLLM by integrating a sparse Mixture of Experts architecture into the language model.
We also propose a progressive knowledge transfer strategy to ensure comprehensive knowledge migration.
arXiv Detail & Related papers (2024-08-28T15:52:23Z) - LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
Evaluations across different benchmarks show that, with a proper strategy, even a 2.7B small-scale model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - DDK: Distilling Domain Knowledge for Efficient Large Language Models [40.839056203329136]
Knowledge Distillation (KD) has emerged as an effective strategy to improve the performance of a smaller language model.
This paper introduces DDK, which adjusts the composition of the distillation dataset according to the domain performance differences between the teacher and student models.
Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin.
arXiv Detail & Related papers (2024-07-23T03:47:28Z) - Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution-adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - Direct Preference Knowledge Distillation for Large Language Models [73.50849692633953]
We propose Direct Preference Knowledge Distillation (DPKD) for large language models (LLMs).
We re-formulate KD of LLMs into two stages: first optimizing an objective consisting of implicit reward and reverse KL divergence.
We prove the value and effectiveness of the introduced implicit reward and output preference in KD through experiments and theoretical analysis.
arXiv Detail & Related papers (2024-06-28T09:23:40Z) - PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs [47.35598271306371]
Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings.
Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models.
We present PLaD, a novel preference-based LLM distillation framework.
arXiv Detail & Related papers (2024-06-05T03:08:25Z) - Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs [12.412075695071529]
Knowledge distillation offers a solution by compressing knowledge from resource-intensive large models to smaller ones.
We introduce Universal Logit Distillation (ULD) loss, grounded in optimal transport, to address this limitation.
arXiv Detail & Related papers (2024-02-19T10:37:29Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes [91.58845026796149]
We introduce Distilling step-by-step, a new mechanism that trains small models that outperform large language models.
We present three findings across 4 NLP benchmarks.
arXiv Detail & Related papers (2023-05-03T17:50:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.