Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs
- URL: http://arxiv.org/abs/2402.12030v2
- Date: Tue, 20 Feb 2024 14:46:03 GMT
- Title: Towards Cross-Tokenizer Distillation: the Universal Logit Distillation Loss for LLMs
- Authors: Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Pierre Colombo
- Abstract summary: Knowledge distillation offers a solution by compressing knowledge from resource-intensive large models to smaller ones.
We introduce Universal Logit Distillation (ULD) loss, grounded in optimal transport, to address this limitation.
- Score: 12.412075695071529
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying large language models (LLMs) of several billion parameters can be
impractical in most industrial use cases due to constraints such as cost,
latency limitations, and hardware accessibility. Knowledge distillation (KD)
offers a solution by compressing knowledge from resource-intensive large models
to smaller ones. Various strategies exist, some relying on the text generated
by the teacher model and optionally utilizing its logits to enhance learning.
However, these methods based on logits often require both teacher and student
models to share the same tokenizer, limiting their applicability across
different LLM families. In this paper, we introduce Universal Logit
Distillation (ULD) loss, grounded in optimal transport, to address this
limitation. Our experimental results demonstrate the effectiveness of ULD loss
in enabling distillation across models with different architectures and
tokenizers, paving the way to a more widespread use of distillation techniques.
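The abstract does not spell out the loss itself, so the sketch below is only an illustration of what an optimal-transport logit objective across mismatched vocabularies could look like, not the paper's reference implementation. It assumes the transport term can be approximated by an L1 distance between descending-sorted probability vectors, that the smaller vocabulary is zero-padded, that sequences are crudely aligned by truncation, and that the term is added to the usual cross-entropy; the names uld_loss, distillation_objective, and lambda_uld are illustrative.

```python
import torch
import torch.nn.functional as F


def uld_loss(student_logits: torch.Tensor,
             teacher_logits: torch.Tensor) -> torch.Tensor:
    """Sketch of a ULD-style optimal-transport term over mismatched vocabularies.

    student_logits: (batch, student_seq_len, student_vocab)
    teacher_logits: (batch, teacher_seq_len, teacher_vocab)
    Only sorted probability mass is compared, so no token-to-token mapping
    between the two tokenizers is required.
    """
    p_s = F.softmax(student_logits, dim=-1)
    p_t = F.softmax(teacher_logits, dim=-1)

    # Different tokenizers generally give different sequence lengths for the
    # same text; as a crude alignment, truncate both to the shorter length.
    n = min(p_s.size(1), p_t.size(1))
    p_s, p_t = p_s[:, :n], p_t[:, :n]

    # Zero-pad the smaller vocabulary so both probability vectors have equal length.
    pad = p_s.size(-1) - p_t.size(-1)
    if pad > 0:
        p_t = F.pad(p_t, (0, pad))
    elif pad < 0:
        p_s = F.pad(p_s, (0, -pad))

    # Assumption of this sketch: approximate the transport cost by comparing
    # the two distributions after sorting each in descending order.
    p_s_sorted, _ = p_s.sort(dim=-1, descending=True)
    p_t_sorted, _ = p_t.sort(dim=-1, descending=True)
    return (p_s_sorted - p_t_sorted).abs().sum(dim=-1).mean()


def distillation_objective(student_logits, teacher_logits, target_ids,
                           lambda_uld: float = 0.1) -> torch.Tensor:
    """Cross-entropy on the student's own labels plus the ULD-style term."""
    ce = F.cross_entropy(student_logits.flatten(0, 1), target_ids.flatten())
    return ce + lambda_uld * uld_loss(student_logits, teacher_logits)
```

Because only sorted probability mass is compared, this objective stays defined when teacher and student use different tokenizers and vocabulary sizes, which is exactly the shared-tokenizer restriction the paper sets out to remove.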
Related papers
- PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning [54.73049408950049]
We propose a Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning. Our approach improves unified multimodal retrieval from both structural and learning perspectives.
arXiv Detail & Related papers (2025-07-10T16:47:25Z) - Honey, I Shrunk the Language Model: Impact of Knowledge Distillation Methods on Performance and Explainability [3.224880576815583]
High computational and storage demands of Large Language Models limit their deployment in resource-constrained environments.
Previous research has introduced several distillation methods, both for generating training data and for training the student model.
Despite their relevance, the effects of state-of-the-art distillation methods on model performance and explainability have not been thoroughly investigated.
arXiv Detail & Related papers (2025-04-22T17:32:48Z) - DistilQwen2.5: Industrial Practices of Training Distilled Open Lightweight Language Models [10.34623505096336]
We present DistilQwen2.5, a family of distilled, lightweight language models (LLMs) derived from the public Qwen2.5 models.
These models exhibit enhanced instruction-following capabilities compared to the original models.
To facilitate practical use, we have released all the DistilQwen2.5 models to the open-source community.
arXiv Detail & Related papers (2025-04-21T11:26:02Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations to distill from task-relevant representations only.
Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - LLaVA-KD: A Framework of Distilling Multimodal Large Language Models [72.68665884790002]
We propose a novel framework to transfer knowledge from l-MLLMs to s-MLLMs. We introduce Multimodal Distillation (MDist) to transfer the teacher model's robust representations across both visual and linguistic modalities. We also propose a three-stage training scheme to fully exploit the potential of the proposed distillation strategy.
arXiv Detail & Related papers (2024-10-21T17:41:28Z) - Distillation-Free One-Step Diffusion for Real-World Image Super-Resolution [81.81748032199813]
We propose a Distillation-Free One-Step Diffusion model.
Specifically, we propose a noise-aware discriminator (NAD) to participate in adversarial training.
We improve the perceptual loss with edge-aware DISTS (EA-DISTS) to enhance the model's ability to generate fine details.
arXiv Detail & Related papers (2024-10-05T16:41:36Z) - LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
Evaluations on different benchmarks show that, with the proper strategy, even a 2.7B small-scale model can perform on par with larger models of 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - DDK: Distilling Domain Knowledge for Efficient Large Language Models [40.839056203329136]
Knowledge Distillation (KD) has emerged as an effective strategy to improve the performance of a smaller language model.
This paper introduces DDK, which adjusts the composition of the distillation dataset according to the domain performance differences between the teacher and student models.
Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin.
arXiv Detail & Related papers (2024-07-23T03:47:28Z) - Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
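As a point of reference for the token-level objective above (and for the logit-level distillation in the BiLD entry below), most such losses are variants of the standard temperature-scaled KL divergence between teacher and student distributions over a shared vocabulary. The sketch below shows only that common baseline, with illustrative names; it does not reproduce the paper's distribution-adaptive clipping.

```python
import torch
import torch.nn.functional as F


def token_level_kl_loss(student_logits: torch.Tensor,
                        teacher_logits: torch.Tensor,
                        temperature: float = 2.0) -> torch.Tensor:
    """Vanilla temperature-scaled KL(teacher || student) distillation loss.

    Both tensors have shape (batch, seq_len, vocab) and assume the teacher
    and student share the same tokenizer and vocabulary. The paper's
    distribution-adaptive clipping would modify the teacher distribution
    before this KL term; it is not reproduced here.
    """
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)

    # Per-position KL divergence, summed over the shared vocabulary.
    kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)

    # Scale by T^2 so gradient magnitudes stay comparable across temperatures.
    return (temperature ** 2) * kl.mean()
```

Because both distributions are indexed by the same token ids, this baseline is undefined across different tokenizers, which is the gap the ULD loss of the main paper targets.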
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation [4.577173950430005]
Large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks.
Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model.
In this paper, we explore the task-specific distillation of LLMs at the logit level.
arXiv Detail & Related papers (2024-06-19T13:44:56Z) - PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs [47.35598271306371]
Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings.
Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models.
We present PLaD, a novel preference-based LLM distillation framework.
arXiv Detail & Related papers (2024-06-05T03:08:25Z) - AdaKD: Dynamic Knowledge Distillation of ASR models using Adaptive Loss Weighting [5.818420448447701]
We propose Adaptive Knowledge Distillation (AdaKD), a novel technique inspired by curriculum learning to adaptively weight the losses at the instance level.
Our method follows a plug-and-play paradigm that can be applied on top of any task-specific and distillation objectives.
arXiv Detail & Related papers (2024-05-11T15:06:24Z) - ELAD: Explanation-Guided Large Language Models Active Distillation [16.243249111524403]
The deployment and application of Large Language Models (LLMs) is hindered by their memory inefficiency, computational demands, and the high costs of API inferences.
Traditional distillation methods, which transfer the capabilities of LLMs to smaller models, often fail to determine whether the knowledge has been sufficiently transferred.
We propose an Explanation-Guided LLMs Active Distillation (ELAD) framework that employs an active learning strategy to optimize the balance between annotation costs and model performance.
arXiv Detail & Related papers (2024-02-20T15:47:59Z) - ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
However, deploying LVLMs is often problematic due to their massive computational/energy costs and carbon footprint.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z) - MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs).
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z) - Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation [91.05073136215886]
"Actor-Learner Distillation" transfers learning progress from a large capacity learner model to a small capacity actor model.
We demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model.
arXiv Detail & Related papers (2021-04-04T17:56:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.