Towards Cross-Tokenizer Distillation: the Universal Logit Distillation
Loss for LLMs
- URL: http://arxiv.org/abs/2402.12030v2
- Date: Tue, 20 Feb 2024 14:46:03 GMT
- Title: Towards Cross-Tokenizer Distillation: the Universal Logit Distillation
Loss for LLMs
- Authors: Nicolas Boizard, Kevin El Haddad, Céline Hudelot, Pierre Colombo
- Abstract summary: Knowledge distillation offers a solution by compressing knowledge from resource-intensive large models to smaller ones.
We introduce Universal Logit Distillation (ULD) loss, grounded in optimal transport, to enable logit-based distillation between models that do not share a tokenizer.
- Score: 12.412075695071529
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deploying large language models (LLMs) of several billion parameters can be
impractical in most industrial use cases due to constraints such as cost,
latency limitations, and hardware accessibility. Knowledge distillation (KD)
offers a solution by compressing knowledge from resource-intensive large models
to smaller ones. Various strategies exist, some relying on the text generated
by the teacher model and optionally utilizing its logits to enhance learning.
However, these methods based on logits often require both teacher and student
models to share the same tokenizer, limiting their applicability across
different LLM families. In this paper, we introduce Universal Logit
Distillation (ULD) loss, grounded in optimal transport, to address this
limitation. Our experimental results demonstrate the effectiveness of ULD loss
in enabling distillation across models with different architectures and
tokenizers, paving the way to a more widespread use of distillation techniques.
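The abstract does not spell out how the loss is computed, so the following is only a minimal sketch of one rank-based way to compare next-token distributions defined over different vocabularies: sort each probability vector, zero-pad the shorter one, and take the L1 distance. Because only sorted probabilities are compared, no token-to-token alignment between the two tokenizers is needed. The function names `softmax` and `sorted_probability_distance` and the example logits are hypothetical, and this is not presented as the authors' implementation.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sorted_probability_distance(teacher_logits, student_logits):
    """L1 distance between descending-sorted probability vectors.

    Assumption (illustrative only): the two vocabularies may differ in
    size, so the shorter vector is padded with zeros before comparison.
    """
    p_teacher = sorted(softmax(teacher_logits), reverse=True)
    p_student = sorted(softmax(student_logits), reverse=True)
    size = max(len(p_teacher), len(p_student))
    p_teacher += [0.0] * (size - len(p_teacher))
    p_student += [0.0] * (size - len(p_student))
    return sum(abs(a - b) for a, b in zip(p_teacher, p_student))


# Hypothetical logits for one position: teacher vocabulary of 5 tokens,
# student vocabulary of 3 tokens.
teacher = [2.0, 1.0, 0.1, -1.0, 0.5]
student = [0.3, 1.7, -0.2]
distance = sorted_probability_distance(teacher, student)
```

In a training setup, a term like this would typically be averaged over sequence positions and added to a standard cross-entropy loss with a weighting coefficient; both choices are assumptions here rather than details taken from the abstract.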
Related papers
- DDK: Distilling Domain Knowledge for Efficient Large Language Models [40.839056203329136]
Knowledge Distillation (KD) has emerged as an effective strategy to improve the performance of a smaller language model.
This paper introduces DDK, which adjusts the composition of the distillation dataset according to the domain performance differences between the teacher and student models.
Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin.
arXiv Detail & Related papers (2024-07-23T03:47:28Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- BiLD: Bi-directional Logits Difference Loss for Large Language Model Distillation [4.577173950430005]
Large language models (LLMs) have shown exceptional capabilities across various natural language processing (NLP) tasks.
Knowledge distillation (KD) provides a solution by transferring knowledge from a large teacher model to a smaller student model.
In this paper, we explore the task-specific distillation of LLMs at the logit level.
arXiv Detail & Related papers (2024-06-19T13:44:56Z)
- PLaD: Preference-based Large Language Model Distillation with Pseudo-Preference Pairs [47.35598271306371]
Large Language Models (LLMs) have exhibited impressive capabilities in various tasks, yet their vast parameter sizes restrict their applicability in resource-constrained settings.
Knowledge distillation (KD) offers a viable solution by transferring expertise from large teacher models to compact student models.
We present PLaD, a novel preference-based LLM distillation framework.
arXiv Detail & Related papers (2024-06-05T03:08:25Z)
- AdaKD: Dynamic Knowledge Distillation of ASR models using Adaptive Loss Weighting [5.818420448447701]
We propose Adaptive Knowledge Distillation, a novel technique inspired by curriculum learning to adaptively weigh the losses at instance level.
Our method follows a plug-and-play paradigm that can be applied on top of any task-specific and distillation objectives.
arXiv Detail & Related papers (2024-05-11T15:06:24Z)
- ELAD: Explanation-Guided Large Language Models Active Distillation [16.243249111524403]
The deployment and application of Large Language Models (LLMs) are hindered by their memory inefficiency, computational demands, and the high costs of API inferences.
Traditional distillation methods, which transfer the capabilities of LLMs to smaller models, often fail to determine whether the knowledge has been sufficiently transferred.
We propose an Explanation-Guided LLMs Active Distillation (ELAD) framework that employs an active learning strategy to optimize the balance between annotation costs and model performance.
arXiv Detail & Related papers (2024-02-20T15:47:59Z)
- ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
LVLMs are often problematic due to their massive computational/energy costs and carbon consumption.
We propose Efficient Coarse-to-Fine LayerWise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z)
- MinT: Boosting Generalization in Mathematical Reasoning via Multi-View Fine-Tuning [53.90744622542961]
Reasoning in mathematical domains remains a significant challenge for small language models (LMs).
We introduce a new method that exploits existing mathematical problem datasets with diverse annotation styles.
Experimental results show that our strategy enables a LLaMA-7B model to outperform prior approaches.
arXiv Detail & Related papers (2023-07-16T05:41:53Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT, which overcomes these limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in an SSL setting.
The co-evolution during pre-training of both dense and gated encoder offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z)
- Efficient Transformers in Reinforcement Learning using Actor-Learner Distillation [91.05073136215886]
"Actor-Learner Distillation" transfers learning progress from a large capacity learner model to a small capacity actor model.
We demonstrate in several challenging memory environments that using Actor-Learner Distillation recovers the clear sample-efficiency gains of the transformer learner model.
arXiv Detail & Related papers (2021-04-04T17:56:34Z)