Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching
- URL: http://arxiv.org/abs/2503.20083v4
- Date: Fri, 24 Oct 2025 08:49:40 GMT
- Title: Universal Cross-Tokenizer Distillation via Approximate Likelihood Matching
- Authors: Benjamin Minixhofer, Ivan Vulić, Edoardo Maria Ponti,
- Abstract summary: Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM.<n>Current distillation methods require similar tokenizers between the teacher and the student, restricting their applicability to only a small subset of teacher-student pairs.<n>We develop a principled cross-tokenizer distillation method to solve this crucial deficiency.<n>Our method is the first to enable effective distillation across fundamentally different tokenizers, while also substantially outperforming prior methods in all other cases.
- Score: 16.385782508179364
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Distillation has shown remarkable success in transferring knowledge from a Large Language Model (LLM) teacher to a student LLM. However, current distillation methods require similar tokenizers between the teacher and the student, restricting their applicability to only a small subset of teacher-student pairs. In this work, we develop a principled cross-tokenizer distillation method to solve this crucial deficiency. Our method is the first to enable effective distillation across fundamentally different tokenizers, while also substantially outperforming prior methods in all other cases. We verify the efficacy of our method on three distinct use cases. First, we show that viewing tokenizer transfer as self-distillation enables unprecedentedly effective transfer across tokenizers, including rapid transfer of subword models to the byte-level. Transferring different models to the same tokenizer also enables ensembling to boost performance. Secondly, we distil a large maths-specialised LLM into a small general-purpose model with a different tokenizer, achieving competitive maths problem-solving performance. Thirdly, we use our method to train state-of-the-art embedding prediction hypernetworks for training-free tokenizer transfer. Our results unlock an expanded range of teacher-student pairs for distillation, enabling new ways to adapt and enhance interaction between LLMs.
Related papers
- Protecting Language Models Against Unauthorized Distillation through Trace Rewriting [31.05181251141126]
Authoritarian use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models.<n>We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence.<n>Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance.
arXiv Detail & Related papers (2026-02-16T19:40:07Z) - AdaSwitch: Adaptive Switching Generation for Knowledge Distillation [58.647880811071495]
Small language models (SLMs) are crucial for applications with strict latency and computational constraints.<n>We propose AdaSwitch, a novel approach that combines on-policy and off-policy generation at the token level.<n>AdaSwitch consistently improves accuracy, offering a practical and effective method for distilling SLMs with acceptable additional overhead.
arXiv Detail & Related papers (2025-10-09T06:38:37Z) - Intra-class Patch Swap for Self-Distillation [3.282914142012984]
We propose a teacher-free distillation framework based on a single student network.<n>Our approach is built on a simple yet highly effective augmentation, called intra-class patch swap augmentation.<n>Our method consistently outperforms both existing self-distillation baselines and conventional teacher-based KD approaches.
arXiv Detail & Related papers (2025-05-20T09:30:19Z) - Swapped Logit Distillation via Bi-level Teacher Alignment [32.746586492281104]
Knowledge distillation (KD) compresses the network capacity by transferring knowledge from a large (teacher) network to a smaller one (student)
We propose a logit-based distillation via swapped logit processing, namely Swapped Logit Distillation (SLD)
We find that SLD consistently performs best among previous state-of-the-art methods.
arXiv Detail & Related papers (2025-04-27T15:52:07Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
Self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representation to distill from task-relevant representations only.<n> Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation [84.38105530043741]
We propose Warmup-Distill, which aligns the distillation of the student to that of the teacher in advance of distillation.<n>Experiments on the seven benchmarks demonstrate that Warmup-Distill could provide a warmup student more suitable for distillation.
arXiv Detail & Related papers (2025-02-17T12:58:12Z) - Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping [85.48043537327258]
Contextual Dynamic Mapping (CDM) is a novel cross-tokenizer distillation framework.<n>It uses contextual information to enhance sequence alignment precision and dynamically improve vocabulary mapping.<n>Our method shows significant advantages over existing cross-tokenizer distillation baselines across diverse benchmarks.
arXiv Detail & Related papers (2025-02-16T12:46:07Z) - Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z) - Enhancing In-Context Learning via Implicit Demonstration Augmentation [26.78252788538567]
In-context learning (ICL) enables pre-trained language models to make predictions for unseen inputs without updating parameters.
Despite its potential, ICL's effectiveness heavily relies on the quality, quantity, and permutation of demonstrations.
In this paper, we tackle this challenge for the first time from the perspective of demonstration augmentation.
arXiv Detail & Related papers (2024-06-27T05:25:46Z) - AICSD: Adaptive Inter-Class Similarity Distillation for Semantic
Segmentation [12.92102548320001]
This paper proposes a novel method called Inter-Class Similarity Distillation (ICSD) for the purpose of knowledge distillation.
The proposed method transfers high-order relations from the teacher network to the student network by independently computing intra-class distributions for each class from network outputs.
Experiments conducted on two well-known datasets for semantic segmentation, Cityscapes and Pascal VOC 2012, validate the effectiveness of the proposed method.
arXiv Detail & Related papers (2023-08-08T13:17:20Z) - Hybrid Distillation: Connecting Masked Autoencoders with Contrastive
Learners [102.20090188997301]
We explore how to obtain a model that combines Contrastive Learning (CL) and Masked Image Modeling (MIM) strengths.
In order to better obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy.
Experiment results prove that Hybrid Distill can achieve superior performance on different benchmarks.
arXiv Detail & Related papers (2023-06-28T02:19:35Z) - HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained
Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z) - It's All in the Head: Representation Knowledge Distillation through
Classifier Sharing [0.29360071145551075]
We introduce two approaches for enhancing representation distillation using classifier sharing between the teacher and student.
We show the effectiveness of the proposed methods on various datasets and tasks, including image classification, fine-grained classification, and face verification.
arXiv Detail & Related papers (2022-01-18T13:10:36Z) - Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.
arXiv Detail & Related papers (2020-10-24T23:15:43Z) - Online Knowledge Distillation via Multi-branch Diversity Enhancement [15.523646047674717]
We propose a new distillation method to enhance the diversity among multiple student models.
We use Feature Fusion Module (FFM), which improves the performance of the attention mechanism in the network.
We also use Diversification(CD) loss function to strengthen the differences between the student models.
arXiv Detail & Related papers (2020-10-02T05:52:12Z) - Deep Semi-supervised Knowledge Distillation for Overlapping Cervical
Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z) - Self-supervised Knowledge Distillation for Few-shot Learning [123.10294801296926]
Few shot learning is a promising learning paradigm due to its ability to learn out of order distributions quickly with only a few samples.
We propose a simple approach to improve the representation capacity of deep neural networks for few-shot learning tasks.
Our experiments show that, even in the first stage, self-supervision can outperform current state-of-the-art methods.
arXiv Detail & Related papers (2020-06-17T11:27:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.