Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models
- URL: http://arxiv.org/abs/2412.14528v2
- Date: Sat, 18 Jan 2025 08:26:11 GMT
- Title: Multi-Level Optimal Transport for Universal Cross-Tokenizer Knowledge Distillation on Language Models
- Authors: Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, Houqiang Li
- Abstract summary: Multi-Level Optimal Transport (MultiLevelOT) is a novel approach that advances optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness.
- Score: 81.74999702045339
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation (KD) has become a prevalent technique for compressing large language models (LLMs). Existing KD methods are constrained by the need for identical tokenizers (i.e., vocabularies) between teacher and student models, limiting their versatility in handling LLMs of different architecture families. In this paper, we introduce the Multi-Level Optimal Transport (MultiLevelOT), a novel approach that advances optimal transport for universal cross-tokenizer knowledge distillation. Our method aligns the logit distributions of the teacher and the student at both token and sequence levels using diverse cost matrices, eliminating the need for dimensional or token-by-token correspondence. At the token level, MultiLevelOT integrates both global and local information by jointly optimizing all tokens within a sequence to enhance robustness. At the sequence level, we efficiently capture complex distribution structures of logits via the Sinkhorn distance, which approximates the Wasserstein distance for divergence measures. Extensive experiments on tasks such as extractive QA, generative QA, and summarization demonstrate that MultiLevelOT outperforms state-of-the-art cross-tokenizer KD methods under various settings. Our approach is robust to different student and teacher models across model families, architectures, and parameter sizes. Codes and models are available at https://github.com/2018cx/Multi-Level-OT.
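For illustration, here is a minimal sketch of the sequence-level idea: entropic-regularized Sinkhorn iterations approximating the Wasserstein distance between teacher and student logit distributions of different sizes. This is not the authors' released code; the rank-based cost matrix and hyperparameters are illustrative stand-ins.

```python
# Minimal Sinkhorn-distance sketch between teacher and student token
# distributions. Illustrative only: the paper's actual cost matrices differ.
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def sinkhorn_distance(teacher_logits, student_logits, eps=0.1, n_iters=100):
    """Entropic OT between logit-induced distributions of different sizes.

    Sorting by probability mass gives a rank-based correspondence, so no
    dimensional or token-by-token match between vocabularies is required.
    """
    p = np.sort(softmax(teacher_logits))[::-1]  # teacher marginal, rank-ordered
    q = np.sort(softmax(student_logits))[::-1]  # student marginal, rank-ordered
    C = np.abs(p[:, None] - q[None, :])         # illustrative rank-based cost
    K = np.exp(-C / eps)                        # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iters):                    # Sinkhorn fixed-point updates
        v = q / (K.T @ u)
        u = p / (K @ v)
    T = u[:, None] * K * v[None, :]             # optimal transport plan
    return float((T * C).sum())

# Example: teacher and student with mismatched vocabulary sizes.
d = sinkhorn_distance(np.random.randn(128), np.random.randn(96))
```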
Related papers
- RSQ: Learning from Important Tokens Leads to Better Quantized LLMs [65.5558181902098]
Layer-wise quantization is a key technique for efficiently compressing large models without expensive retraining.
We propose RSQ (Rotate, Scale, then Quantize), which applies rotations to the model to mitigate outliers.
We demonstrate that RSQ consistently outperforms baseline methods across multiple downstream tasks and three model families.
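A rough sketch of the rotate-scale-quantize recipe follows; it is a hedged illustration, since RSQ chooses its rotations and scales carefully (e.g., weighting important tokens), whereas a random orthogonal rotation stands in here.

```python
# Hedged sketch of a rotate-then-scale-then-quantize step, not RSQ itself:
# a random orthogonal rotation spreads weight outliers across channels.
import numpy as np

def rotate_scale_quantize(W, bits=4, seed=0):
    rng = np.random.default_rng(seed)
    # Random orthogonal matrix via QR decomposition.
    Q, _ = np.linalg.qr(rng.standard_normal((W.shape[1], W.shape[1])))
    W_rot = W @ Q                                    # rotate to mitigate outliers
    qmax = 2 ** (bits - 1) - 1
    scale = np.maximum(np.abs(W_rot).max(axis=0) / qmax, 1e-12)  # per-column scale
    W_q = np.clip(np.round(W_rot / scale), -qmax - 1, qmax).astype(np.int8)
    return W_q, scale, Q                             # dequantize: (W_q * scale) @ Q.T
```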
arXiv Detail & Related papers (2025-03-03T18:46:33Z)
- CoT2Align: Cross-Chain of Thought Distillation via Optimal Transport Alignment for Language Models with Different Tokenizers [45.59157559718677]
Large Language Models (LLMs) achieve state-of-the-art performance across various NLP tasks but face deployment challenges due to high computational costs and memory constraints.
Knowledge distillation (KD) is a promising solution, transferring knowledge from large teacher models to smaller student models.
We propose CoT2Align, a universal KD framework that integrates Chain-of-Thought (CoT) augmentation and introduces Cross-CoT Alignment to enhance reasoning transfer.
arXiv Detail & Related papers (2025-02-24T03:30:29Z)
- Enhancing Cross-Tokenizer Knowledge Distillation with Contextual Dynamical Mapping [85.48043537327258]
Contextual Dynamic Mapping (CDM) is a novel cross-tokenizer distillation framework.
It uses contextual information to enhance sequence alignment precision and dynamically improve vocabulary mapping.
Our method shows significant advantages over existing cross-tokenizer distillation baselines across diverse benchmarks.
arXiv Detail & Related papers (2025-02-16T12:46:07Z)
- LC-Protonets: Multi-Label Few-Shot Learning for World Music Audio Tagging [65.72891334156706]
We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification.
LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items.
Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music.
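A toy sketch of the label-combination idea as inferred from the summary; the function and data layout here are assumptions, not the authors' implementation, which operates on audio embeddings.

```python
# Toy sketch: one prototype per label combination drawn from the power set of
# each support item's labels. Names and details are assumed from the summary.
from itertools import combinations
import numpy as np

def lc_prototypes(embeddings, label_sets):
    buckets = {}
    for emb, labels in zip(embeddings, label_sets):
        for r in range(1, len(labels) + 1):          # every non-empty subset
            for combo in combinations(sorted(labels), r):
                buckets.setdefault(combo, []).append(emb)
    # Prototype = mean embedding of items covering that label combination.
    return {c: np.mean(v, axis=0) for c, v in buckets.items()}

protos = lc_prototypes([np.ones(8), np.zeros(8)], [{"rock", "folk"}, {"folk"}])
```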
arXiv Detail & Related papers (2024-09-17T15:13:07Z)
- Token-level Correlation-guided Compression for Efficient Multimodal Document Understanding [54.532578213126065]
Most document understanding methods preserve all tokens within sub-images and treat them equally.
This neglects differences in token informativeness and leads to a significant increase in the number of image tokens.
We propose Token-level Correlation-guided Compression, a parameter-free and plug-and-play methodology to optimize token processing.
arXiv Detail & Related papers (2024-07-19T16:11:15Z)
- HEAL: Brain-inspired Hyperdimensional Efficient Active Learning [13.648600396116539]
We introduce Hyperdimensional Efficient Active Learning (HEAL), a novel Active Learning framework tailored for HDC classification.
HEAL proactively annotates unlabeled data points via uncertainty- and diversity-guided acquisition, leading to more efficient dataset annotation and lower labor costs.
Our evaluation shows that HEAL surpasses a diverse set of baselines in AL quality and achieves notably faster acquisition than many BNN-powered or diversity-guided AL methods.
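A generic sketch of the uncertainty-plus-diversity acquisition loop the summary describes; the HDC-specific scoring is omitted, and all names here are illustrative.

```python
# Generic uncertainty + diversity acquisition (illustrative; HEAL's
# hyperdimensional-specific machinery is not reproduced here).
import numpy as np

def acquire(features, uncertainty, k, lam=0.5):
    """Greedily pick k points: high uncertainty, far from already-picked ones."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    picked = []
    for _ in range(k):
        if picked:
            sim = feats @ feats[picked].T            # cosine similarity to chosen set
            diversity = 1.0 - sim.max(axis=1)        # distance to nearest chosen point
        else:
            diversity = np.ones(len(feats))
        score = lam * uncertainty + (1 - lam) * diversity
        score[picked] = -np.inf                      # never re-pick a point
        picked.append(int(score.argmax()))
    return picked
```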
arXiv Detail & Related papers (2024-02-17T08:41:37Z)
- VECO 2.0: Cross-lingual Language Model Pre-training with Multi-granularity Contrastive Learning [56.47303426167584]
We propose a cross-lingual pre-trained model, VECO 2.0, based on contrastive learning with multi-granularity alignments.
Specifically, the sequence-to-sequence alignment is induced to maximize the similarity of the parallel pairs and minimize the non-parallel pairs.
Token-to-token alignment is integrated to bridge the gap between synonymous tokens, mined via a thesaurus dictionary, and the other unpaired tokens in a bilingual instance.
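The sequence-to-sequence alignment can be sketched as a standard InfoNCE-style contrastive loss over parallel sentence pairs; this is a hedged stand-in, not VECO 2.0's exact objective.

```python
# InfoNCE-style sketch of sequence-to-sequence alignment: parallel pairs are
# positives, all other in-batch pairings are negatives. A stand-in for the
# paper's multi-granularity objective, not its exact formulation.
import numpy as np

def seq2seq_contrastive_loss(src_embs, tgt_embs, tau=0.07):
    a = src_embs / np.linalg.norm(src_embs, axis=1, keepdims=True)
    b = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    logits = a @ b.T / tau                           # pairwise similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))              # diagonal = parallel pairs
```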
arXiv Detail & Related papers (2023-04-17T12:23:41Z)
- A Variational Hierarchical Model for Neural Cross-Lingual Summarization [85.44969140204026]
Cross-lingual summarization (CLS) converts a document in one language into a summary in another.
Existing studies on CLS mainly focus on utilizing pipeline methods or jointly training an end-to-end model.
We propose a hierarchical model for the CLS task, based on the conditional variational auto-encoder.
arXiv Detail & Related papers (2022-03-08T02:46:11Z)
- Multi-Level Contrastive Learning for Cross-Lingual Alignment [35.33431650608965]
Cross-lingual pre-trained models such as multilingual BERT (mBERT) have achieved strong performance on various cross-lingual downstream NLP tasks.
This paper proposes a multi-level contrastive learning framework to further improve the cross-lingual ability of pre-trained models.
arXiv Detail & Related papers (2022-02-26T07:14:20Z)
- Transfering Hierarchical Structure with Dual Meta Imitation Learning [4.868214177205893]
We propose a hierarchical meta imitation learning method where the high-level network and sub-skills are iteratively meta-learned with model-agnostic meta-learning.
We achieve state-of-the-art few-shot imitation learning performance on the Meta-World benchmark and competitive results on long-horizon tasks in Kitchen environments.
arXiv Detail & Related papers (2022-01-28T08:22:38Z)
- HRKD: Hierarchical Relational Knowledge Distillation for Cross-domain Language Model Compression [53.90578309960526]
Large pre-trained language models (PLMs) have shown overwhelming performance compared with traditional neural network methods.
We propose a hierarchical relational knowledge distillation (HRKD) method to capture both hierarchical and domain relational information.
arXiv Detail & Related papers (2021-10-16T11:23:02Z)
- SetMargin Loss applied to Deep Keystroke Biometrics with Circle Packing Interpretation [67.0845003374569]
This work presents a new deep learning approach for keystroke biometrics based on a novel Distance Metric Learning (DML) method.
We prove experimentally the effectiveness of the proposed approach on a challenging task: keystroke biometric identification over a large set of 78,000 subjects.
arXiv Detail & Related papers (2021-09-02T13:26:57Z)
- BERT-EMD: Many-to-Many Layer Mapping for BERT Compression with Earth Mover's Distance [25.229624487344186]
High storage and computational costs prevent pre-trained language models from being effectively deployed on resource-constrained devices.
We propose a novel BERT distillation method based on many-to-many layer mapping.
Our model can learn from different teacher layers adaptively for various NLP tasks.
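A compact sketch of EMD-style many-to-many layer mapping follows; it uses entropic OT as a stand-in for exact Earth Mover's Distance, and the per-layer feature shapes and cost are assumptions, not BERT-EMD's actual formulation.

```python
# Sketch of many-to-many layer mapping: a transport matrix over (student,
# teacher) layer pairs, computed with entropic OT as a stand-in for exact EMD.
# Assumes per-layer features are already projected to a shared dimension.
import numpy as np

def layer_mapping(student_feats, teacher_feats, eps=0.05, iters=200):
    C = np.array([[np.mean((s - t) ** 2) for t in teacher_feats]
                  for s in student_feats])           # layer discrepancy cost
    p = np.full(len(student_feats), 1.0 / len(student_feats))  # student mass
    q = np.full(len(teacher_feats), 1.0 / len(teacher_feats))  # teacher mass
    K = np.exp(-C / eps)
    u = np.ones_like(p)
    for _ in range(iters):                           # Sinkhorn iterations
        v = q / (K.T @ u)
        u = p / (K @ v)
    # Row i weights how much student layer i should learn from each teacher layer.
    return u[:, None] * K * v[None, :]
```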
arXiv Detail & Related papers (2020-10-13T02:53:52Z)