A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone
- URL: http://arxiv.org/abs/2505.12781v1
- Date: Mon, 19 May 2025 07:10:42 GMT
- Title: A Token is Worth over 1,000 Tokens: Efficient Knowledge Distillation through Low-Rank Clone
- Authors: Jitai Hao, Qiang Huang, Hao Liu, Xinyan Xiao, Zhaochun Ren, Jun Yu
- Abstract summary: Low-Rank Clone (LRC) is an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC matches or surpasses state-of-the-art models trained on trillions of tokens while using only 20B tokens, achieving over 1,000x training efficiency.
- Score: 43.277946885969726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Training high-performing Small Language Models (SLMs) remains costly, even with knowledge distillation and pruning from larger teacher models. Existing work often faces three key challenges: (1) information loss from hard pruning, (2) inefficient alignment of representations, and (3) underutilization of informative activations, particularly from Feed-Forward Networks (FFNs). To address these challenges, we introduce Low-Rank Clone (LRC), an efficient pre-training method that constructs SLMs aspiring to behavioral equivalence with strong teacher models. LRC trains a set of low-rank projection matrices that jointly enable soft pruning by compressing teacher weights, and activation clone by aligning student activations, including FFN signals, with those of the teacher. This unified design maximizes knowledge transfer while removing the need for explicit alignment modules. Extensive experiments with open-source teachers (e.g., Llama-3.2-3B-Instruct, Qwen2.5-3B/7B-Instruct) show that LRC matches or surpasses state-of-the-art models trained on trillions of tokens while using only 20B tokens, achieving over 1,000x training efficiency. Our code and model checkpoints are available at https://github.com/CURRENTF/LowRankClone and https://huggingface.co/collections/JitaiHao/low-rank-clone-lrc-6828389e96a93f1d4219dfaf.
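To make the mechanism concrete, here is a minimal PyTorch sketch of the two coupled ideas from the abstract, assuming one trainable projection pair per layer; the class name `LowRankClonedLinear`, the dimensions, and the MSE alignment loss are illustrative choices, not the authors' implementation.

```python
# Illustrative sketch of the Low-Rank Clone (LRC) idea; not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankClonedLinear(nn.Module):
    """Student layer whose weight is a trainable low-rank projection of a frozen teacher weight."""
    def __init__(self, teacher_weight: torch.Tensor, d_student: int):
        super().__init__()
        d_out, d_in = teacher_weight.shape
        self.register_buffer("W_t", teacher_weight)               # frozen teacher weight
        self.P_out = nn.Parameter(torch.empty(d_student, d_out))  # trainable projections
        self.P_in = nn.Parameter(torch.empty(d_in, d_student))
        nn.init.normal_(self.P_out, std=0.02)
        nn.init.normal_(self.P_in, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Soft pruning: compress the teacher weight instead of deleting rows/columns.
        W_s = self.P_out @ self.W_t @ self.P_in                    # (d_student, d_student)
        return x @ W_s.T

def activation_clone_loss(h_student, h_teacher, P_act):
    """Align student activations (incl. FFN signals) with down-projected teacher activations."""
    return F.mse_loss(h_student, h_teacher @ P_act)

# Toy usage: clone one 3072-dim teacher layer into a 1024-dim student layer.
teacher_w = torch.randn(3072, 3072)
layer = LowRankClonedLinear(teacher_w, d_student=1024)
x = torch.randn(2, 8, 1024)
print(layer(x).shape)  # torch.Size([2, 8, 1024])
```

The point of the design, as the abstract describes it, is that the same low-rank projections serve both roles: they compress the frozen teacher weights (soft pruning, so no information is hard-deleted) and they map teacher activations into the student's space for the activation-clone loss, so no separate alignment module is needed.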
Related papers
- ReGATE: Learning Faster and Better with Fewer Tokens in MLLMs [1.1834200163382398]
ReGATE (Reference-Guided Adaptive Token Elision) is an adaptive token pruning method for accelerating MLLM training. It matches the peak accuracy of standard training on MVBench up to 2x faster, using only 35% of the tokens.
arXiv Detail & Related papers (2025-07-29T01:07:09Z)
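The ReGATE summary names the mechanism but not the details, so the sketch below is a hedged guess at reference-guided elision: a frozen reference model scores each token's loss, and only the hardest fraction of tokens (here 35%, echoing the summary) contributes to the training loss. The `keep_fraction` parameter and top-k thresholding are assumptions.

```python
# Hypothetical sketch of reference-guided token elision; details are assumptions.
import torch
import torch.nn.functional as F

def elision_mask(ref_logits, labels, keep_fraction=0.35):
    """Keep only the hardest tokens according to a frozen reference model."""
    per_token = F.cross_entropy(
        ref_logits.flatten(0, 1), labels.flatten(), reduction="none"
    ).view_as(labels)
    k = max(1, int(keep_fraction * labels.numel()))
    threshold = per_token.flatten().topk(k).values.min()
    return per_token >= threshold          # True = token participates in training

def elided_loss(student_logits, labels, mask):
    per_token = F.cross_entropy(
        student_logits.flatten(0, 1), labels.flatten(), reduction="none"
    ).view_as(labels)
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```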
- MiniPLM: Knowledge Distillation for Pre-Training Language Models [109.83741809808483]
MiniPLM is a framework for pre-training student language models (LMs) using large teacher LMs. For efficiency, MiniPLM performs offline teacher inference, allowing KD for multiple student LMs without adding training costs. For flexibility, MiniPLM operates solely on the training corpus, enabling KD across model families.
arXiv Detail & Related papers (2024-10-22T17:40:32Z)
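A rough sketch of the offline-inference idea described above, under the assumption that the teacher's top-k logits are computed once and cached so that any number of students can be distilled without rerunning the teacher; the cache layout, `top_k`, and restricting the KL divergence to the cached slots are all simplifications.

```python
# Sketch of offline teacher inference for KD; cache layout is an assumption.
import torch
import torch.nn.functional as F

@torch.no_grad()
def cache_teacher(teacher, batch_ids, top_k=32):
    """Run the teacher once and store only its top-k logits per position."""
    logits = teacher(batch_ids)                      # (B, T, V)
    vals, idx = logits.topk(top_k, dim=-1)
    return {"values": vals, "indices": idx}          # persist, e.g., torch.save(...)

def offline_kd_loss(student_logits, cache, temperature=1.0):
    """KL between student and cached sparse teacher distribution (over the top-k slots)."""
    s = torch.gather(student_logits, -1, cache["indices"]) / temperature
    t = cache["values"] / temperature
    return F.kl_div(
        F.log_softmax(s, dim=-1), F.softmax(t, dim=-1), reduction="batchmean"
    ) * temperature**2
```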
- Towards Robust and Parameter-Efficient Knowledge Unlearning for LLMs [25.91643745340183]
Large Language Models (LLMs) have demonstrated strong reasoning and memorization capabilities via pretraining on massive textual corpora. This poses risks of privacy and copyright violations, highlighting the need for efficient machine unlearning methods. We propose Low-rank Knowledge Unlearning (LoKU), a novel framework that enables robust and efficient unlearning for LLMs.
arXiv Detail & Related papers (2024-08-13T04:18:32Z)
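The summary does not spell out the objective, so the following is a generic low-rank unlearning sketch rather than LoKU itself: only LoRA-style adapter parameters are trained, with a gradient-ascent term on the forget set and a retention term on a retain set.

```python
# Generic low-rank unlearning sketch; the exact LoKU objective may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen base weight plus a trainable low-rank update (only A, B are trained)."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = nn.Parameter(torch.zeros(rank, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        nn.init.normal_(self.A, std=0.02)
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)

def unlearning_loss(model, forget_ids, retain_ids, lam=1.0):
    """Ascend on the forget set, descend on the retain set; ids: (B, T) token batches."""
    lm = lambda ids: F.cross_entropy(
        model(ids[:, :-1]).flatten(0, 1), ids[:, 1:].flatten()
    )
    return -lm(forget_ids) + lam * lm(retain_ids)
```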
- Distillation Matters: Empowering Sequential Recommenders to Match the Performance of Large Language Model [12.6937643116018]
Large Language Models (LLMs) have been effectively utilized as recommenders, achieving impressive performance.
However, the high inference latency of LLMs significantly restricts their practical deployment.
This work investigates knowledge distillation from cumbersome LLM-based recommendation models to lightweight sequential models.
arXiv Detail & Related papers (2024-05-01T06:23:54Z)
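As a hedged illustration of this setup, a generic ranking-distillation loss: the lightweight sequential student matches the softened distribution of the LLM teacher's scores over a shared candidate set, blended with the ground-truth next-item loss. The blending scheme is an assumption, not the paper's recipe.

```python
# Generic ranking distillation for recommenders; a sketch, not the paper's method.
import torch
import torch.nn.functional as F

def ranking_distillation_loss(student_scores, teacher_scores,
                              target_item, temperature=2.0, alpha=0.5):
    """Blend ground-truth next-item loss with KL to the teacher's ranking distribution.

    student_scores, teacher_scores: (batch, num_candidates)
    target_item: (batch,) index of the observed next item.
    """
    hard = F.cross_entropy(student_scores, target_item)
    soft = F.kl_div(
        F.log_softmax(student_scores / temperature, dim=-1),
        F.softmax(teacher_scores / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature**2
    return alpha * hard + (1 - alpha) * soft
```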
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design the token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
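A coarse sketch of what a token-level RL objective with a generative reward model might look like: per-token weights from the reward model scale the policy-gradient term, and an imitation term regularizes toward the reward model's corrected sequence. The exact weighting in RLMEC may differ.

```python
# Hypothetical token-level RL objective; the real RLMEC formulation may differ.
import torch

def token_level_rl_loss(logprobs, token_rewards, imitation_logprobs, beta=0.1):
    """logprobs: (B, T) log-probs of sampled tokens under the policy.
    token_rewards: (B, T) per-token scores from a generative reward model.
    imitation_logprobs: (B, T) policy log-probs of the reward model's corrected tokens.
    """
    policy_term = -(token_rewards * logprobs).mean()    # push up rewarded tokens
    imitation_term = -imitation_logprobs.mean()         # regularize toward corrections
    return policy_term + beta * imitation_term
```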
- Instruction Distillation Makes Large Language Models Efficient Zero-shot Rankers [56.12593882838412]
We introduce a novel instruction distillation method to rank documents.
We first rank documents using the effective pairwise approach with complex instructions, and then distill the teacher predictions to the pointwise approach with simpler instructions.
Our approach surpasses the performance of existing supervised methods like monoT5 and is on par with the state-of-the-art zero-shot methods.
arXiv Detail & Related papers (2023-11-02T19:16:21Z)
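A sketch of the pairwise-to-pointwise recipe described above: the teacher's pairwise judgments are aggregated into per-document targets (normalized win counts here, an assumed aggregation), and a pointwise student is trained to reproduce that ordering under a simpler instruction.

```python
# Sketch of distilling pairwise rankings into a pointwise scorer; aggregation is assumed.
import itertools
import torch
import torch.nn.functional as F

def pairwise_to_pointwise_targets(prefers, num_docs):
    """prefers(i, j) -> True if the teacher ranks doc i above doc j."""
    wins = torch.zeros(num_docs)
    for i, j in itertools.permutations(range(num_docs), 2):
        if prefers(i, j):
            wins[i] += 1
    return wins / (num_docs - 1)          # normalized win rate per document

def pointwise_distill_loss(student_scores, targets):
    """student_scores, targets: (num_docs,) for one query."""
    return F.kl_div(F.log_softmax(student_scores, dim=-1),
                    F.softmax(targets, dim=-1), reduction="sum")
```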
- Lion: Adversarial Distillation of Proprietary Large Language Models [16.245052771463044]
We propose a novel adversarial distillation framework for more efficient knowledge transfer.
We successfully transfer knowledge from ChatGPT to a student model (named Lion) using a mere 70k training examples.
arXiv Detail & Related papers (2023-05-22T09:49:16Z)
- Explicit Knowledge Transfer for Weakly-Supervised Code Generation [14.758396460685017]
We propose explicit knowledge transfer (EKT) to transfer the code generation ability of an LLM to a smaller model.
EKT uses the few-shot capabilities of a teacher LLM to create NL-code pairs that we then filter for correctness and fine-tune the student on.
We find that EKT not only yields better performance than training with expert iteration, but also outperforms knowledge distillation.
arXiv Detail & Related papers (2022-11-30T04:51:26Z)
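The described pipeline maps naturally onto a generate-filter-finetune loop; `teacher_generate`, `run_tests`, and `finetune` below are hypothetical stand-ins for the actual models and execution harness.

```python
# Sketch of explicit knowledge transfer: generate, filter for correctness, fine-tune.
# teacher_generate / run_tests / finetune are hypothetical placeholders.

def build_ekt_dataset(teacher_generate, run_tests, problems, samples_per_problem=8):
    """Few-shot sample candidate programs from the teacher; keep only ones that pass tests."""
    dataset = []
    for problem in problems:                       # problem: {"nl": ..., "tests": ...}
        for _ in range(samples_per_problem):
            code = teacher_generate(problem["nl"])
            if run_tests(code, problem["tests"]):  # correctness filter
                dataset.append({"nl": problem["nl"], "code": code})
    return dataset

# Usage: pairs = build_ekt_dataset(teacher_generate, run_tests, problems)
#        finetune(student, pairs)   # standard supervised fine-tuning on NL -> code
```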
- L2B: Learning to Bootstrap Robust Models for Combating Label Noise [52.02335367411447]
This paper introduces a simple and effective method named Learning to Bootstrap (L2B).
It enables models to bootstrap themselves using their own predictions without being adversely affected by erroneous pseudo-labels.
It achieves this by dynamically adjusting the importance weight between real observed and generated labels, as well as between different samples through meta-learning.
arXiv Detail & Related papers (2022-02-09T05:57:08Z)
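A simplified sketch of the bootstrap loss described above: each sample's loss mixes the observed (possibly noisy) label with the model's own pseudo-label, with per-sample weights that L2B would obtain by meta-learning on a small clean set; here the weights are taken as given.

```python
# Simplified L2B-style bootstrap loss; the meta-learning step is only indicated.
import torch
import torch.nn.functional as F

def bootstrap_loss(logits, noisy_labels, w_real, w_pseudo):
    """Per-sample weighted mix of the observed-label loss and the pseudo-label loss.

    logits: (B, C); noisy_labels: (B,); w_real, w_pseudo: (B,) importance weights
    that L2B would set by meta-learning on a small clean set (assumed given here).
    """
    pseudo = logits.detach().argmax(dim=-1)                  # model's own prediction
    real_ce = F.cross_entropy(logits, noisy_labels, reduction="none")
    pseudo_ce = F.cross_entropy(logits, pseudo, reduction="none")
    return (w_real * real_ce + w_pseudo * pseudo_ce).mean()
```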
- Efficient Crowd Counting via Structured Knowledge Transfer [122.30417437707759]
Crowd counting is an application-oriented task and its inference efficiency is crucial for real-world applications.
We propose a novel Structured Knowledge Transfer framework to generate a lightweight but still highly effective student network.
Our models obtain at least 6.5x speed-up on an Nvidia 1080 GPU and even achieve state-of-the-art performance.
arXiv Detail & Related papers (2020-03-23T08:05:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.