Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation
- URL: http://arxiv.org/abs/2509.26497v1
- Date: Tue, 30 Sep 2025 16:40:55 GMT
- Title: Revealing the Power of Post-Training for Small Language Models via Knowledge Distillation
- Authors: Miao Rang, Zhenni Bi, Hang Zhou, Hanting Chen, An Xiao, Tianyu Guo, Kai Han, Xinghao Chen, Yunhe Wang,
- Abstract summary: We introduce a systematic post-training pipeline that efficiently enhances small model accuracy.<n>The resulting instruction-tuned model achieves state-of-the-art performance.<n>This work provides a practical and efficient solution for developing high-performance language models on Ascend edge devices.
- Score: 43.68215777330875
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of large language models (LLMs) has significantly advanced the capabilities of artificial intelligence across various domains. However, their massive scale and high computational costs render them unsuitable for direct deployment in resource-constrained edge environments. This creates a critical need for high-performance small models that can operate efficiently at the edge. Yet, after pre-training alone, these smaller models often fail to meet the performance requirements of complex tasks. To bridge this gap, we introduce a systematic post-training pipeline that efficiently enhances small model accuracy. Our post training pipeline consists of curriculum-based supervised fine-tuning (SFT) and offline on-policy knowledge distillation. The resulting instruction-tuned model achieves state-of-the-art performance among billion-parameter models, demonstrating strong generalization under strict hardware constraints while maintaining competitive accuracy across a variety of tasks. This work provides a practical and efficient solution for developing high-performance language models on Ascend edge devices.
Related papers
- SVTime: Small Time Series Forecasting Models Informed by "Physics" of Large Vision Model Forecasters [86.38433605933515]
Time series AI is crucial for analyzing dynamic web content.<n>Given their energy-intensive training, inference, and hardware demands, using large models as a one-fits-all solution raises serious concerns about carbon footprint and sustainability.<n>This paper introduces SVTime, a novel Small model inspired by large Vision model (LVM) forecasters for long-term Time series forecasting (LTSF)
arXiv Detail & Related papers (2025-10-10T18:42:23Z) - An Effective Training Framework for Light-Weight Automatic Speech Recognition Models [10.295690160466936]
We introduce an efficacious two-step representation learning based approach capable of producing several small sized models from a single large model.<n>Our approach achieves three-fold training speed-up and up to 12.54% word error rate improvement.
arXiv Detail & Related papers (2025-05-22T17:55:09Z) - Gatekeeper: Improving Model Cascades Through Confidence Tuning [42.1160183944637]
We introduce a novel loss function called Gatekeeper for calibrating smaller models in cascade setups.<n>Our approach fine-tunes the smaller model to confidently handle tasks it can perform correctly while deferring complex tasks to the larger model.
arXiv Detail & Related papers (2025-02-26T17:29:08Z) - Super Tiny Language Models [3.8353434814956517]
This paper introduces a series of research efforts focused on Super Tiny Language Models (STLMs)
We explore innovative techniques such as byte-level tokenization with a pooling mechanism, weight tying, and efficient training strategies.
Our ultimate goal is to make high-performance language models more accessible and practical for a wide range of applications.
arXiv Detail & Related papers (2024-05-23T04:12:49Z) - Majority Kernels: An Approach to Leverage Big Model Dynamics for Efficient Small Model Training [32.154166415680066]
Methods like distillation, compression or quantization help leverage the highly performant large models to induce smaller performant ones.
This paper explores the hypothesis that a single training run can simultaneously train a larger model for performance and derive a smaller model for deployment.
arXiv Detail & Related papers (2024-02-07T17:07:41Z) - When Parameter-efficient Tuning Meets General-purpose Vision-language
Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z) - Retrieval-based Knowledge Transfer: An Effective Approach for Extreme
Large Language Model Compression [64.07696663255155]
Large-scale pre-trained language models (LLMs) have demonstrated exceptional performance in various natural language processing (NLP) tasks.
However, the massive size of these models poses huge challenges for their deployment in real-world applications.
We introduce a novel compression paradigm called Retrieval-based Knowledge Transfer (RetriKT) which effectively transfers the knowledge of LLMs to extremely small-scale models.
arXiv Detail & Related papers (2023-10-24T07:58:20Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z) - Top-KAST: Top-K Always Sparse Training [50.05611544535801]
We propose Top-KAST, a method that preserves constant sparsity throughout training.
We show that it performs comparably to or better than previous works when training models on the established ImageNet benchmark.
In addition to our ImageNet results, we also demonstrate our approach in the domain of language modeling.
arXiv Detail & Related papers (2021-06-07T11:13:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.