Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings
- URL: http://arxiv.org/abs/2602.20164v1
- Date: Wed, 28 Jan 2026 15:27:09 GMT
- Title: Benchmarking Distilled Language Models: Performance and Efficiency in Resource-Constrained Settings
- Authors: Sachin Gopal Wani, Eric Page, Ajay Dholakia, David Ellison
- Abstract summary: We benchmark the performance and computational cost of distilled models against their vanilla and proprietary counterparts. We find that creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation offers a transformative pathway to developing powerful, yet efficient, small language models (SLMs) suitable for resource-constrained environments. In this paper, we benchmark the performance and computational cost of distilled models against their vanilla and proprietary counterparts, providing a quantitative analysis of their efficiency. Our results demonstrate that distillation creates a superior performance-to-compute curve. We find that creating a distilled 8B model is over 2,000 times more compute-efficient than training its vanilla counterpart, while achieving reasoning capabilities on par with, or even exceeding, standard models ten times its size. These findings validate distillation not just as a compression technique, but as a primary strategy for building state-of-the-art, accessible AI.
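The paper reports benchmark numbers rather than code, but the objective behind this kind of distillation is standard. As a rough, minimal sketch (the temperature, mixing weight, and vocabulary size below are placeholder assumptions, not the authors' settings), a logit-level distillation loss looks like this in PyTorch:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Standard KD objective: a soft KL term against the teacher's
    distribution plus a hard cross-entropy term on the gold labels.
    `temperature` and `alpha` are illustrative hyperparameters."""
    # Soften both distributions; the KL term is scaled by T^2 to keep
    # gradient magnitudes comparable across temperatures.
    soft_targets = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_targets,
                  log_target=True, reduction="batchmean") * temperature**2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage with random tensors standing in for model outputs.
student_out = torch.randn(4, 32000)   # (batch, vocab)
teacher_out = torch.randn(4, 32000)
gold = torch.randint(0, 32000, (4,))
print(distillation_loss(student_out, teacher_out, gold))
```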
Related papers
- On-Policy Context Distillation for Language Models [92.82835176360864]
We propose On-Policy Context Distillation (OPCD), a framework that bridges on-policy distillation with context distillation. We demonstrate the effectiveness of OPCD on two important applications: experiential knowledge distillation and system prompt distillation.
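OPCD's exact training loop is not given here, but the abstract's two ingredients can be sketched: the student samples its own continuation (on-policy), while the teacher, which additionally sees a context such as a system prompt, supplies the target distribution (context distillation). The sketch below assumes Hugging Face-style causal LMs with a shared vocabulary; function and variable names are illustrative, not the authors' API.

```python
import torch
import torch.nn.functional as F

def opcd_step(student, teacher, prompt_ids, context_ids, max_new=16):
    """One hedged on-policy context-distillation step: the student
    samples a continuation of the bare prompt, and the teacher, which
    also sees `context_ids`, provides the target distribution over
    those same sampled tokens."""
    with torch.no_grad():
        sampled = student.generate(prompt_ids, max_new_tokens=max_new,
                                   do_sample=True)
    n = sampled.shape[1] - prompt_ids.shape[1]   # tokens the student wrote
    # Teacher conditions on context + prompt + the student's own sample.
    teacher_in = torch.cat([context_ids, sampled], dim=1)
    with torch.no_grad():
        t_logits = teacher(teacher_in).logits[:, -n - 1:-1]
    s_logits = student(sampled).logits[:, -n - 1:-1]
    # KL on student-generated tokens: the "on-policy" part of OPCD.
    return F.kl_div(F.log_softmax(s_logits, -1),
                    F.log_softmax(t_logits, -1),
                    log_target=True, reduction="batchmean")
```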
arXiv Detail & Related papers (2026-02-12T18:58:28Z)
- Towards Distillation-Resistant Large Language Models: An Information-Theoretic Perspective [52.25797439810419]
Existing defenses focus exclusively on text-based distillation, leaving the important logit-based distillation largely unexplored. We characterize distillation-relevant information in teacher outputs using the conditional mutual information (CMI) between teacher logits and input queries conditioned on ground-truth labels. We derive a CMI-inspired anti-distillation objective to optimize a transformation of the teacher logits, which effectively removes distillation-relevant information while preserving output utility.
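The CMI estimator itself is beyond a short sketch, but the shape of the defense, transforming the teacher's logits so the predicted label survives while finer-grained distributional detail is suppressed, can be illustrated with a crude heuristic stand-in (explicitly not the paper's objective):

```python
import torch

def protect_logits(logits, temperature=5.0):
    """Toy stand-in for an anti-distillation transform: keep the
    teacher's top-1 prediction intact (preserving task utility) while
    flattening the rest of the distribution, where much of the 'dark
    knowledge' a student would distill lives. The paper optimizes a
    CMI-based objective instead; this heuristic only shows the idea."""
    top1 = logits.argmax(dim=-1, keepdim=True)
    flattened = logits / temperature          # blur relative margins
    # Re-boost the argmax so the returned output keeps the same label.
    boost = flattened.max(dim=-1, keepdim=True).values + 1.0
    return flattened.scatter(-1, top1, boost)

out = protect_logits(torch.randn(2, 10))
print(out.argmax(-1))   # same labels as the raw logits
```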
arXiv Detail & Related papers (2026-02-03T11:16:59Z)
- Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method [1.5839621757142595]
We propose a novel adaptive distillation framework that dynamically augments training data in regions of high student model loss. Our method identifies underperforming regions in the embedding space and generates targeted synthetic examples to guide student learning.
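As a hedged sketch of the core loop: select the highest-loss examples in embedding space and synthesize new points near them. The Gaussian jitter below is a simplifying assumption standing in for the paper's targeted generation.

```python
import torch

def select_and_augment(embeddings, losses, k=8, noise_scale=0.05):
    """Rank training examples by student loss, take the k worst, and
    synthesize new points by jittering their embeddings. SAGE generates
    targeted synthetic examples; jitter is an illustrative stand-in."""
    worst = losses.topk(k).indices                 # high-loss region
    anchors = embeddings[worst]
    synthetic = anchors + noise_scale * torch.randn_like(anchors)
    return synthetic

embs = torch.randn(128, 64)          # per-example embeddings
per_example_loss = torch.rand(128)   # student losses (stand-in values)
print(select_and_augment(embs, per_example_loss).shape)  # (8, 64)
```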
arXiv Detail & Related papers (2025-08-20T15:29:00Z)
- Honey, I Shrunk the Language Model: Impact of Knowledge Distillation Methods on Performance and Explainability [3.224880576815583]
The high computational and storage demands of Large Language Models limit their deployment in resource-constrained environments. Previous research has introduced several distillation methods, both for generating training data and for training the student model. Despite their relevance, the effects of state-of-the-art distillation methods on model performance and explainability have not been thoroughly investigated.
arXiv Detail & Related papers (2025-04-22T17:32:48Z)
- Numerical Pruning for Efficient Autoregressive Models [87.56342118369123]
This paper focuses on compressing decoder-only transformer-based autoregressive models through structural weight pruning. Specifically, we propose a training-free pruning method that calculates a numerical score with Newton's method for the Attention and MLP modules, respectively. To verify the effectiveness of our method, we provide both theoretical support and extensive experiments.
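Reproducing the Newton-based score is out of scope for a short example, but the place such a per-module score occupies in structural pruning can be shown with a plain magnitude stand-in (explicitly not the paper's method):

```python
import torch

def head_importance(attn_out_weight, num_heads):
    """Simplified stand-in for a structural pruning score: the L2 norm
    of each attention head's slice of the output projection. The paper
    derives its score with Newton's method; plain magnitude is used
    here only to show where a per-head score plugs into pruning."""
    d_model = attn_out_weight.shape[1]
    head_dim = d_model // num_heads
    # (d_model, num_heads, head_dim) -> one scalar per head
    per_head = attn_out_weight.view(-1, num_heads, head_dim)
    return per_head.norm(dim=(0, 2))

W_o = torch.randn(768, 768)          # output projection, 12 heads
scores = head_importance(W_o, num_heads=12)
prune = scores.argsort()[:2]         # drop the 2 least important heads
print(scores, prune)
```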
arXiv Detail & Related papers (2024-12-17T01:09:23Z)
- Efficient Point Cloud Classification via Offline Distillation Framework and Negative-Weight Self-Distillation Technique [46.266960248570086]
We introduce an innovative offline recording strategy that avoids the simultaneous loading of both teacher and student models.
This approach feeds a multitude of augmented samples into the teacher model, recording both the data augmentation parameters and the corresponding logit outputs.
Experimental results demonstrate that the proposed distillation strategy enables the student model to achieve performance comparable to state-of-the-art models.
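The recording step can be sketched directly from the abstract: run augmented samples through the teacher once, store the augmentation parameters with the logits, and train the student later from the saved records. Everything below (the toy teacher, jitter augmentation, and file name) is an illustrative assumption.

```python
import torch
import torch.nn as nn

class ToyTeacher(nn.Module):
    """Stand-in point-cloud classifier (the real teacher is large)."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(3, 10)
    def forward(self, x):                 # x: (batch, points, 3)
        return self.fc(x).mean(dim=1)     # pool over points -> logits

def jitter(x, sigma=0.01):
    """Toy augmentation that also returns its parameters."""
    return x + sigma * torch.randn_like(x), {"sigma": sigma}

def record_teacher(teacher, clouds, augment, out_path="teacher_log.pt"):
    """Run augmented samples through the teacher once, saving the
    augmentation parameters alongside the logits. Student training
    later replays these records, so teacher and student never need to
    be loaded at the same time."""
    records = []
    teacher.eval()
    with torch.no_grad():
        for x in clouds:
            x_aug, params = augment(x)
            logits = teacher(x_aug.unsqueeze(0)).squeeze(0)
            records.append({"params": params, "logits": logits})
    torch.save(records, out_path)

record_teacher(ToyTeacher(), [torch.randn(16, 3) for _ in range(4)], jitter)
```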
arXiv Detail & Related papers (2024-09-03T16:12:12Z)
- CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination [28.061239778773423]
Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over a wide range of tasks. However, CLIP heavily relies on a substantial corpus of pre-training data, resulting in notable consumption of computational resources. We introduce CLIP-CID, a novel distillation mechanism that effectively transfers knowledge from a large vision-language foundation model to a smaller model.
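A minimal sketch of the cluster-instance idea, aligning student and teacher embeddings both per instance and per cluster centroid, follows; the one-shot centroid assignment is a simplifying assumption, not the paper's clustering procedure.

```python
import torch
import torch.nn.functional as F

def cluster_instance_loss(student_emb, teacher_emb, num_clusters=4):
    """Align the student with the teacher at two granularities: per
    sample (instance term) and per cluster mean (cluster term). Using
    the first few teacher vectors as centroids is a crude stand-in for
    real clustering."""
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    instance = (1 - (s * t).sum(-1)).mean()      # per-sample alignment
    centroids = t[:num_clusters]                 # crude centroids
    assign = (t @ centroids.T).argmax(-1)        # nearest-centroid labels
    cluster = 0.0
    for c in range(num_clusters):
        mask = assign == c
        if mask.any():
            cluster = cluster + (1 - F.cosine_similarity(
                s[mask].mean(0), t[mask].mean(0), dim=0))
    return instance + cluster / num_clusters

print(cluster_instance_loss(torch.randn(32, 64), torch.randn(32, 64)))
```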
arXiv Detail & Related papers (2024-08-18T11:23:21Z)
- BOOT: Data-free Distillation of Denoising Diffusion Models with Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present BOOT, a novel technique that overcomes these limitations with an efficient data-free distillation algorithm.
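The bootstrapping idea can be sketched without any training data: only noise is sampled, and the student's own output at a higher noise level, pushed one (stand-in) teacher step toward a lower level, supervises the student at that lower level. All modules below are toy placeholders, not the paper's architectures.

```python
import torch
import torch.nn as nn

# Tiny stand-ins: the real teacher is a pretrained diffusion model and
# the student is a single-step generator; both are placeholders here.
teacher = nn.Sequential(nn.Linear(17, 64), nn.ReLU(), nn.Linear(64, 16))
student = nn.Sequential(nn.Linear(17, 64), nn.ReLU(), nn.Linear(64, 16))

def with_t(x, t):
    """Append the timestep as an extra input feature."""
    return torch.cat([x, t.expand(x.shape[0], 1)], dim=1)

def boot_step(eps, t_hi, t_lo):
    """One data-free bootstrapping step in the spirit of BOOT: no real
    images are used, only noise `eps`. The student's output at t_hi,
    refined by a crude residual 'teacher step' toward t_lo, becomes the
    target for the student's output at t_lo."""
    with torch.no_grad():
        x_hi = student(with_t(eps, t_hi))             # student at t_hi
        target = x_hi + teacher(with_t(x_hi, t_lo))   # stand-in teacher step
    pred = student(with_t(eps, t_lo))
    return nn.functional.mse_loss(pred, target)

eps = torch.randn(8, 16)
print(boot_step(eps, torch.tensor(0.9), torch.tensor(0.8)))
```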
arXiv Detail & Related papers (2023-06-08T20:30:55Z)
- Too Brittle To Touch: Comparing the Stability of Quantization and Distillation Towards Developing Lightweight Low-Resource MT Models [12.670354498961492]
State-of-the-art machine translation models are often able to adapt to the paucity of data for low-resource languages.
Knowledge Distillation is one popular technique to develop competitive, lightweight models.
arXiv Detail & Related papers (2022-10-27T05:30:13Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
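A minimal top-1-routed Mixture-of-Experts feed-forward layer illustrates the capacity/inference-speed trade the abstract describes; sizes are illustrative, and MoEBERT's importance-guided adaptation of pretrained FFN weights is not modeled here.

```python
import torch
import torch.nn as nn

class MoEFeedForward(nn.Module):
    """Minimal MoE FFN: a router picks one expert per token, so total
    capacity grows with the number of experts while per-token compute
    stays at a single expert's cost."""
    def __init__(self, d_model=64, d_ff=128, num_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(num_experts))

    def forward(self, x):                       # x: (tokens, d_model)
        expert_idx = self.router(x).argmax(-1)  # top-1 routing per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i
            if mask.any():
                out[mask] = expert(x[mask])     # run only routed tokens
        return out

layer = MoEFeedForward()
print(layer(torch.randn(10, 64)).shape)   # torch.Size([10, 64])
```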
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.