Iterative Layer-wise Distillation for Efficient Compression of Large Language Models
- URL: http://arxiv.org/abs/2511.05085v1
- Date: Fri, 07 Nov 2025 09:00:26 GMT
- Title: Iterative Layer-wise Distillation for Efficient Compression of Large Language Models
- Authors: Grigory Kovalev, Mikhail Tikhomirov
- Abstract summary: This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. An improved method based on the ShortGPT approach has been developed, extending it with iterative evaluation of layer importance. Experiments on the Qwen2.5-3B model show that the number of layers can be reduced from 36 to 28 (resulting in a 2.47 billion parameter model) with only a 9.7% quality loss, and to 24 layers with an 18% loss.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work investigates distillation methods for large language models (LLMs) with the goal of developing compact models that preserve high performance. Several existing approaches are reviewed, with a discussion of their respective strengths and limitations. An improved method based on the ShortGPT approach is developed, extending it with iterative evaluation of layer importance. At each step, importance is assessed by measuring the performance degradation caused by removing individual layers, using a set of representative datasets. This process is combined with further training using a joint loss function based on KL divergence and mean squared error. Experiments on the Qwen2.5-3B model show that the number of layers can be reduced from 36 to 28 (resulting in a 2.47 billion parameter model) with only a 9.7% quality loss, and to 24 layers with an 18% loss. The findings suggest that the middle transformer layers contribute less to inference, underscoring the potential of the proposed method for creating efficient models. The results demonstrate the effectiveness of iterative distillation and fine-tuning, making the approach suitable for deployment in resource-limited settings.
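The abstract's procedure is compact enough to sketch. Below is a minimal, illustrative Python/PyTorch version of the two ingredients it describes: (1) scoring each transformer layer by the performance drop its removal causes on representative data, and (2) the joint KL-divergence + MSE loss used during recovery training. The `model.layers` attribute, the `eval_fn` callback, the loss weighting, and the temperature are assumptions for illustration, not the authors' released implementation.

```python
import copy
import torch
import torch.nn.functional as F

def layer_importance_scores(model, eval_fn, datasets):
    """Score each layer by the quality drop caused by deleting it.

    eval_fn(model, datasets) -> float is a hypothetical task-quality
    metric over the representative datasets (higher is better).
    Assumes model.layers is an nn.ModuleList of transformer blocks.
    """
    baseline = eval_fn(model, datasets)
    scores = []
    for i in range(len(model.layers)):
        pruned = copy.deepcopy(model)
        del pruned.layers[i]                      # drop one transformer block
        scores.append(baseline - eval_fn(pruned, datasets))
    return scores                                 # low score = expendable layer

def joint_distill_loss(student_out, teacher_out, alpha=0.5, temperature=2.0):
    """Joint loss from the abstract: KL on logits + MSE on hidden states.

    The 0.5 weighting and temperature are illustrative defaults, not
    values from the paper. Both outputs must be produced with
    output_hidden_states=True (Hugging-Face-style model outputs).
    """
    kl = F.kl_div(
        F.log_softmax(student_out.logits / temperature, dim=-1),
        F.softmax(teacher_out.logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    mse = F.mse_loss(student_out.hidden_states[-1],
                     teacher_out.hidden_states[-1])
    return alpha * kl + (1.0 - alpha) * mse

def iterative_prune(model, eval_fn, datasets, target_num_layers):
    """Remove the least important layer, re-score, and repeat."""
    while len(model.layers) > target_num_layers:
        scores = layer_importance_scores(model, eval_fn, datasets)
        victim = min(range(len(scores)), key=scores.__getitem__)
        del model.layers[victim]
        # ...fine-tune here with joint_distill_loss() against the full
        # teacher before the next scoring pass...
    return model
```

Re-scoring after every removal is what distinguishes this from single-shot ShortGPT-style pruning: once a layer is gone, the relative importance of the remaining layers shifts, so the ranking is recomputed at each step.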
Related papers
- Efficient Mathematical Reasoning Models via Dynamic Pruning and Knowledge Distillation [2.596115982322528]
This paper proposes a lightweight optimization method that integrates dynamic attention head pruning with knowledge distillation. Experiments conducted on both Math23k and ASDiv-A verify the effectiveness of the proposed method.
arXiv Detail & Related papers (2025-11-15T09:21:44Z)
- Synthetic Adaptive Guided Embeddings (SAGE): A Novel Knowledge Distillation Method [1.5839621757142595]
We propose a novel adaptive distillation framework that dynamically augments training data in regions of high student model loss. Our method identifies underperforming regions in the embedding space and generates targeted synthetic examples to guide student learning.
arXiv Detail & Related papers (2025-08-20T15:29:00Z)
- Remote Sensing Image Classification with Decoupled Knowledge Distillation [2.698114369639173]
This paper proposes a lightweight classification method based on knowledge distillation. The proposed method achieves nearly equivalent Top-1 accuracy while reducing the number of parameters by a factor of 6.24.
arXiv Detail & Related papers (2025-05-25T12:06:28Z)
- MGD$^3$: Mode-Guided Dataset Distillation using Diffusion Models [50.2406741245418]
We propose a mode-guided diffusion model leveraging a pre-trained diffusion model. Our approach addresses dataset diversity in three stages: Mode Discovery to identify distinct data modes, Mode Guidance to enhance intra-class diversity, and Stop Guidance to mitigate artifacts in synthetic samples. Our method eliminates the need for fine-tuning diffusion models with distillation losses, significantly reducing computational costs.
arXiv Detail & Related papers (2025-05-25T03:40:23Z)
- One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full-reference and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z)
- Determining Layer-wise Sparsity for Large Language Models Through a Theoretical Perspective [55.90119819642064]
We address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) from a theoretical perspective. The key obstacle is the cumulative effect of reconstruction errors throughout the sparsification process. We derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue.
arXiv Detail & Related papers (2025-02-20T17:51:10Z)
- Feature Alignment-Based Knowledge Distillation for Efficient Compression of Large Language Models [4.737806982257592]
This study proposes a knowledge distillation algorithm based on large language models and feature alignment. The proposed model performs very close to the state-of-the-art GPT-4 model on evaluation metrics such as perplexity, BLEU, ROUGE, and CER.
arXiv Detail & Related papers (2024-12-27T04:37:06Z)
- Efficient Diffusion as Low Light Enhancer [63.789138528062225]
Reflectance-Aware Trajectory Refinement (RATR) is a simple yet effective module to refine the teacher trajectory using the reflectance component of images.
Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT) is an efficient and flexible distillation framework tailored for Low-Light Image Enhancement (LLIE).
arXiv Detail & Related papers (2024-10-16T08:07:18Z)
- LAPTOP-Diff: Layer Pruning and Normalized Distillation for Compressing Diffusion Models [8.679634923220174]
We propose layer pruning and normalized distillation for compressing diffusion models (LAPTOP-Diff). Using the proposed LAPTOP-Diff, we compressed the U-Nets of SDXL and SDM-v1.5, achieving state-of-the-art performance with a minimal 4.0% decline in PickScore at a pruning ratio of 50%.
arXiv Detail & Related papers (2024-04-17T06:32:42Z)
- LaCo: Large Language Model Pruning via Layer Collapse [56.92068213969036]
Transformer-based large language models (LLMs) are witnessing a notable trend of size expansion.
Existing methods such as model quantization, knowledge distillation, and model pruning are constrained by various issues.
We propose a concise layer-wise structured pruner called Layer Collapse (LaCo), in which rear model layers collapse into a prior layer (see the layer-scoring sketch after this list).
arXiv Detail & Related papers (2024-02-17T04:16:30Z)
- LayerCollapse: Adaptive compression of neural networks [13.567747247563108]
Transformer networks outperform prior art in Natural Language Processing and Computer Vision. Such models contain hundreds of millions of parameters, demanding significant computational resources.
We present LayerCollapse, a novel structured pruning method to reduce the depth of fully connected layers.
arXiv Detail & Related papers (2023-11-29T01:23:41Z)
- ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models [70.45441031021291]
Large Vision-Language Models (LVLMs) can understand the world comprehensively by integrating rich information from different modalities.
LVLMs are often problematic due to their massive computational/energy costs and carbon consumption.
We propose Efficient Coarse-to-Fine Layer-Wise Pruning (ECoFLaP), a two-stage coarse-to-fine weight pruning approach for LVLMs.
arXiv Detail & Related papers (2023-10-04T17:34:00Z)
- Layer-wise Linear Mode Connectivity [52.6945036534469]
Averaging neural network parameters is an intuitive method for fusing the knowledge of two independent models.
It is most prominently used in federated learning.
We analyse the performance of the models that result from averaging single layers, or groups of layers.
arXiv Detail & Related papers (2023-07-13T09:39:10Z)
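Several entries above (LaCo, ECoFLaP, the layer-wise sparsity work) share the pattern of scoring transformer layers before pruning or collapsing them. For contrast with the degradation-based scoring in the main paper, here is a hedged sketch of the cosine-similarity layer score popularized by ShortGPT, the baseline the main paper builds on: a layer whose output barely differs from its input is a pruning candidate. The hook-based capture and the `model.layers` layout are assumptions, not any paper's released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def block_influence(model, batches):
    """Approximate ShortGPT-style Block Influence per layer: the average
    of (1 - cosine similarity) between each transformer block's input and
    output hidden states. Low values suggest the block transforms its
    input little and may be removable."""
    num_layers = len(model.layers)
    totals, counts = [0.0] * num_layers, [0] * num_layers
    inputs = {}

    def make_pre_hook(i):
        def pre_hook(module, args):
            inputs[i] = args[0]            # hidden states entering block i
        return pre_hook

    def make_post_hook(i):
        def post_hook(module, args, output):
            # HF-style decoder layers return a tuple; take hidden states
            h_out = output[0] if isinstance(output, tuple) else output
            cos = F.cosine_similarity(inputs[i], h_out, dim=-1)
            totals[i] += (1.0 - cos).mean().item()
            counts[i] += 1
        return post_hook

    handles = []
    for i, layer in enumerate(model.layers):
        handles.append(layer.register_forward_pre_hook(make_pre_hook(i)))
        handles.append(layer.register_forward_hook(make_post_hook(i)))
    try:
        for batch in batches:
            model(**batch)                 # forward pass only, no gradients
    finally:
        for h in handles:
            h.remove()
    return [t / max(c, 1) for t, c in zip(totals, counts)]
```

Scores like these are cheap (one forward pass per batch) but static; the main paper's contribution is to replace a single static ranking with iterative re-evaluation interleaved with distillation.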