Related papers: Progressive Weight Loading: Accelerating Initial Inference and Gradually Boosting Performance on Resource-Constrained Environments

Progressive Weight Loading: Accelerating Initial Inference and Gradually Boosting Performance on Resource-Constrained Environments

URL: http://arxiv.org/abs/2509.22319v2
Date: Wed, 01 Oct 2025 13:53:12 GMT
Title: Progressive Weight Loading: Accelerating Initial Inference and Gradually Boosting Performance on Resource-Constrained Environments
Authors: Hyunwoo Kim, Junha Lee, Mincheol Choi, Jeonghwan Lee, Jaeshin Cho,
Abstract summary: Progressive Weight Loading (PWL) is a technique that enables fast initial inference by first deploying a lightweight student model, then incrementally replacing its layers with those of a pre-trained teacher model.<n>Our experiments on VGG, ResNet, and ViT architectures demonstrate that models trained with PWL maintain competitive distillation performance and gradually improve accuracy as teacher layers are loaded-matching the final accuracy of the full teacher model.
Score: 8.020686883632594
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Deep learning models have become increasingly large and complex, resulting in higher memory consumption and computational demands. Consequently, model loading times and initial inference latency have increased, posing significant challenges in mobile and latency-sensitive environments where frequent model loading and unloading are required, which directly impacts user experience. While Knowledge Distillation (KD) offers a solution by compressing large teacher models into smaller student ones, it often comes at the cost of reduced performance. To address this trade-off, we propose Progressive Weight Loading (PWL), a novel technique that enables fast initial inference by first deploying a lightweight student model, then incrementally replacing its layers with those of a pre-trained teacher model. To support seamless layer substitution, we introduce a training method that not only aligns intermediate feature representations between student and teacher layers, but also improves the overall output performance of the student model. Our experiments on VGG, ResNet, and ViT architectures demonstrate that models trained with PWL maintain competitive distillation performance and gradually improve accuracy as teacher layers are loaded-matching the final accuracy of the full teacher model without compromising initial inference speed. This makes PWL particularly suited for dynamic, resource-constrained deployments where both responsiveness and performance are critical.

Related papers

ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models [14.202025149504715]
We present ActDistill, a framework that transfers the action prediction capability of any existing VLA model to a lightweight counterpart.<n>We employ a well-trained VLA model as the teacher and introduce a graph-structured encapsulation strategy to explicitly model the hierarchical evolution of action prediction.<n>Experiments on embodied benchmarks demonstrate that ActDistill achieves comparable or superior performance to full-scale VLA models while reducing computation by over 50% with up to 1.67 times speedup.
arXiv Detail & Related papers (2025-11-22T14:44:03Z)
Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
Self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representation to distill from task-relevant representations only.<n> Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
FlowDistill: Scalable Traffic Flow Prediction via Distillation from LLMs [5.6685153523382015]
FlowDistill is a lightweight traffic prediction framework based on knowledge distillation from large language models (LLMs)<n>Despite its simplicity, FlowDistill consistently outperforms state-of-the-art models in prediction accuracy while requiring significantly less training data.
arXiv Detail & Related papers (2025-04-02T19:54:54Z)
CustomKD: Customizing Large Vision Foundation for Edge Model Improvement via Knowledge Distillation [57.91828170220308]
We propose a knowledge distillation approach, CustomKD, that effectively leverages large vision foundation models (LVFMs) to enhance the performance of edge models.<n>Our simple yet effective CustomKD customizes the well-generalized features inherent in LVFMs to a given student model in order to reduce model discrepancies.
arXiv Detail & Related papers (2025-03-23T23:53:08Z)
Over-parameterized Student Model via Tensor Decomposition Boosted Knowledge Distillation [10.48108719012248]
We focus on Knowledge Distillation (KD), where a compact student model is trained to mimic a larger teacher model. In contrast to much of the previous work, we scale up the parameters of the student model during training.
arXiv Detail & Related papers (2024-11-10T12:40:59Z)
Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them. This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model. OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
Continual Learners are Incremental Model Generalizers [70.34479702177988]
This paper extensively studies the impact of Continual Learning (CL) models as pre-trainers. We find that the transfer quality of the representation often increases gradually without noticeable degradation in fine-tuning performance. We propose a new fine-tuning scheme, GLobal Attention Discretization (GLAD), that preserves rich task-generic representation during solving downstream tasks.
arXiv Detail & Related papers (2023-06-21T05:26:28Z)
Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval [49.01637233471453]
Previous Knowledge Distillation based efficient image retrieval methods employs a lightweight network as the student model for fast inference. We propose a Capacity Dynamic Distillation framework, which constructs a student model with editable representation capacity. Our method has superior inference speed and accuracy, e.g., on the VeRi-776 dataset, given the ResNet101 as a teacher.
arXiv Detail & Related papers (2023-03-16T11:09:22Z)
DQ-BART: Efficient Sequence-to-Sequence Model via Joint Distillation and Quantization [75.72231742114951]
Large-scale pre-trained sequence-to-sequence models like BART and T5 achieve state-of-the-art performance on many generative NLP tasks. These models pose a great challenge in resource-constrained scenarios owing to their large memory requirements and high latency. We propose to jointly distill and quantize the model, where knowledge is transferred from the full-precision teacher model to the quantized and distilled low-precision student model.
arXiv Detail & Related papers (2022-03-21T18:04:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.