Related papers: AMD: Automatic Multi-step Distillation of Large-scale Vision Models

URL: http://arxiv.org/abs/2407.04208v1
Date: Fri, 5 Jul 2024 01:35:42 GMT
Title: AMD: Automatic Multi-step Distillation of Large-scale Vision Models
Authors: Cheng Han, Qifan Wang, Sohail A. Dianat, Majid Rabbani, Raghuveer M. Rao, Yi Fang, Qiang Guan, Lifu Huang, Dongfang Liu
Abstract summary: We present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression. An efficient and effective optimization framework is introduced to automatically identify the optimal teacher-assistant that leads to the maximal student performance.
Score: 39.70559487432038
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based architectures have become the de-facto standard models for diverse vision tasks owing to their superior performance. As the size of the models continues to scale up, model distillation becomes extremely important in various real applications, particularly on devices limited by computational resources. However, prevailing knowledge distillation methods exhibit diminished efficacy when confronted with a large capacity gap between the teacher and the student, e.g., a 10x compression rate. In this paper, we present a novel approach named Automatic Multi-step Distillation (AMD) for large-scale vision model compression. In particular, our distillation process unfolds across multiple steps. Initially, the teacher undergoes distillation to form an intermediate teacher-assistant model, which is subsequently distilled further to the student. An efficient and effective optimization framework is introduced to automatically identify the optimal teacher-assistant that leads to the maximal student performance. We conduct extensive experiments on multiple image classification datasets, including CIFAR-10, CIFAR-100, and ImageNet. The findings consistently reveal that our approach outperforms several established baselines, paving a path for future knowledge distillation methods on large-scale vision models.
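The two-step idea in the abstract (teacher to teacher-assistant, then teacher-assistant to student) can be illustrated with a minimal sketch. This is not the AMD optimization framework, whose details are not given on this page. It assumes the standard temperature-softened KL distillation loss and swaps in a plain exhaustive search over candidates; the names `kd_loss`, `pick_teacher_assistant`, and `evaluate_student` are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened outputs, scaled by T^2.

    Assumption: the standard soft-label distillation loss, used here
    only as a stand-in for whatever loss the paper uses.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def pick_teacher_assistant(candidates, evaluate_student):
    """Return the candidate whose distilled student scores highest.

    `evaluate_student` is a hypothetical callback that would run
    teacher -> candidate -> student distillation and return the
    student's validation score. Exhaustive search is an assumption
    made for illustration, not the paper's method.
    """
    return max(candidates, key=evaluate_student)
```

In use, `kd_loss` would be applied once with the teacher-assistant in the student role and once with it in the teacher role, and `pick_teacher_assistant` would be called with a list of candidate assistant sizes.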

Related papers

Distilling Lightweight Domain Experts from Large ML Models by Identifying Relevant Subspaces [17.627125013326175]
'SubDistill' is a new distillation algorithm with improved numerical properties that only distills the relevant components of the teacher model at each layer. Our benchmark evaluations are complemented by Explainable AI analyses showing that our distilled student models more closely match the decision structure of the original teacher model.
arXiv Detail & Related papers (2026-01-09T16:28:55Z)
Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced that filters and weights teacher representations so that the student distills from task-relevant representations only. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
Multi-Level Decoupled Relational Distillation for Heterogeneous Architectures [6.231548250160585]
Multi-Level Decoupled Relational Knowledge Distillation (MLDR-KD) improves student model performance, with gains of up to 4.86% on CIFAR-100 and 2.78% on Tiny-ImageNet.
arXiv Detail & Related papers (2025-02-10T06:41:20Z)
Faithful Label-free Knowledge Distillation [8.572967695281054]
This paper presents a label-free knowledge distillation approach called Teacher in the Middle (TinTeM). It produces a more faithful student, which better replicates the behavior of the teacher network across a range of benchmarks testing model robustness, generalisability and out-of-distribution detection.
arXiv Detail & Related papers (2024-11-22T01:48:44Z)
Progressive Distillation Based on Masked Generation Feature Method for Knowledge Graph Completion [29.297959023968165]
This paper proposes a progressive distillation method based on masked generation features for the knowledge graph completion (KGC) task. Specifically, we perform pre-distillation on a pre-trained language model (PLM) to obtain high-quality teacher models, and compress the PLM network to obtain multi-grade student models. The experimental results demonstrate that the model in the pre-distillation stage surpasses the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-01-19T07:34:36Z)
One-Step Diffusion Distillation via Deep Equilibrium Models [64.11782639697883]
We introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image. Our method enables fully offline training with just noise/image pairs from the diffusion model. We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5\times$ larger ViT in terms of FID scores.
arXiv Detail & Related papers (2023-12-12T07:28:40Z)
Education distillation:getting student models to learn in shcools [15.473668050280304]
This paper introduces dynamic incremental learning into knowledge distillation. It is proposed to take fragmented student models divided from the complete student model as lower-grade models.
arXiv Detail & Related papers (2023-11-23T05:20:18Z)
EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR). We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model. We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
Efficient Vision Transformers via Fine-Grained Manifold Distillation [96.50513363752836]
Vision transformer architectures have shown extraordinary performance on many computer vision tasks. Although network performance is boosted, transformers often require more computational resources. We propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches.
arXiv Detail & Related papers (2021-07-03T08:28:34Z)
Knowledge distillation: A good teacher is patient and consistent [71.14922743774864]
There is a growing discrepancy in computer vision between large-scale models that achieve state-of-the-art performance and models that are affordable in practical applications. We identify certain implicit design choices, which may drastically affect the effectiveness of distillation. We obtain a state-of-the-art ResNet-50 model for ImageNet, which achieves 82.8% top-1 accuracy.
arXiv Detail & Related papers (2021-06-09T17:20:40Z)
Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression. Current methods assign a fixed weight to a teacher model throughout distillation, and most existing methods allocate an equal weight to every teacher model. In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
Online Knowledge Distillation via Multi-branch Diversity Enhancement [15.523646047674717]
We propose a new distillation method to enhance the diversity among multiple student models. We use a Feature Fusion Module (FFM), which improves the performance of the attention mechanism in the network. We also use a Diversification (CD) loss function to strengthen the differences between the student models.
arXiv Detail & Related papers (2020-10-02T05:52:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.