DMT: Comprehensive Distillation with Multiple Self-supervised Teachers
- URL: http://arxiv.org/abs/2312.11938v1
- Date: Tue, 19 Dec 2023 08:31:30 GMT
- Title: DMT: Comprehensive Distillation with Multiple Self-supervised Teachers
- Authors: Yuang Liu, Jing Wang, Qiang Zhou, Fan Wang, Jun Wang, Wei Zhang
- Abstract summary: We introduce Comprehensive Distillation with Multiple Self-supervised Teachers (DMT) for pretrained model compression.
Our experimental results on prominent benchmark datasets exhibit that the proposed method significantly surpasses state-of-the-art competitors.
- Score: 27.037140667247208
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Numerous self-supervised learning paradigms, such as contrastive learning and
masked image modeling, have been proposed to acquire powerful and general
representations from unlabeled data. However, these models are commonly
pretrained within their specific framework alone, failing to consider the
complementary nature of visual representations. To tackle this issue, we
introduce Comprehensive Distillation with Multiple Self-supervised Teachers
(DMT) for pretrained model compression, which leverages the strengths of
multiple off-the-shelf self-supervised models. Our experimental results on
prominent benchmark datasets exhibit that the proposed method significantly
surpasses state-of-the-art competitors while retaining favorable efficiency
metrics. On classification tasks, our DMT framework utilizing three different
self-supervised ViT-Base teachers enhances the performance of both small/tiny
models and the base model itself. For dense tasks, DMT elevates the AP/mIoU of
standard SSL models on MS-COCO and ADE20K datasets by 4.0%.
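As an illustration of distilling from multiple teachers, here is a minimal sketch of a generic multi-teacher feature-distillation loss. It is not the DMT objective, which the abstract does not specify: the function names `mse` and `multi_teacher_loss`, the mean-squared-error criterion, and the uniform default weights are all assumptions made for this example.

```python
from typing import Optional, Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def multi_teacher_loss(
    student: Sequence[float],
    teachers: Sequence[Sequence[float]],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted sum of per-teacher MSE terms (hypothetical objective).

    Assumption: each teacher contributes one feature vector of the same
    dimensionality as the student's; weights default to uniform.
    """
    if not teachers:
        raise ValueError("at least one teacher is required")
    if weights is None:
        weights = [1.0 / len(teachers)] * len(teachers)
    if len(weights) != len(teachers):
        raise ValueError("need exactly one weight per teacher")
    return sum(w * mse(student, t) for w, t in zip(weights, teachers))


# Toy example with three teachers, mirroring the three-teacher setup
# mentioned in the abstract (the feature values are made up).
student_feat = [0.0, 1.0]
teacher_feats = [[0.0, 1.0], [1.0, 1.0], [0.0, 3.0]]
loss = multi_teacher_loss(student_feat, teacher_feats)
```

In a real setting the feature vectors would come from the student and the frozen teacher networks, and the weights could be tuned or learned instead of fixed.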
Related papers
- EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free: it requires no data or additional training while still showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- Unlock the Power: Competitive Distillation for Multi-Modal Large Language Models [17.25135606956287]
Competitive Multi-modal Distillation framework (CoMD) captures bidirectional feedback between teacher and student models.
Our experimental analysis of diverse datasets shows that our knowledge transfer method consistently improves the capabilities of the student model.
arXiv Detail & Related papers (2023-11-14T14:49:46Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- DisCo: Distilled Student Models Co-training for Semi-supervised Text Mining [23.418419374791107]
DisCo is a semi-supervised learning framework for fine-tuning a cohort of small student models generated from a large PLM.
We show that DisCo can produce student models that are 7.6 times smaller and 4.8 times faster in inference than the baseline PLMs.
arXiv Detail & Related papers (2023-05-20T03:23:16Z)
- Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning [13.057037169495594]
We propose a Multi-mode Online Knowledge Distillation method (MOKD) to boost self-supervised visual representation learning.
In MOKD, two different models learn collaboratively in a self-supervised manner.
In addition, MOKD also outperforms existing SSL-KD methods for both the student and teacher models.
arXiv Detail & Related papers (2023-04-13T12:55:53Z)
- KDSM: An uplift modeling framework based on knowledge distillation and sample matching [2.036924568983982]
Uplift modeling aims to estimate the treatment effect on individuals.
Tree-based methods are adept at fitting increment and generalization, while neural-network-based models excel at predicting absolute value and precision.
In this paper, we propose an uplift modeling framework based on Knowledge Distillation and Sample Matching (KDSM).
arXiv Detail & Related papers (2023-03-06T09:15:28Z)
- Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z)
- Self-Supervised Monocular Depth Estimation with Self-Reference Distillation and Disparity Offset Refinement [15.012694052674899]
We propose two novel ideas to improve self-supervised monocular depth estimation.
We use a parameter-optimized model, updated as the training epochs progress, as the teacher to provide additional supervision.
We leverage the contextual consistency between high-scale and low-scale features to obtain multiscale disparity offsets.
arXiv Detail & Related papers (2023-02-20T06:28:52Z)
- CTDS: Centralized Teacher with Decentralized Student for Multi-Agent Reinforcement Learning [114.69155066932046]
This work proposes a novel Centralized Teacher with Decentralized Student (CTDS) framework, which consists of a teacher model and a student model.
Specifically, the teacher model allocates the team reward by learning individual Q-values conditioned on global observation.
The student model utilizes the partial observations to approximate the Q-values estimated by the teacher model.
arXiv Detail & Related papers (2022-03-16T06:03:14Z)
- Multi-Task Self-Training for Learning General Representations [97.01728635294879]
Multi-task self-training (MuST) harnesses the knowledge in independent specialized teacher models to train a single general student model.
MuST is scalable with unlabeled or partially labeled datasets and outperforms both specialized supervised models and self-supervised models when training on large scale datasets.
arXiv Detail & Related papers (2021-08-25T17:20:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.