DisCo: Distilled Student Models Co-training for Semi-supervised Text
Mining
- URL: http://arxiv.org/abs/2305.12074v3
- Date: Fri, 20 Oct 2023 02:11:16 GMT
- Title: DisCo: Distilled Student Models Co-training for Semi-supervised Text
Mining
- Authors: Weifeng Jiang, Qianren Mao, Chenghua Lin, Jianxin Li, Ting Deng, Weiyi
Yang and Zheng Wang
- Abstract summary: DisCo is a semi-supervised learning framework for fine-tuning a cohort of small student models generated from a large PLM.
We show that DisCo can produce student models that are 7.6 times smaller and 4.8 times faster in inference than the baseline PLMs.
- Score: 23.418419374791107
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many text mining models are constructed by fine-tuning a large deep
pre-trained language model (PLM) on downstream tasks. However, a significant
challenge nowadays is maintaining performance when using a lightweight model
with limited labelled samples. We present DisCo, a semi-supervised learning
(SSL) framework for fine-tuning a cohort of small student models generated from
a large PLM using knowledge distillation. Our key insight is to share
complementary knowledge among distilled student cohorts to promote their SSL
effectiveness. DisCo employs a novel co-training technique to optimize a cohort
of multiple small student models by promoting knowledge sharing among students
under diversified views: model views produced by different distillation
strategies and data views produced by various input augmentations. We evaluate
DisCo on both semi-supervised text classification and extractive summarization
tasks. Experimental results show that DisCo can produce student models that are
7.6 times smaller and 4.8 times faster in inference than the baseline PLMs
while maintaining comparable performance. We also show that DisCo-generated
student models outperform similar-sized models that are elaborately tuned for
the distinct tasks.
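The co-training objective described above can be made concrete with a small sketch. The following is a minimal, hypothetical PyTorch-style illustration of one training step for a cohort of two distilled students: each student is supervised on the few labelled samples and additionally matches the other student's predictions on a differently augmented view of the unlabelled data. The model names, augmentation functions, and loss weighting below are illustrative assumptions, not the authors' exact implementation.

    # Minimal sketch of a DisCo-style co-training step (assumes PyTorch).
    # `student_a`/`student_b` are two students distilled from the same PLM
    # with different strategies (model views); `augment_a`/`augment_b`
    # produce different data views. All names here are hypothetical.
    import torch.nn.functional as F

    def co_training_step(student_a, student_b, labeled_x, labels,
                         unlabeled_x, augment_a, augment_b, alpha=1.0):
        # Supervised loss: both students learn from the limited labelled set.
        sup_loss = (F.cross_entropy(student_a(labeled_x), labels) +
                    F.cross_entropy(student_b(labeled_x), labels))

        # Data views: each student sees a different augmentation of the
        # same unlabelled batch.
        logits_a = student_a(augment_a(unlabeled_x))
        logits_b = student_b(augment_b(unlabeled_x))

        # Knowledge sharing: each student matches the other's (detached)
        # predicted distribution, so complementary knowledge flows both ways.
        consistency = (
            F.kl_div(F.log_softmax(logits_a, dim=-1),
                     F.softmax(logits_b, dim=-1).detach(),
                     reduction="batchmean") +
            F.kl_div(F.log_softmax(logits_b, dim=-1),
                     F.softmax(logits_a, dim=-1).detach(),
                     reduction="batchmean"))

        return sup_loss + alpha * consistency

In practice the two students are optimized jointly on this combined loss, with alpha trading off supervised fitting against cross-student agreement.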
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z) - Interactive DualChecker for Mitigating Hallucinations in Distilling Large Language Models [7.632217365130212]
Large Language Models (LLMs) have demonstrated exceptional capabilities across various machine learning (ML) tasks.
These models can produce hallucinations, particularly in domains with incomplete knowledge.
We introduce DualChecker, an innovative framework designed to mitigate hallucinations and improve the performance of both teacher and student models.
arXiv Detail & Related papers (2024-08-22T12:04:04Z) - LLAVADI: What Matters For Multimodal Large Language Models Distillation [77.73964744238519]
In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch.
Our studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process.
By evaluating on different benchmarks with proper training strategies, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters.
arXiv Detail & Related papers (2024-07-28T06:10:47Z) - Improving In-context Learning via Bidirectional Alignment [41.214003703218914]
Large language models (LLMs) have shown impressive few-shot generalization on many tasks via in-context learning (ICL).
We propose Bidirectional Alignment (BiAlign) to fully leverage the models' preferences for ICL examples to improve the ICL abilities of student models.
Specifically, we introduce the alignment of input preferences between student and teacher models by incorporating a novel ranking loss.
arXiv Detail & Related papers (2023-12-28T15:02:03Z) - Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z) - Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z) - Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e. ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
arXiv Detail & Related papers (2022-07-04T14:08:59Z) - Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout the distillation process, and most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models (see the sketch after this list).
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
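Several of the papers above build on the standard soft-target distillation objective, and the last entry argues for weighting multiple teachers differentially rather than equally. As a point of reference, here is a minimal, hypothetical PyTorch sketch of a temperature-scaled, weighted multi-teacher distillation loss; the fixed per-teacher weights stand in for whatever selection policy a given method learns, and the reinforced selection itself is not implemented here.

    # Minimal sketch of a weighted multi-teacher distillation loss (PyTorch).
    # `weights` is a per-teacher weighting; in the paper above it would be
    # produced by a learned (reinforced) selection policy, which this
    # illustrative sketch does not implement.
    import torch.nn.functional as F

    def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights,
                              temperature=2.0):
        # Student's softened log-probabilities at the chosen temperature.
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        loss = 0.0
        for w, t_logits in zip(weights, teacher_logits_list):
            # Each teacher contributes a KL term scaled by its weight.
            p_teacher = F.softmax(t_logits / temperature, dim=-1)
            loss = loss + w * F.kl_div(log_p_student, p_teacher,
                                       reduction="batchmean")
        # The T^2 factor keeps gradients comparable across temperature settings.
        return loss * temperature ** 2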