Sparse Distillation: Speeding Up Text Classification by Using Bigger Models
- URL: http://arxiv.org/abs/2110.08536v1
- Date: Sat, 16 Oct 2021 10:04:14 GMT
- Title: Sparse Distillation: Speeding Up Text Classification by Using Bigger Models
- Authors: Qinyuan Ye, Madian Khabsa, Mike Lewis, Sinong Wang, Xiang Ren, Aaron Jaech
- Abstract summary: Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time.
In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model.
Our experiments show that the student models retain 97% of the RoBERTa-Large teacher performance on a collection of six text classification tasks.
- Score: 49.8019791766848
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Distilling state-of-the-art transformer models into lightweight student
models is an effective way to reduce computation cost at inference time.
However, the improved inference speed may still be unsatisfactory for certain
time-sensitive applications. In this paper, we aim to further push the limit of
inference speed by exploring a new area in the design space of the student
model. More specifically, we consider distilling a transformer-based text
classifier into a billion-parameter, sparsely-activated student model with an
embedding-averaging architecture. Our experiments show that the student models
retain 97% of the RoBERTa-Large teacher performance on a collection of six text
classification tasks. Meanwhile, the student model achieves up to 600x speed-up
on both GPUs and CPUs, compared to the teacher models. Further investigation
shows that our pipeline is also effective in privacy-preserving and domain
generalization settings.
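As a rough illustration of the student described in the abstract, the sketch below pairs an n-gram embedding-averaging classifier with a standard soft-label distillation loss. This is a minimal sketch, not the authors' code: the table size, embedding dimension, hashing of n-grams into bucket ids, and the exact loss are illustrative assumptions (the paper's student is far larger, with billions of parameters).

```python
# Minimal sketch (not the authors' released code) of an embedding-averaging
# student: a large n-gram embedding table whose looked-up rows are averaged and
# fed to a linear classifier, trained against the teacher's soft labels.
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAveragingStudent(nn.Module):
    def __init__(self, num_buckets=1_000_000, embed_dim=64, num_classes=2):
        super().__init__()
        # Only the rows hit by an input's n-grams are touched per example,
        # so the model is sparsely activated despite its parameter count.
        self.embedding = nn.EmbeddingBag(num_buckets, embed_dim, mode="mean", sparse=True)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, ngram_ids, offsets):
        # ngram_ids: 1-D tensor of hashed n-gram ids for the whole batch;
        # offsets: start index of each example within ngram_ids.
        pooled = self.embedding(ngram_ids, offsets)  # per-example embedding average
        return self.classifier(pooled)               # class logits

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Standard KD objective: match the teacher's temperature-softened distribution.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

In practice, `ngram_ids` would be produced by hashing each input's word n-grams into the table (the hashing scheme is an assumption here), and the distillation term could be mixed with a cross-entropy term on gold labels.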
Related papers
- Optimizing Parking Space Classification: Distilling Ensembles into Lightweight Classifiers [0.0]
We propose a robust ensemble of classifiers to serve as Teacher models in image-based parking space classification.
These Teacher models are distilled into lightweight and specialized Student models that can be deployed directly on edge devices.
Our results show that the Student models, with 26 times fewer parameters than the Teacher models, achieved an average accuracy of 96.6% on the target test datasets.
arXiv Detail & Related papers (2024-10-07T20:29:42Z)
- General Compression Framework for Efficient Transformer Object Tracking [26.42022701164278]
We propose a general model compression framework for efficient transformer object tracking, named CompressTracker.
Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages.
Our framework CompressTracker is structurally agnostic, making it compatible with any transformer architecture.
arXiv Detail & Related papers (2024-09-26T06:27:15Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods across various model architectures and sizes, reducing training time by up to a factor of four.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval [49.01637233471453]
Previous knowledge distillation-based efficient image retrieval methods employ a lightweight network as the student model for fast inference.
We propose a Capacity Dynamic Distillation framework, which constructs a student model with editable representation capacity.
Our method achieves superior inference speed and accuracy, e.g., on the VeRi-776 dataset with ResNet101 as the teacher.
arXiv Detail & Related papers (2023-03-16T11:09:22Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to asymmetric students 1/10th the size that retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
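The geometry-based objective in the entry above can be pictured with a simple similarity-matching term: push the student's query-document similarity matrix toward the teacher's. The sketch below is an illustration of that general idea, not the paper's exact loss.

```python
# Illustrative geometry-matching term (not the paper's exact objective):
# match the student's pairwise query-document similarities to the teacher's.
import torch.nn.functional as F

def geometry_matching_loss(student_q, student_d, teacher_q, teacher_d):
    """Each argument is a [batch, dim] embedding matrix; student and teacher
    dimensions may differ, since only similarity scores are compared."""
    s_sim = F.normalize(student_q, dim=-1) @ F.normalize(student_d, dim=-1).T
    t_sim = F.normalize(teacher_q, dim=-1) @ F.normalize(teacher_d, dim=-1).T
    return F.mse_loss(s_sim, t_sim)  # align relative query-document geometry
```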
- Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR [12.953149757081025]
We focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR).
We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student.
For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech using Noisy Student Training with soft-target distillation.
arXiv Detail & Related papers (2022-10-11T21:32:34Z)
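To make the soft/hard distinction in the entry above concrete, the sketch below contrasts the two targets for a generic classifier, leaving out RNN-T lattice details; it is an illustrative simplification, not the paper's recipe.

```python
# Generic contrast between the two distillation targets (RNN-T specifics omitted):
# soft targets match the teacher's full output distribution, hard targets train
# on the teacher's 1-best (argmax) labels.
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def hard_target_loss(student_logits, teacher_logits):
    pseudo_labels = teacher_logits.argmax(dim=-1)  # teacher's hard decisions
    return F.cross_entropy(student_logits, pseudo_labels)
```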
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout distillation; most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
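At its core, the quantity being tuned in the entry above is a per-example weighted sum of per-teacher distillation losses. The sketch below shows only that weighted sum; the reinforcement-learning policy that produces the weights in the paper is not reproduced here.

```python
# Sketch of a per-example, per-teacher weighted distillation loss. The weights
# would come from a learned selection policy in the paper; here they are inputs.
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights, temperature=2.0):
    """
    student_logits: [batch, num_classes]
    teacher_logits_list: list of [batch, num_classes] tensors, one per teacher
    weights: [num_teachers, batch] tensor (uniform weights recover equal weighting)
    """
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    total = torch.zeros((), device=student_logits.device)
    for k, teacher_logits in enumerate(teacher_logits_list):
        p_teacher = F.softmax(teacher_logits / t, dim=-1)
        per_example = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
        total = total + (weights[k] * per_example).mean() * (t * t)
    return total
```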
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across student models of different parameter sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
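The last-layer self-attention transfer described in the MiniLM entry can be approximated by the simplified sketch below, which matches teacher and student attention distributions with a KL term; MiniLM's value-relation transfer and handling of mismatched head counts are omitted, so this is illustrative only.

```python
# Simplified last-layer self-attention distillation: KL divergence between
# teacher and student attention distributions. Assumes both models use the
# same number of attention heads; value-relation transfer is omitted.
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-12):
    """Both inputs: [batch, heads, seq_len, seq_len] attention probabilities
    taken from the last transformer layer of each model."""
    return F.kl_div((student_attn + eps).log(), teacher_attn, reduction="batchmean")
```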