Sparse Distillation: Speeding Up Text Classification by Using Bigger
Models
- URL: http://arxiv.org/abs/2110.08536v1
- Date: Sat, 16 Oct 2021 10:04:14 GMT
- Title: Sparse Distillation: Speeding Up Text Classification by Using Bigger
Models
- Authors: Qinyuan Ye, Madian Khabsa, Mike Lewis, Sinong Wang, Xiang Ren, Aaron
Jaech
- Abstract summary: Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time.
In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model.
Our experiments show that the student models retain 97% of the RoBERTa-Large teacher performance on a collection of six text classification tasks.
- Score: 49.8019791766848
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Distilling state-of-the-art transformer models into lightweight student
models is an effective way to reduce computation cost at inference time.
However, the improved inference speed may still be unsatisfactory for certain
time-sensitive applications. In this paper, we aim to further push the limit of
inference speed by exploring a new area in the design space of the student
model. More specifically, we consider distilling a transformer-based text
classifier into a billion-parameter, sparsely-activated student model with an
embedding-averaging architecture. Our experiments show that the student models
retain 97% of the RoBERTa-Large teacher performance on a collection of six text
classification tasks. Meanwhile, the student model achieves up to 600x speed-up
on both GPUs and CPUs, compared to the teacher models. Further investigation
shows that our pipeline is also effective in privacy-preserving and domain
generalization settings.
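As a rough illustration of the student described above, here is a minimal PyTorch sketch of an embedding-averaging classifier trained on the teacher's soft predictions. The class and function names, the hashed n-gram vocabulary, the embedding dimension, and the temperature are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EmbeddingAveragingStudent(nn.Module):
    """Sparsely-activated student: a large n-gram embedding table whose
    looked-up rows are averaged and fed to a linear classifier."""

    def __init__(self, ngram_vocab_size: int, embed_dim: int, num_classes: int):
        super().__init__()
        # Only the rows for n-grams present in an input are touched,
        # so inference is cheap despite the large parameter count.
        self.embed = nn.EmbeddingBag(ngram_vocab_size, embed_dim,
                                     mode="mean", sparse=True)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, ngram_ids: torch.Tensor, offsets: torch.Tensor) -> torch.Tensor:
        # Average the embeddings of the input's n-grams, then classify.
        return self.classifier(self.embed(ngram_ids, offsets))


def soft_distillation_loss(student_logits: torch.Tensor,
                           teacher_logits: torch.Tensor,
                           temperature: float = 2.0) -> torch.Tensor:
    # Train the student to match the fine-tuned teacher's softened predictions.
    t = temperature
    return F.kl_div(F.log_softmax(student_logits / t, dim=-1),
                    F.softmax(teacher_logits / t, dim=-1),
                    reduction="batchmean") * (t * t)
```

With a very large `ngram_vocab_size`, almost all parameters sit in the embedding table, but each forward pass reads only the rows for n-grams that actually occur in the input, which is what makes the student fast on both GPUs and CPUs.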
Related papers
- Improving Neural Topic Models with Wasserstein Knowledge Distillation [0.8962460460173959]
We propose a knowledge distillation framework to compress a contextualized topic model without loss in topic quality.
Experiments show that the student trained with knowledge distillation achieves topic coherence much higher than that of the original student model.
arXiv Detail & Related papers (2023-03-27T16:07:44Z)
- Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval [49.01637233471453]
Previous knowledge-distillation-based efficient image retrieval methods employ a lightweight network as the student model for fast inference.
We propose a Capacity Dynamic Distillation framework, which constructs a student model with editable representation capacity.
Our method achieves superior inference speed and accuracy, e.g., on the VeRi-776 dataset with ResNet101 as the teacher.
arXiv Detail & Related papers (2023-03-16T11:09:22Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR [12.953149757081025]
We focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR).
We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student.
For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech using Noisy Student Training with soft target distillation.
arXiv Detail & Related papers (2022-10-11T21:32:34Z)
- Extreme compression of sentence-transformer ranker models: faster inference, longer battery life, and less storage on edge devices [1.3854111346209868]
We propose two extensions for a popular sentence-transformer distillation procedure to reduce memory requirements and energy consumption.
We evaluate these extensions on two different types of ranker models.
arXiv Detail & Related papers (2022-06-29T08:07:09Z)
- Ultra Fast Speech Separation Model with Teacher Student Learning [44.71171732510265]
An ultra-fast Transformer model is proposed to achieve better performance and efficiency with teacher-student learning (T-S learning).
Compared with the small Transformer model trained from scratch, the proposed T-S learning method reduces the word error rate (WER) by more than 5% for both multi-channel and single-channel speech separation.
arXiv Detail & Related papers (2022-04-27T09:02:45Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model for the whole distillation, and most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and the differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
- Autoregressive Knowledge Distillation through Imitation Learning [70.12862707908769]
We develop a compression technique for autoregressive models driven by an imitation learning perspective on knowledge distillation.
Our method consistently outperforms other distillation algorithms, such as sequence-level knowledge distillation.
Student models trained with our method attain BLEU/ROUGE scores 1.4 to 4.8 points higher than those trained from scratch, while increasing inference speed by up to 14 times compared to the teacher model.
arXiv Detail & Related papers (2020-09-15T17:43:02Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model parameter sizes (see the sketch following this list).
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
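To make the self-attention transfer in the MiniLM entry above more concrete, here is a minimal sketch of matching a student's last-layer attention distributions to a teacher's. The tensor shapes are assumptions for illustration, and the value-relation transfer used in the full method is omitted.

```python
import torch


def self_attention_distillation_loss(student_attn: torch.Tensor,
                                     teacher_attn: torch.Tensor) -> torch.Tensor:
    """KL divergence between teacher and student self-attention distributions.

    Both tensors are assumed to have shape [batch, heads, seq_len, seq_len],
    where each row along the last dimension is a probability distribution
    over attended positions (last Transformer layer only, MiniLM-style).
    """
    eps = 1e-12
    kl = teacher_attn * (torch.log(teacher_attn + eps) - torch.log(student_attn + eps))
    # Sum over attended positions, then average over batch, heads, and queries.
    return kl.sum(dim=-1).mean()
```

Because only the last layer's attention is matched, the student is free to use a different depth and hidden size from the teacher, which is what makes this form of distillation task-agnostic.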