Sparse Distillation: Speeding Up Text Classification by Using Bigger Models
- URL: http://arxiv.org/abs/2110.08536v1
- Date: Sat, 16 Oct 2021 10:04:14 GMT
- Title: Sparse Distillation: Speeding Up Text Classification by Using Bigger Models
- Authors: Qinyuan Ye, Madian Khabsa, Mike Lewis, Sinong Wang, Xiang Ren, Aaron Jaech
- Abstract summary: Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time.
In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model.
Our experiments show that the student models retain 97% of the RoBERTa-Large teacher performance on a collection of six text classification tasks.
- Score: 49.8019791766848
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Distilling state-of-the-art transformer models into lightweight student
models is an effective way to reduce computation cost at inference time.
However, the improved inference speed may still be unsatisfactory for certain
time-sensitive applications. In this paper, we aim to further push the limit of
inference speed by exploring a new area in the design space of the student
model. More specifically, we consider distilling a transformer-based text
classifier into a billion-parameter, sparsely-activated student model with an
embedding-averaging architecture. Our experiments show that the student models
retain 97% of the RoBERTa-Large teacher performance on a collection of six text
classification tasks. Meanwhile, the student model achieves up to 600x speed-up
on both GPUs and CPUs, compared to the teacher models. Further investigation
shows that our pipeline is also effective in privacy-preserving and domain
generalization settings.
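As a rough illustration of the student described in the abstract, the sketch below pairs an n-gram embedding-averaging classifier with a standard soft-label distillation loss. This is a minimal sketch, not the authors' code: the table size, embedding dimension, hashing of n-grams into bucket ids, and the exact loss are illustrative assumptions (the paper's student is far larger, with billions of parameters).

```python
# Minimal sketch (not the authors' released code) of an embedding-averaging
# student: a large n-gram embedding table whose looked-up rows are averaged and
# fed to a linear classifier, trained against the teacher's soft labels.
import torch.nn as nn
import torch.nn.functional as F

class EmbeddingAveragingStudent(nn.Module):
    def __init__(self, num_buckets=1_000_000, embed_dim=64, num_classes=2):
        super().__init__()
        # Only the rows hit by an input's n-grams are touched per example,
        # so the model is sparsely activated despite its parameter count.
        self.embedding = nn.EmbeddingBag(num_buckets, embed_dim, mode="mean", sparse=True)
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, ngram_ids, offsets):
        # ngram_ids: 1-D tensor of hashed n-gram ids for the whole batch;
        # offsets: start index of each example within ngram_ids.
        pooled = self.embedding(ngram_ids, offsets)  # per-example embedding average
        return self.classifier(pooled)               # class logits

def soft_label_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Standard KD objective: match the teacher's temperature-softened distribution.
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)
```

In practice, `ngram_ids` would be produced by hashing each input's word n-grams into the table (the hashing scheme is an assumption here), and the distillation term could be mixed with a cross-entropy term on gold labels.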
Related papers
- Optimizing Parking Space Classification: Distilling Ensembles into Lightweight Classifiers [0.0]
We propose a robust ensemble of classifiers to serve as Teacher models in image-based parking space classification.
These Teacher models are distilled into lightweight and specialized Student models that can be deployed directly on edge devices.
Our results show that the Student models, with 26 times fewer parameters than the Teacher models, achieved an average accuracy of 96.6% on the target test datasets.
arXiv Detail & Related papers (2024-10-07T20:29:42Z)
- General Compression Framework for Efficient Transformer Object Tracking [26.42022701164278]
We propose a general model compression framework for efficient transformer object tracking, named CompressTracker.
Our approach features a novel stage division strategy that segments the transformer layers of the teacher model into distinct stages.
Our framework CompressTracker is structurally agnostic, making it compatible with any transformer architecture.
arXiv Detail & Related papers (2024-09-26T06:27:15Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods across various model architectures and sizes, reducing training time by up to a factor of four.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval [49.01637233471453]
Previous knowledge distillation-based efficient image retrieval methods employ a lightweight network as the student model for fast inference.
We propose a Capacity Dynamic Distillation framework, which constructs a student model with editable representation capacity.
Our method achieves superior inference speed and accuracy, e.g., on the VeRi-776 dataset with ResNet101 as the teacher.
arXiv Detail & Related papers (2023-03-16T11:09:22Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to asymmetric students 1/10th the size that retain 95-97% of the teacher performance.
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
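The geometry-based objective in the entry above can be pictured with a simple similarity-matching term: push the student's query-document similarity matrix toward the teacher's. The sketch below is an illustration of that general idea, not the paper's exact loss.

```python
# Illustrative geometry-matching term (not the paper's exact objective):
# match the student's pairwise query-document similarities to the teacher's.
import torch.nn.functional as F

def geometry_matching_loss(student_q, student_d, teacher_q, teacher_d):
    """Each argument is a [batch, dim] embedding matrix; student and teacher
    dimensions may differ, since only similarity scores are compared."""
    s_sim = F.normalize(student_q, dim=-1) @ F.normalize(student_d, dim=-1).T
    t_sim = F.normalize(teacher_q, dim=-1) @ F.normalize(teacher_d, dim=-1).T
    return F.mse_loss(s_sim, t_sim)  # align relative query-document geometry
```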
- Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR [12.953149757081025]
We focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR).
We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student.
For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech using Noisy Student Training with soft-target distillation.
arXiv Detail & Related papers (2022-10-11T21:32:34Z)
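To make the soft/hard distinction in the entry above concrete, the sketch below contrasts the two targets for a generic classifier, leaving out RNN-T lattice details; it is an illustrative simplification, not the paper's recipe.

```python
# Generic contrast between the two distillation targets (RNN-T specifics omitted):
# soft targets match the teacher's full output distribution, hard targets train
# on the teacher's 1-best (argmax) labels.
import torch.nn.functional as F

def soft_target_loss(student_logits, teacher_logits, temperature=1.0):
    t = temperature
    return F.kl_div(
        F.log_softmax(student_logits / t, dim=-1),
        F.softmax(teacher_logits / t, dim=-1),
        reduction="batchmean",
    ) * (t * t)

def hard_target_loss(student_logits, teacher_logits):
    pseudo_labels = teacher_logits.argmax(dim=-1)  # teacher's hard decisions
    return F.cross_entropy(student_logits, pseudo_labels)
```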
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to a teacher model throughout distillation; most existing methods allocate an equal weight to every teacher model.
In this paper, we observe that, due to the complexity of training examples and differences in student model capability, learning differentially from teacher models can lead to better performance of the distilled student models.
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
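At its core, the quantity being tuned in the entry above is a per-example weighted sum of per-teacher distillation losses. The sketch below shows only that weighted sum; the reinforcement-learning policy that produces the weights in the paper is not reproduced here.

```python
# Sketch of a per-example, per-teacher weighted distillation loss. The weights
# would come from a learned selection policy in the paper; here they are inputs.
import torch
import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights, temperature=2.0):
    """
    student_logits: [batch, num_classes]
    teacher_logits_list: list of [batch, num_classes] tensors, one per teacher
    weights: [num_teachers, batch] tensor (uniform weights recover equal weighting)
    """
    t = temperature
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    total = torch.zeros((), device=student_logits.device)
    for k, teacher_logits in enumerate(teacher_logits_list):
        p_teacher = F.softmax(teacher_logits / t, dim=-1)
        per_example = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
        total = total + (weights[k] * per_example).mean() * (t * t)
    return total
```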
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across student models of different parameter sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
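The last-layer self-attention transfer described in the MiniLM entry can be approximated by the simplified sketch below, which matches teacher and student attention distributions with a KL term; MiniLM's value-relation transfer and handling of mismatched head counts are omitted, so this is illustrative only.

```python
# Simplified last-layer self-attention distillation: KL divergence between
# teacher and student attention distributions. Assumes both models use the
# same number of attention heads; value-relation transfer is omitted.
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_attn, eps=1e-12):
    """Both inputs: [batch, heads, seq_len, seq_len] attention probabilities
    taken from the last transformer layer of each model."""
    return F.kl_div((student_attn + eps).log(), teacher_attn, reduction="batchmean")
```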