Extreme compression of sentence-transformer ranker models: faster
inference, longer battery life, and less storage on edge devices
- URL: http://arxiv.org/abs/2207.12852v1
- Date: Wed, 29 Jun 2022 08:07:09 GMT
- Title: Extreme compression of sentence-transformer ranker models: faster
inference, longer battery life, and less storage on edge devices
- Authors: Amit Chaulwar, Lukas Malik, Maciej Krajewski, Felix Reichel,
Leif-Nissen Lundbæk, Michael Huth and Bartlomiej Matejczyk
- Abstract summary: We propose two extensions for a popular sentence-transformer distillation procedure to reduce memory requirements and energy consumption.
We evaluate these extensions on two different types of ranker models.
- Score: 1.3854111346209868
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Modern search systems use several large ranker models with transformer
architectures. These models require large computational resources and are not
suitable for usage on devices with limited computational resources. Knowledge
distillation is a popular compression technique that can reduce the resource
needs of such models, where a large teacher model transfers knowledge to a
small student model. To drastically reduce memory requirements and energy
consumption, we propose two extensions for a popular sentence-transformer
distillation procedure: generation of an optimal size vocabulary and
dimensionality reduction of the embedding dimension of teachers prior to
distillation. We evaluate these extensions on two different types of ranker
models. The result is extremely compressed student models whose evaluation on a
test dataset demonstrates the significance and utility of our proposed extensions.
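The second extension, shrinking the teacher's embedding dimension before distillation, can be sketched with the sentence-transformers library: fit PCA on teacher sentence embeddings, append a linear projection to the teacher, then train the student to regress onto the reduced embeddings. The snippet below is a minimal sketch under assumed model names, target dimension, and corpus; it is not the authors' exact pipeline, and their optimal-size vocabulary generation is only indicated by a comment.

```python
# Minimal sketch: reduce a teacher sentence-transformer's output dimension with PCA,
# then distill it into a small student by regressing student embeddings onto the
# reduced teacher embeddings (MSE). Model names, dims and corpus are assumptions.
import torch
from torch.utils.data import DataLoader
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer, models, losses, InputExample

sentences = [f"placeholder training sentence {i}" for i in range(1000)]  # stand-in corpus

# 1) Teacher with reduced embedding dimension: fit PCA on teacher embeddings and
#    append a frozen linear projection so the teacher now emits 128-d vectors.
teacher = SentenceTransformer("paraphrase-MiniLM-L6-v2")            # assumed teacher
teacher_emb = teacher.encode(sentences, convert_to_numpy=True)
pca = PCA(n_components=128)
pca.fit(teacher_emb)
proj = models.Dense(
    in_features=teacher.get_sentence_embedding_dimension(),
    out_features=128,
    bias=False,
    activation_function=torch.nn.Identity(),
)
proj.linear.weight = torch.nn.Parameter(torch.tensor(pca.components_, dtype=torch.float32))
teacher.add_module("pca_projection", proj)

# 2) Small student emitting the same 128-d space. The paper's optimal-size
#    vocabulary would enter here, by building the student backbone on top of a
#    tokenizer retrained with a smaller vocabulary (not shown).
backbone = models.Transformer("google/bert_uncased_L-2_H-128_A-2")  # assumed tiny student
pooling = models.Pooling(backbone.get_word_embedding_dimension())
student = SentenceTransformer(modules=[backbone, pooling])

# 3) Distillation: student embeddings are regressed onto the reduced teacher embeddings.
targets = teacher.encode(sentences, convert_to_numpy=True)           # now 128-d
examples = [InputExample(texts=[s], label=t.tolist()) for s, t in zip(sentences, targets)]
loader = DataLoader(examples, shuffle=True, batch_size=16)
student.fit(train_objectives=[(loader, losses.MSELoss(model=student))],
            epochs=1, warmup_steps=10)
```

In the paper both levers, the smaller vocabulary and the smaller embedding dimension, are applied to ranker models; the sketch only marks where each would plug in.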
Related papers
- Prompt-Based Exemplar Super-Compression and Regeneration for
Class-Incremental Learning [22.676222987218555]
The exemplar super-compression and regeneration method, ESCORT, substantially increases the quantity and enhances the diversity of exemplars.
To minimize the domain gap between generated exemplars and real images, we propose partial compression and diffusion-based data augmentation.
arXiv Detail & Related papers (2023-11-30T05:59:31Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of
Language Model [92.55145016562867]
We propose a new family of unbiased estimators, WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
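The WTA-CRS estimator itself is not detailed in this summary; as an illustrative baseline, the sketch below implements the standard column-row sampling (CRS) estimator that such work builds on, which approximates a matrix product by sampling column-row pairs in proportion to their norm product. The winner-take-all selection rule is not reproduced here.

```python
# Sketch of plain column-row sampling (CRS): an unbiased estimator of A @ B that
# samples k column/row pairs with probability proportional to ||A[:, i]|| * ||B[i, :]||.
# WTA-CRS changes how pairs are selected to reduce variance; that is not shown here.
import numpy as np

def crs_matmul(A: np.ndarray, B: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    n = A.shape[1]                       # shared (inner) dimension
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()              # sampling distribution over column/row pairs
    idx = rng.choice(n, size=k, replace=True, p=p)
    # Rescale each sampled outer product by 1 / (k * p_i) so the estimate is unbiased.
    est = np.zeros((A.shape[0], B.shape[1]))
    for i in idx:
        est += np.outer(A[:, i], B[i, :]) / (k * p[i])
    return est

A, B = np.random.randn(64, 512), np.random.randn(512, 32)
approx = crs_matmul(A, B, k=128)
print(np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B))  # relative error of the estimate
```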
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Multi-stage Progressive Compression of Conformer Transducer for
On-device Speech Recognition [7.450574974954803]
Limited memory bandwidth in smart devices prompts the development of smaller Automatic Speech Recognition (ASR) models.
Knowledge distillation (KD) is a popular model compression approach that has been shown to achieve smaller model sizes.
We propose a multi-stage progressive approach to compress the conformer transducer model using KD.
arXiv Detail & Related papers (2022-10-01T02:23:00Z) - Ensemble Transformer for Efficient and Accurate Ranking Tasks: an
Application to Question Answering Systems [99.13795374152997]
We propose a neural network designed to distill an ensemble of large transformers into a single smaller model.
An MHS model consists of two components: a stack of transformer layers used to encode inputs, and a set of ranking heads.
Unlike traditional distillation techniques, our approach leverages individual models in ensemble as teachers in a way that preserves the diversity of the ensemble members.
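The summary only names the two components; a minimal PyTorch sketch of such an architecture, with an assumed backbone checkpoint and head count and with the ensemble-distillation training objective omitted, might look as follows (each ranking head would be matched to one teacher of the ensemble).

```python
# Sketch of a multi-head student ranker: one shared transformer encoder plus a set of
# ranking heads, each intended to mimic a different teacher from the ensemble.
# Backbone checkpoint, head count and dimensions are assumptions; training is not shown.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class MultiHeadStudentRanker(nn.Module):
    def __init__(self, backbone: str = "prajjwal1/bert-tiny", num_heads: int = 4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)        # shared transformer stack
        hidden = self.encoder.config.hidden_size
        # One scalar-scoring head per ensemble teacher, preserving their diversity.
        self.ranking_heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(num_heads))

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        scores = torch.stack([head(cls).squeeze(-1) for head in self.ranking_heads], dim=-1)
        return scores, scores.mean(dim=-1)    # per-head scores and their ensemble average

tok = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
batch = tok("what is knowledge distillation?",
            "a teacher-student compression method", return_tensors="pt")
model = MultiHeadStudentRanker()
per_head_scores, fused_score = model(batch["input_ids"], batch["attention_mask"])
```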
arXiv Detail & Related papers (2022-01-15T06:21:01Z) - Sparse Distillation: Speeding Up Text Classification by Using Bigger
Models [49.8019791766848]
Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time.
In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model.
Our experiments show that the student models retain 97% of the RoBERTa-Large teacher performance on a collection of six text classification tasks.
arXiv Detail & Related papers (2021-10-16T10:04:14Z) - TERA: Self-Supervised Learning of Transformer Encoder Representation for
Speech [63.03318307254081]
TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representations extraction or fine-tuning with downstream models.
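The three alteration axes are not listed in this summary; in TERA they are time, channel (frequency), and magnitude alterations of acoustic features, with the encoder trained to reconstruct the clean input. The sketch below applies simple stand-in alterations to a placeholder log-mel feature matrix and uses an L1 reconstruction loss; mask widths, noise scale, and the toy encoder are illustrative assumptions, not the paper's settings.

```python
# Sketch of TERA-style alteration pre-training: corrupt log-mel features along time,
# channel and magnitude axes, then train an encoder to reconstruct the clean frames.
import torch
from torch import nn

def alter(x: torch.Tensor, time_width=8, num_channels=10, noise_std=0.2) -> torch.Tensor:
    x = x.clone()
    T, F = x.shape
    t0 = torch.randint(0, max(T - time_width, 1), (1,)).item()
    x[t0:t0 + time_width, :] = 0.0                  # time alteration: mask a span of frames
    c0 = torch.randint(0, max(F - num_channels, 1), (1,)).item()
    x[:, c0:c0 + num_channels] = 0.0                # channel alteration: mask a frequency band
    return x + noise_std * torch.randn_like(x)      # magnitude alteration: additive noise

features = torch.randn(200, 80)                     # placeholder log-mel spectrogram (T x F)
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))  # stand-in encoder
recon = encoder(alter(features))
loss = nn.functional.l1_loss(recon, features)       # reconstruct the unaltered frames
loss.backward()
```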
arXiv Detail & Related papers (2020-07-12T16:19:00Z) - Knowledge Distillation: A Survey [87.51063304509067]
Deep neural networks have been successful in both industry and academia, especially for computer vision tasks.
It is a challenge to deploy these cumbersome deep models on devices with limited resources.
Knowledge distillation effectively learns a small student model from a large teacher model.
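For reference, the simplest form of this teacher-to-student transfer is the classic soft-target objective surveyed in the paper: a temperature-scaled KL term between teacher and student logits combined with the usual hard-label cross-entropy. The sketch below uses illustrative temperature and weighting values.

```python
# Sketch of the classic soft-target knowledge-distillation loss:
# temperature-scaled KL between teacher and student logits plus cross-entropy on labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term, rescaled by T^2 as in Hinton et al.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)   # standard supervised term
    return alpha * soft + (1.0 - alpha) * hard

student_logits = torch.randn(4, 10, requires_grad=True)   # placeholder: batch of 4, 10 classes
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```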
arXiv Detail & Related papers (2020-06-09T21:47:17Z) - MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model parameter sizes.
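A minimal sketch of the attention-distribution part of such a last-layer self-attention distillation loss is given below; it assumes teacher and student have the same number of attention heads and omits MiniLM's value-relation term, so it is an illustration rather than the paper's full objective.

```python
# Sketch of deep self-attention distillation (attention-distribution term only):
# KL divergence between teacher and student last-layer attention distributions.
import torch
import torch.nn.functional as F

def attention_transfer_loss(student_attn: torch.Tensor, teacher_attn: torch.Tensor) -> torch.Tensor:
    # Both tensors: (batch, heads, seq_len, seq_len), rows are softmax distributions.
    kl = F.kl_div(torch.log(student_attn + 1e-12), teacher_attn, reduction="none")
    return kl.sum(dim=-1).mean()   # sum over attended positions, average over batch/heads/queries

student_attn = torch.softmax(torch.randn(2, 4, 16, 16), dim=-1)  # placeholder attention maps
teacher_attn = torch.softmax(torch.randn(2, 4, 16, 16), dim=-1)
print(attention_transfer_loss(student_attn, teacher_attn))
```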
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.