Extreme compression of sentence-transformer ranker models: faster
inference, longer battery life, and less storage on edge devices
- URL: http://arxiv.org/abs/2207.12852v1
- Date: Wed, 29 Jun 2022 08:07:09 GMT
- Title: Extreme compression of sentence-transformer ranker models: faster
inference, longer battery life, and less storage on edge devices
- Authors: Amit Chaulwar, Lukas Malik, Maciej Krajewski, Felix Reichel,
Leif-Nissen Lundbæk, Michael Huth and Bartlomiej Matejczyk
- Abstract summary: We propose two extensions for a popular sentence-transformer distillation procedure to reduce memory requirements and energy consumption.
We evaluate these extensions on two different types of ranker models.
- Score: 1.3854111346209868
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Modern search systems use several large ranker models with transformer
architectures. These models require large computational resources and are not
suitable for usage on devices with limited computational resources. Knowledge
distillation is a popular compression technique that can reduce the resource
needs of such models, where a large teacher model transfers knowledge to a
small student model. To drastically reduce memory requirements and energy
consumption, we propose two extensions for a popular sentence-transformer
distillation procedure: generation of an optimal size vocabulary and
dimensionality reduction of the embedding dimension of teachers prior to
distillation. We evaluate these extensions on two different types of ranker
models. The result is extremely compressed student models whose evaluation on a
test dataset demonstrates the significance and utility of our proposed extensions.
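The second extension, shrinking the teacher's embedding dimension before distillation, can be sketched with the sentence-transformers library: fit PCA on teacher sentence embeddings, append a linear projection to the teacher, then train the student to regress onto the reduced embeddings. The snippet below is a minimal sketch under assumed model names, target dimension, and corpus; it is not the authors' exact pipeline, and their optimal-size vocabulary generation is only indicated by a comment.

```python
# Minimal sketch: reduce a teacher sentence-transformer's output dimension with PCA,
# then distill it into a small student by regressing student embeddings onto the
# reduced teacher embeddings (MSE). Model names, dims and corpus are assumptions.
import torch
from torch.utils.data import DataLoader
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer, models, losses, InputExample

sentences = [f"placeholder training sentence {i}" for i in range(1000)]  # stand-in corpus

# 1) Teacher with reduced embedding dimension: fit PCA on teacher embeddings and
#    append a frozen linear projection so the teacher now emits 128-d vectors.
teacher = SentenceTransformer("paraphrase-MiniLM-L6-v2")            # assumed teacher
teacher_emb = teacher.encode(sentences, convert_to_numpy=True)
pca = PCA(n_components=128)
pca.fit(teacher_emb)
proj = models.Dense(
    in_features=teacher.get_sentence_embedding_dimension(),
    out_features=128,
    bias=False,
    activation_function=torch.nn.Identity(),
)
proj.linear.weight = torch.nn.Parameter(torch.tensor(pca.components_, dtype=torch.float32))
teacher.add_module("pca_projection", proj)

# 2) Small student emitting the same 128-d space. The paper's optimal-size
#    vocabulary would enter here, by building the student backbone on top of a
#    tokenizer retrained with a smaller vocabulary (not shown).
backbone = models.Transformer("google/bert_uncased_L-2_H-128_A-2")  # assumed tiny student
pooling = models.Pooling(backbone.get_word_embedding_dimension())
student = SentenceTransformer(modules=[backbone, pooling])

# 3) Distillation: student embeddings are regressed onto the reduced teacher embeddings.
targets = teacher.encode(sentences, convert_to_numpy=True)           # now 128-d
examples = [InputExample(texts=[s], label=t.tolist()) for s, t in zip(sentences, targets)]
loader = DataLoader(examples, shuffle=True, batch_size=16)
student.fit(train_objectives=[(loader, losses.MSELoss(model=student))],
            epochs=1, warmup_steps=10)
```

In the paper both levers, the smaller vocabulary and the smaller embedding dimension, are applied to ranker models; the sketch only marks where each would plug in.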
Related papers
- Prompt-Based Exemplar Super-Compression and Regeneration for
Class-Incremental Learning [22.676222987218555]
The exemplar super-compression and regeneration method, ESCORT, substantially increases the quantity and enhances the diversity of exemplars.
To minimize the domain gap between generated exemplars and real images, we propose partial compression and diffusion-based data augmentation.
arXiv Detail & Related papers (2023-11-30T05:59:31Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of
Language Model [92.55145016562867]
We propose a new family of unbiased estimators, WTA-CRS, for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
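The WTA-CRS estimator itself is not detailed in this summary; as an illustrative baseline, the sketch below implements the standard column-row sampling (CRS) estimator that such work builds on, which approximates a matrix product by sampling column-row pairs in proportion to their norm product. The winner-take-all selection rule is not reproduced here.

```python
# Sketch of plain column-row sampling (CRS): an unbiased estimator of A @ B that
# samples k column/row pairs with probability proportional to ||A[:, i]|| * ||B[i, :]||.
# WTA-CRS changes how pairs are selected to reduce variance; that is not shown here.
import numpy as np

def crs_matmul(A: np.ndarray, B: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    rng = np.random.default_rng(seed)
    n = A.shape[1]                       # shared (inner) dimension
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()              # sampling distribution over column/row pairs
    idx = rng.choice(n, size=k, replace=True, p=p)
    # Rescale each sampled outer product by 1 / (k * p_i) so the estimate is unbiased.
    est = np.zeros((A.shape[0], B.shape[1]))
    for i in idx:
        est += np.outer(A[:, i], B[i, :]) / (k * p[i])
    return est

A, B = np.random.randn(64, 512), np.random.randn(512, 32)
approx = crs_matmul(A, B, k=128)
print(np.linalg.norm(approx - A @ B) / np.linalg.norm(A @ B))  # relative error of the estimate
```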
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Multi-stage Progressive Compression of Conformer Transducer for
On-device Speech Recognition [7.450574974954803]
Limited memory bandwidth in smart devices prompts the development of smaller Automatic Speech Recognition (ASR) models.
Knowledge distillation (KD) is a popular model compression approach that has been shown to achieve smaller model sizes.
We propose a multi-stage progressive approach to compress the conformer transducer model using KD.
arXiv Detail & Related papers (2022-10-01T02:23:00Z) - Ensemble Transformer for Efficient and Accurate Ranking Tasks: an
Application to Question Answering Systems [99.13795374152997]
We propose a neural network designed to distill an ensemble of large transformers into a single smaller model.
An MHS model consists of two components: a stack of transformer layers used to encode inputs, and a set of ranking heads.
Unlike traditional distillation techniques, our approach leverages individual models in ensemble as teachers in a way that preserves the diversity of the ensemble members.
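The summary only names the two components; a minimal PyTorch sketch of such an architecture, with an assumed backbone checkpoint and head count and with the ensemble-distillation training objective omitted, might look as follows (each ranking head would be matched to one teacher of the ensemble).

```python
# Sketch of a multi-head student ranker: one shared transformer encoder plus a set of
# ranking heads, each intended to mimic a different teacher from the ensemble.
# Backbone checkpoint, head count and dimensions are assumptions; training is not shown.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class MultiHeadStudentRanker(nn.Module):
    def __init__(self, backbone: str = "prajjwal1/bert-tiny", num_heads: int = 4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(backbone)        # shared transformer stack
        hidden = self.encoder.config.hidden_size
        # One scalar-scoring head per ensemble teacher, preserving their diversity.
        self.ranking_heads = nn.ModuleList(nn.Linear(hidden, 1) for _ in range(num_heads))

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        scores = torch.stack([head(cls).squeeze(-1) for head in self.ranking_heads], dim=-1)
        return scores, scores.mean(dim=-1)    # per-head scores and their ensemble average

tok = AutoTokenizer.from_pretrained("prajjwal1/bert-tiny")
batch = tok("what is knowledge distillation?",
            "a teacher-student compression method", return_tensors="pt")
model = MultiHeadStudentRanker()
per_head_scores, fused_score = model(batch["input_ids"], batch["attention_mask"])
```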
arXiv Detail & Related papers (2022-01-15T06:21:01Z) - Sparse Distillation: Speeding Up Text Classification by Using Bigger
Models [49.8019791766848]
Distilling state-of-the-art transformer models into lightweight student models is an effective way to reduce computation cost at inference time.
In this paper, we aim to further push the limit of inference speed by exploring a new area in the design space of the student model.
Our experiments show that the student models retain 97% of the RoBERTa-Large teacher performance on a collection of six text classification tasks.
arXiv Detail & Related papers (2021-10-16T10:04:14Z) - TERA: Self-Supervised Learning of Transformer Encoder Representation for
Speech [63.03318307254081]
TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representations extraction or fine-tuning with downstream models.
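The three alteration axes are not listed in this summary; in TERA they are time, channel (frequency), and magnitude alterations of acoustic features, with the encoder trained to reconstruct the clean input. The sketch below applies simple stand-in alterations to a placeholder log-mel feature matrix and uses an L1 reconstruction loss; mask widths, noise scale, and the toy encoder are illustrative assumptions, not the paper's settings.

```python
# Sketch of TERA-style alteration pre-training: corrupt log-mel features along time,
# channel and magnitude axes, then train an encoder to reconstruct the clean frames.
import torch
from torch import nn

def alter(x: torch.Tensor, time_width=8, num_channels=10, noise_std=0.2) -> torch.Tensor:
    x = x.clone()
    T, F = x.shape
    t0 = torch.randint(0, max(T - time_width, 1), (1,)).item()
    x[t0:t0 + time_width, :] = 0.0                  # time alteration: mask a span of frames
    c0 = torch.randint(0, max(F - num_channels, 1), (1,)).item()
    x[:, c0:c0 + num_channels] = 0.0                # channel alteration: mask a frequency band
    return x + noise_std * torch.randn_like(x)      # magnitude alteration: additive noise

features = torch.randn(200, 80)                     # placeholder log-mel spectrogram (T x F)
encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))  # stand-in encoder
recon = encoder(alter(features))
loss = nn.functional.l1_loss(recon, features)       # reconstruct the unaltered frames
loss.backward()
```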
arXiv Detail & Related papers (2020-07-12T16:19:00Z) - Knowledge Distillation: A Survey [87.51063304509067]
Deep neural networks have been successful in both industry and academia, especially for computer vision tasks.
It is a challenge to deploy these cumbersome deep models on devices with limited resources.
Knowledge distillation effectively learns a small student model from a large teacher model.
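For reference, the simplest form of this teacher-to-student transfer is the classic soft-target objective surveyed in the paper: a temperature-scaled KL term between teacher and student logits combined with the usual hard-label cross-entropy. The sketch below uses illustrative temperature and weighting values.

```python
# Sketch of the classic soft-target knowledge-distillation loss:
# temperature-scaled KL between teacher and student logits plus cross-entropy on labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term, rescaled by T^2 as in Hinton et al.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)   # standard supervised term
    return alpha * soft + (1.0 - alpha) * hard

student_logits = torch.randn(4, 10, requires_grad=True)   # placeholder: batch of 4, 10 classes
teacher_logits = torch.randn(4, 10)
labels = torch.randint(0, 10, (4,))
distillation_loss(student_logits, teacher_logits, labels).backward()
```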
arXiv Detail & Related papers (2020-06-09T21:47:17Z) - MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression
of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student.
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model parameter sizes.
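A minimal sketch of the attention-distribution part of such a last-layer self-attention distillation loss is given below; it assumes teacher and student have the same number of attention heads and omits MiniLM's value-relation term, so it is an illustration rather than the paper's full objective.

```python
# Sketch of deep self-attention distillation (attention-distribution term only):
# KL divergence between teacher and student last-layer attention distributions.
import torch
import torch.nn.functional as F

def attention_transfer_loss(student_attn: torch.Tensor, teacher_attn: torch.Tensor) -> torch.Tensor:
    # Both tensors: (batch, heads, seq_len, seq_len), rows are softmax distributions.
    kl = F.kl_div(torch.log(student_attn + 1e-12), teacher_attn, reduction="none")
    return kl.sum(dim=-1).mean()   # sum over attended positions, average over batch/heads/queries

student_attn = torch.softmax(torch.randn(2, 4, 16, 16), dim=-1)  # placeholder attention maps
teacher_attn = torch.softmax(torch.randn(2, 4, 16, 16), dim=-1)
print(attention_transfer_loss(student_attn, teacher_attn))
```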
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.