Compressing Sentence Representation for Semantic Retrieval via
Homomorphic Projective Distillation
- URL: http://arxiv.org/abs/2203.07687v1
- Date: Tue, 15 Mar 2022 07:05:43 GMT
- Title: Compressing Sentence Representation for Semantic Retrieval via
Homomorphic Projective Distillation
- Authors: Xuandong Zhao, Zhiguo Yu, Ming Wu, Lei Li
- Abstract summary: We propose Homomorphic Projective Distillation (HPD) to learn compressed sentence embeddings.
Our method augments a small Transformer encoder model with learnable projection layers to produce compact representations.
- Score: 28.432799973328127
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: How can we learn highly compact yet effective sentence representations?
Pre-trained language models have been effective in many NLP tasks. However,
these models are often huge and produce large sentence embeddings. Moreover,
there is a substantial performance gap between large and small models. In this paper,
we propose Homomorphic Projective Distillation (HPD) to learn compressed
sentence embeddings. Our method augments a small Transformer encoder model with
learnable projection layers to produce compact representations while mimicking
a large pre-trained language model to retain the sentence representation
quality. We evaluate our method with different model sizes on both semantic
textual similarity (STS) and semantic retrieval (SR) tasks. Experiments show
that our method achieves a 2.7-4.5 point performance gain on STS tasks compared
with the previous best representations of the same size. In SR tasks, our method
achieves 8.2$\times$ faster retrieval and 8.0$\times$ lower memory usage than
state-of-the-art large models.
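The abstract describes the HPD architecture but not its training details. As a rough illustration (not the authors' implementation), the sketch below pairs a small Transformer encoder with a learnable projection layer and distills it against a frozen teacher sentence encoder; the model names, the 128-dimensional target size, the PCA-style `reduce_fn`, and the MSE objective are all assumptions.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

# Assumed model choices for illustration only; the paper's exact student,
# teacher, target dimension, and objective may differ.
STUDENT_NAME = "nreimers/MiniLM-L6-H384-uncased"
TEACHER_NAME = "princeton-nlp/sup-simcse-roberta-large"

class HPDStudent(nn.Module):
    """Small Transformer encoder plus a learnable projection layer that
    emits a compact (e.g. 128-d) sentence embedding."""
    def __init__(self, name: str = STUDENT_NAME, out_dim: int = 128):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(name)
        self.proj = nn.Linear(self.encoder.config.hidden_size, out_dim)

    def forward(self, **batch):
        hidden = self.encoder(**batch).last_hidden_state      # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)           # mean pooling
        return self.proj(pooled)                                # (B, out_dim)

def distill_step(student, student_tok, teacher, teacher_tok, reduce_fn,
                 sentences, optimizer):
    """One step of embedding distillation: the student's compact embedding is
    trained (here with MSE, an assumption) to match the teacher's sentence
    embedding after it is reduced to the same dimension by `reduce_fn`
    (e.g. a fixed PCA projection, also an assumption)."""
    s_batch = student_tok(sentences, padding=True, truncation=True, return_tensors="pt")
    t_batch = teacher_tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        target = reduce_fn(teacher(**t_batch).last_hidden_state[:, 0])  # CLS pooling
    loss = nn.functional.mse_loss(student(**s_batch), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At retrieval time only the small student and its projection would be kept, which is where the reported speed and memory savings would come from.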
Related papers
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z) - Pruning Large Language Models with Semi-Structural Adaptive Sparse Training [17.381160429641316]
We propose a pruning pipeline for semi-structured sparse models via retraining, termed Adaptive Sparse Trainer (AST).
AST transforms dense models into sparse ones by applying decay to masked weights while allowing the model to adaptively select masks throughout the training process.
Our work demonstrates the feasibility of deploying semi-structured sparse large language models and introduces a novel method for achieving highly compressed models.
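The entry above describes AST only at a high level; the sketch below illustrates the general idea of decaying masked weights instead of zeroing them so the sparsity pattern can keep adapting during training. The 2:4 pattern, the decay factor, and the per-step mask refresh are assumptions, not the authors' recipe.

```python
import torch

@torch.no_grad()
def two_four_mask(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in every contiguous group of 4."""
    groups = weight.reshape(-1, 4)
    keep = groups.abs().topk(2, dim=1).indices
    mask = torch.zeros_like(groups, dtype=torch.bool).scatter_(1, keep, True)
    return mask.reshape_as(weight)

@torch.no_grad()
def apply_masked_decay(weight: torch.Tensor, decay: float = 0.99) -> None:
    """Shrink masked-out weights instead of hard-zeroing them, so a weight whose
    magnitude grows back can re-enter the mask later (adaptive mask selection)."""
    mask = two_four_mask(weight)
    weight.mul_(torch.where(mask, torch.ones_like(weight), torch.full_like(weight, decay)))

# Usage after each optimizer.step(), for layers whose weights fit the 2:4 layout:
# for module in model.modules():
#     if isinstance(module, torch.nn.Linear) and module.weight.numel() % 4 == 0:
#         apply_masked_decay(module.weight)
```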
arXiv Detail & Related papers (2024-07-30T06:33:44Z) - ESE: Espresso Sentence Embeddings [11.682642816354418]
High-quality sentence embeddings are fundamental in many natural language processing (NLP) tasks.
We propose a novel sentence embedding model, Espresso Sentence Embeddings (ESE), with two learning processes.
arXiv Detail & Related papers (2024-02-22T18:35:05Z) - Learning High-Quality and General-Purpose Phrase Representations [9.246374019271938]
Phrase representations play an important role in data science and natural language processing.
The current state-of-the-art approach involves fine-tuning pre-trained language models for phrasal embeddings.
We propose an improved framework to learn phrase representations in a context-free fashion.
arXiv Detail & Related papers (2024-01-18T22:32:31Z) - Compressing Sentence Representation with Maximum Coding Rate Reduction [0.0]
In most natural language inference problems, a sentence representation is needed for semantic retrieval tasks.
Due to hardware limitations on memory and runtime, smaller models need to attain results comparable to larger ones.
We demonstrate that the new language model with reduced complexity and sentence embedding size can achieve comparable results on semantic retrieval benchmarks.
arXiv Detail & Related papers (2023-04-25T09:23:43Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z) - EPIC TTS Models: Empirical Pruning Investigations Characterizing
Text-To-Speech Models [26.462819114575172]
This is the first work to compare sparsity paradigms in text-to-speech synthesis.
arXiv Detail & Related papers (2022-09-22T09:47:25Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided
Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
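Both MoE entries above rely on the same conditional-computation idea; the toy layer below (an illustration, not either paper's implementation) routes each token to one of several expert feed-forward networks, so capacity grows with the number of experts while per-token compute stays roughly flat. Real MoE language models use many more experts plus load-balancing losses.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Toy top-1-routed Mixture-of-Experts feed-forward layer."""
    def __init__(self, d_model: int = 256, d_ff: int = 1024, n_experts: int = 4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                        # x: (tokens, d_model)
        gate = self.router(x).softmax(-1)        # routing probabilities
        top1 = gate.argmax(-1)                   # chosen expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            sel = top1 == i                      # tokens routed to expert i
            if sel.any():
                # scale by the gate value so the router still receives gradients
                out[sel] = expert(x[sel]) * gate[sel][:, i:i + 1]
        return out
```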
arXiv Detail & Related papers (2021-12-20T17:05:11Z) - TERA: Self-Supervised Learning of Transformer Encoder Representation for
Speech [63.03318307254081]
TERA stands for Transformer Encoder Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representations extraction or fine-tuning with downstream models.
arXiv Detail & Related papers (2020-07-12T16:19:00Z)
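For the TERA entry above, the sketch below illustrates what "alteration along three axes" could look like on log-mel features: a zeroed time block, a zeroed frequency block, and a magnitude perturbation, with the model trained to reconstruct the unaltered input. The specific widths, noise scale, and plain L1 loss are illustrative assumptions, not the authors' exact configuration.

```python
import torch

def alter(spec: torch.Tensor, time_width: int = 7, freq_width: int = 8,
          noise_std: float = 0.2) -> torch.Tensor:
    """spec: (batch, frames, mel_bins) log-mel features; returns an altered copy."""
    x = spec.clone()
    _, n_frames, n_bins = x.shape
    # time alteration: zero a random contiguous block of frames
    t0 = torch.randint(0, max(n_frames - time_width, 1), (1,)).item()
    x[:, t0:t0 + time_width, :] = 0.0
    # frequency (channel) alteration: zero a random block of mel bins
    f0 = torch.randint(0, max(n_bins - freq_width, 1), (1,)).item()
    x[:, :, f0:f0 + freq_width] = 0.0
    # magnitude alteration: add Gaussian noise to the whole input
    return x + noise_std * torch.randn_like(x)

def pretrain_loss(encoder, spec):
    """encoder: any model mapping (B, T, F) -> (B, T, F); the objective is to
    reconstruct the original frames from the altered input."""
    return torch.nn.functional.l1_loss(encoder(alter(spec)), spec)
```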
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.