Extracting Sentence Embeddings from Pretrained Transformer Models
- URL: http://arxiv.org/abs/2408.08073v2
- Date: Thu, 20 Feb 2025 14:33:47 GMT
- Title: Extracting Sentence Embeddings from Pretrained Transformer Models
- Authors: Lukas Stankevičius, Mantas Lukoševičius
- Abstract summary: We show that representation-shaping techniques significantly improve sentence embeddings extracted from BERT-based and simple baseline models.
All methods are tested on eight Semantic Textual Similarity (STS), six short text clustering, and twelve classification tasks.
Improvements are very high for static token-based models, especially random embeddings on STS tasks, which almost reach the performance of BERT-derived representations.
- Abstract: Pre-trained transformer models shine in many natural language processing tasks and are therefore expected to represent the meaning of an input sentence or text. These sentence-level embeddings are also important in retrieval-augmented generation. But do commonly used plain averaging or prompt templates sufficiently capture and represent the underlying meaning? After providing a comprehensive review of existing sentence embedding extraction and refinement methods, we thoroughly test different combinations and our original extensions of the most promising ones on pretrained models. Namely, given the hidden representations of the 110M-parameter BERT from multiple layers and many tokens, we try diverse ways to extract optimal sentence embeddings. We test various token aggregation and representation post-processing techniques. We also test multiple ways of using a general Wikitext dataset to complement BERT's sentence embeddings. All methods are tested on eight Semantic Textual Similarity (STS), six short text clustering, and twelve classification tasks. We also evaluate our representation-shaping techniques on other static models, including random token representations. The proposed representation extraction methods improve performance on STS and clustering tasks for all models considered. Improvements are very high for static token-based models, especially random embeddings on STS tasks, which almost reach the performance of BERT-derived representations. Our work shows that representation-shaping techniques significantly improve sentence embeddings extracted from BERT-based and simple baseline models.
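For readers unfamiliar with the "plain averaging" baseline the abstract questions, the following is a minimal illustrative sketch, not the authors' implementation. It averages per-token vectors while skipping padding positions, then compares two sentence embeddings by cosine similarity, the usual scoring for STS tasks. The function names `mean_pool` and `cosine_similarity` and the toy two-dimensional vectors are assumptions made for illustration; real token vectors would come from a model such as BERT.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped)."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            for i, value in enumerate(vector):
                total[i] += value
            count += 1
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in total]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy inputs (hypothetical): the last token of sentence A is padding.
tokens_a = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask_a = [1, 1, 0]
tokens_b = [[1.0, 1.0], [1.0, 1.0]]
mask_b = [1, 1]

embedding_a = mean_pool(tokens_a, mask_a)
embedding_b = mean_pool(tokens_b, mask_b)
similarity = cosine_similarity(embedding_a, embedding_b)
```

Here the padded token is excluded from the average, so the two toy sentences end up pointing in the same direction. The token aggregation and post-processing techniques studied in the paper replace or follow this averaging step.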
Related papers
- Ditto: A Simple and Efficient Approach to Improve Sentence Embeddings [29.273438110694574]
Sentence embeddings from pre-trained language models suffer from a bias towards uninformative words.
We propose a simple and efficient unsupervised approach, Diagonal Attention Pooling (Ditto), which weights words with model-based importance estimations.
We show Ditto can alleviate the anisotropy problem and improve various pre-trained models on semantic textual similarity tasks.
arXiv Detail & Related papers (2023-05-18T07:56:40Z) - Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
arXiv Detail & Related papers (2023-05-09T11:00:02Z) - SimpleBERT: A Pre-trained Model That Learns to Generate Simple Words [59.142185753887645]
In this work, we propose a continued pre-training method for text simplification.
We use a small-scale simple text dataset for continued pre-training and employ two methods to identify simple words.
We obtain SimpleBERT, which surpasses BERT in both lexical simplification and sentence simplification tasks.
arXiv Detail & Related papers (2022-04-16T11:28:01Z) - MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z) - PERT: Pre-training BERT with Permuted Language Model [24.92527883997854]
PERT is an auto-encoding model (like BERT) trained with a Permuted Language Model (PerLM) objective.
We permute a proportion of the input text, and the training objective is to predict the position of the original token.
We carried out extensive experiments on both Chinese and English NLU benchmarks.
arXiv Detail & Related papers (2022-03-14T07:58:34Z) - CharBERT: Character-aware Pre-trained Language Model [36.9333890698306]
We propose a character-aware pre-trained language model named CharBERT.
We first construct the contextual word embedding for each token from the sequential character representations.
We then fuse the representations of characters and the subword representations by a novel heterogeneous interaction module.
arXiv Detail & Related papers (2020-11-03T07:13:06Z) - Automated Concatenation of Embeddings for Structured Prediction [75.44925576268052]
We propose Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks.
We follow strategies in reinforcement learning to optimize the parameters of the controller and compute the reward based on the accuracy of a task model.
arXiv Detail & Related papers (2020-10-10T14:03:20Z) - Multiple Word Embeddings for Increased Diversity of Representation [15.279850826041066]
We show a technique that substantially and consistently improves performance over a strong baseline with negligible increase in run time.
We analyze aspects of pre-trained embedding similarity and vocabulary coverage and find that the representational diversity is the driving force of why this technique works.
arXiv Detail & Related papers (2020-09-30T02:33:09Z) - POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z) - BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z) - SBERT-WK: A Sentence Embedding Method by Dissecting BERT-based Word Models [43.18970770343777]
A contextualized word representation, called BERT, achieves state-of-the-art performance in quite a few NLP tasks.
Yet, generating a high-quality sentence representation from BERT-based word models remains an open problem.
We propose a new sentence embedding method by dissecting BERT-based word models through geometric analysis of the space spanned by the word representation.
arXiv Detail & Related papers (2020-02-16T19:02:52Z)