Extracting Sentence Embeddings from Pretrained Transformer Models
- URL: http://arxiv.org/abs/2408.08073v1
- Date: Thu, 15 Aug 2024 10:54:55 GMT
- Title: Extracting Sentence Embeddings from Pretrained Transformer Models
- Authors: Lukas Stankevičius, Mantas Lukoševičius
- Abstract summary: Given the hidden representations of the 110M-parameter BERT model, taken from multiple layers and multiple tokens, we tried various ways to extract optimal sentence representations.
All methods were tested on 8 Semantic Textual Similarity (STS), 6 short text clustering, and 12 classification tasks.
Improvements for static token-based models are very high; on STS tasks, even random embeddings almost reach the performance of BERT-derived representations.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Background/introduction: Pre-trained transformer models shine in many natural language processing tasks and are therefore expected to bear the representation of the input sentence or text meaning. These sentence-level embeddings are also important in retrieval-augmented generation. But do commonly used plain averaging or prompt templates surface it enough? Methods: Given the hidden representations of the 110M-parameter BERT model, taken from multiple layers and multiple tokens, we tried various ways to extract optimal sentence representations. We tested various token aggregation and representation post-processing techniques. We also tested multiple ways of using a general Wikitext dataset to complement BERT's sentence representations. All methods were tested on 8 Semantic Textual Similarity (STS), 6 short text clustering, and 12 classification tasks. We also evaluated our representation-shaping techniques on other static models, including random token representations. Results: The proposed representation extraction methods improved performance on STS and clustering tasks for all models considered. Improvements for static token-based models are very high; on STS tasks, even random embeddings almost reach the performance of BERT-derived representations. Conclusions: Our work shows that, for multiple tasks, simple baselines with representation-shaping techniques reach or even outperform more complex BERT-based models, or are able to contribute to their performance.
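As a rough illustration of the kind of extraction the paper studies, the sketch below pools BERT hidden states across layers and tokens into a fixed-size sentence vector. This is a minimal sketch, not the authors' exact pipeline: it assumes the Hugging Face transformers library and the bert-base-uncased checkpoint (roughly 110M parameters), and the specific layer choice and mean pooling are illustrative stand-ins for the many aggregation and post-processing variants compared in the paper.
```python
# Minimal sketch (not the authors' exact pipeline): pool BERT hidden states
# across layers and tokens into fixed-size sentence embeddings.
# Assumes the Hugging Face `transformers` library and bert-base-uncased (~110M params);
# the layer choice and pooling below are illustrative.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

sentences = ["A man is playing a guitar.", "Someone plays an acoustic guitar."]
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# out.hidden_states is a tuple of (num_layers + 1) tensors, each (batch, seq, hidden).
# One illustrative aggregation: average an early and the final encoder layer ...
layer_avg = (out.hidden_states[1] + out.hidden_states[-1]) / 2.0

# ... then mean-pool over non-padding tokens.
mask = batch["attention_mask"].unsqueeze(-1).float()           # (batch, seq, 1)
sentence_emb = (layer_avg * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, hidden)

# The paper additionally studies post-processing of the embedding space
# (e.g., normalization techniques), which is omitted here.

# Cosine similarity, as used in STS-style evaluation.
sim = torch.nn.functional.cosine_similarity(sentence_emb[0], sentence_emb[1], dim=0)
print(f"cosine similarity: {sim.item():.3f}")
```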
Related papers
- Large Language Model as Attributed Training Data Generator: A Tale of Diversity and Bias [92.41919689753051]
Large language models (LLMs) have been recently leveraged as training data generators for various natural language processing (NLP) tasks.
We investigate training data generation with diversely attributed prompts, which have the potential to yield diverse and attributed generated data.
We show that attributed prompts outperform simple class-conditional prompts in terms of the resulting model's performance.
arXiv Detail & Related papers (2023-06-28T03:31:31Z)
- Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a Simple method named Self-Contrastive Learning (SSCL) to alleviate this issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
arXiv Detail & Related papers (2023-05-09T11:00:02Z)
- MoEBERT: from BERT to Mixture-of-Experts via Importance-Guided Adaptation [68.30497162547768]
We propose MoEBERT, which uses a Mixture-of-Experts structure to increase model capacity and inference speed.
We validate the efficiency and effectiveness of MoEBERT on natural language understanding and question answering tasks.
arXiv Detail & Related papers (2022-04-15T23:19:37Z)
- PERT: Pre-training BERT with Permuted Language Model [24.92527883997854]
PERT is an auto-encoding model (like BERT) trained with a Permuted Language Model (PerLM) objective.
We permute a proportion of the input text, and the training objective is to predict the position of the original token.
We carried out extensive experiments on both Chinese and English NLU benchmarks.
arXiv Detail & Related papers (2022-03-14T07:58:34Z)
- Automated Concatenation of Embeddings for Structured Prediction [75.44925576268052]
We propose Automated Concatenation of Embeddings (ACE) to automate the process of finding better concatenations of embeddings for structured prediction tasks.
We follow strategies in reinforcement learning to optimize the parameters of the controller and compute the reward based on the accuracy of a task model.
arXiv Detail & Related papers (2020-10-10T14:03:20Z)
- Multiple Word Embeddings for Increased Diversity of Representation [15.279850826041066]
We show a technique that substantially and consistently improves performance over a strong baseline with negligible increase in run time.
We analyze aspects of pre-trained embedding similarity and vocabulary coverage and find that the representational diversity is the driving force of why this technique works.
arXiv Detail & Related papers (2020-09-30T02:33:09Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner.
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
- BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts a Siamese network, learning sentence-level representations from a natural language inference dataset and word/phrase-level representations from a paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z)
- SBERT-WK: A Sentence Embedding Method by Dissecting BERT-based Word Models [43.18970770343777]
A contextualized word representation, called BERT, achieves state-of-the-art performance in quite a few NLP tasks.
Yet, it is an open problem to generate a high quality sentence representation from BERT-based word models.
We propose a new sentence embedding method by dissecting BERT-based word models through geometric analysis of the space spanned by the word representation.
arXiv Detail & Related papers (2020-02-16T19:02:52Z)