ESimCSE: Enhanced Sample Building Method for Contrastive Learning of
Unsupervised Sentence Embedding
- URL: http://arxiv.org/abs/2109.04380v1
- Date: Thu, 9 Sep 2021 16:07:31 GMT
- Title: ESimCSE: Enhanced Sample Building Method for Contrastive Learning of
Unsupervised Sentence Embedding
- Authors: Xing Wu, Chaochen Gao, Liangjun Zang, Jizhong Han, Zhongyuan Wang,
Songlin Hu
- Abstract summary: The current state-of-the-art unsupervised method is the unsupervised SimCSE (unsup-SimCSE)
We develop a new sentence embedding method, termed Enhanced Unsup-SimCSE (ESimCSE)
ESimCSE outperforms the state-of-the-art unsup-SimCSE by an average Spearman correlation of 2.02% on BERT-base.
- Score: 41.09180639504244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive learning has been attracting much attention for learning
unsupervised sentence embeddings. The current state-of-the-art unsupervised
method is the unsupervised SimCSE (unsup-SimCSE). Unsup-SimCSE takes dropout as
a minimal data augmentation method, and passes the same input sentence to a
pre-trained Transformer encoder (with dropout turned on) twice to obtain the
two corresponding embeddings to build a positive pair. Because the Transformer uses position embeddings, a sentence's length information is generally encoded into its sentence embedding, so both members of each unsup-SimCSE positive pair carry the same length information. Unsup-SimCSE trained on such pairs is therefore likely biased toward judging sentences of the same or similar length as more semantically similar. Our statistical observations confirm that unsup-SimCSE does exhibit this problem. To alleviate it, we apply a simple repetition operation to
modify the input sentence, and then pass the input sentence and its modified
counterpart to the pre-trained Transformer encoder, respectively, to get the
positive pair. Additionally, drawing inspiration from the computer vision community, we introduce momentum contrast, which enlarges the number of negative pairs without additional computation. The two modifications are applied to positive and negative pairs respectively, and together form a new sentence embedding method, termed Enhanced Unsup-SimCSE (ESimCSE). We evaluate the proposed ESimCSE on several benchmark datasets w.r.t. the semantic textual similarity (STS) task. Experimental results show that ESimCSE outperforms the
state-of-the-art unsup-SimCSE by an average Spearman correlation of 2.02% on
BERT-base.
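The two modifications are easiest to see as code. Below is a minimal Python sketch of the positive-pair side: a word-repetition augmentation that randomly duplicates a few tokens so the two views of a sentence no longer share the same length. This is an illustration under assumed settings (the function name and the dup_rate value are ours), not the authors' released implementation.

```python
import random

def word_repetition(tokens, dup_rate=0.32):
    """Randomly duplicate a small fraction of tokens so the augmented view
    differs in length from the original. dup_rate is an assumed setting;
    the number of duplicated tokens is sampled uniformly."""
    max_dup = max(1, int(dup_rate * len(tokens)))
    dup_len = random.randint(0, max_dup)
    dup_idx = set(random.sample(range(len(tokens)), k=min(dup_len, len(tokens))))
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i in dup_idx:
            out.append(tok)  # repeat this token once
    return out

# Positive pair: the original sentence and its repetition-augmented copy.
sentence = "contrastive learning of unsupervised sentence embeddings".split()
print(sentence)
print(word_repetition(sentence))
```

The negative-pair side can be sketched as a MoCo-style momentum encoder whose outputs are kept in a fixed-size queue and reused as extra negatives at no additional gradient cost. Again, the class and parameter names (queue_size, momentum, temp) and the assumption that the encoder maps a batch of inputs to a (batch, dim) embedding tensor are ours, not the paper's.

```python
import copy
import torch
import torch.nn.functional as F

class MomentumNegativeQueue:
    """Minimal sketch of MoCo-style momentum contrast for sentence embeddings.
    `encoder` is assumed to map a batch of inputs to a (batch, dim) tensor;
    queue_size, momentum and temp are illustrative values, not the paper's."""

    def __init__(self, encoder, dim=768, queue_size=256, momentum=0.995, temp=0.05):
        self.encoder = encoder
        self.momentum_encoder = copy.deepcopy(encoder)  # updated only by EMA
        for p in self.momentum_encoder.parameters():
            p.requires_grad = False
        self.queue = F.normalize(torch.randn(queue_size, dim), dim=-1)
        self.momentum = momentum
        self.temp = temp

    @torch.no_grad()
    def _update_queue(self, batch_inputs):
        # Exponential moving average of the online encoder's weights.
        for p_m, p in zip(self.momentum_encoder.parameters(), self.encoder.parameters()):
            p_m.data.mul_(self.momentum).add_(p.data, alpha=1.0 - self.momentum)
        # Encode the current batch with the momentum encoder and enqueue it,
        # dropping the oldest entries to keep the queue size fixed.
        new_neg = F.normalize(self.momentum_encoder(batch_inputs), dim=-1)
        self.queue = torch.cat([new_neg, self.queue], dim=0)[: self.queue.size(0)]

    def loss(self, anchor_emb, positive_emb, batch_inputs):
        # InfoNCE over in-batch positives/negatives plus the queued negatives.
        anchor = F.normalize(anchor_emb, dim=-1)
        positive = F.normalize(positive_emb, dim=-1)
        logits = torch.cat([anchor @ positive.T, anchor @ self.queue.T], dim=1) / self.temp
        labels = torch.arange(anchor.size(0), device=logits.device)  # positives on the diagonal
        out = F.cross_entropy(logits, labels)
        self._update_queue(batch_inputs)
        return out
```

In both sketches the point is the mechanism rather than the exact hyperparameters: repetition decouples sentence length from the positive pair, and the queue enlarges the negative set without recomputing gradients for the queued examples.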
Related papers
- Evaluating Semantic Variation in Text-to-Image Synthesis: A Causal Perspective [50.261681681643076]
We propose a novel metric called SemVarEffect and a benchmark named SemVarBench to evaluate the causality between semantic variations in inputs and outputs in text-to-image synthesis.
Our work establishes an effective evaluation framework that advances the T2I synthesis community's exploration of human instruction understanding.
arXiv Detail & Related papers (2024-10-14T08:45:35Z)
- REBAR: Retrieval-Based Reconstruction for Time-series Contrastive Learning [64.08293076551601]
We propose a novel method of using a learned measure for identifying positive pairs.
Our Retrieval-Based Reconstruction (REBAR) measure quantifies the similarity between two sequences.
We show that the REBAR error is a predictor of mutual class membership.
arXiv Detail & Related papers (2023-11-01T13:44:45Z)
- Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework [68.04940365847543]
We propose a semi-supervised sentence embedding framework, GenSE, that effectively leverages large-scale unlabeled data.
Our method includes three parts: 1) Generate: A generator/discriminator model is jointly trained to synthesize sentence pairs from an open-domain unlabeled corpus; 2) Discriminate: Noisy sentence pairs are filtered out by the discriminator to acquire high-quality positive and negative sentence pairs; 3) Contrast: A prompt-based contrastive approach is presented for sentence representation learning with both annotated and synthesized data.
arXiv Detail & Related papers (2022-10-30T10:15:21Z)
- InfoCSE: Information-aggregated Contrastive Learning of Sentence Embeddings [61.77760317554826]
This paper proposes an information-aggregated contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE.
We evaluate the proposed InfoCSE on several benchmark datasets w.r.t. the semantic textual similarity (STS) task.
Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base, and 1.77% on BERT-large.
arXiv Detail & Related papers (2022-10-08T15:53:19Z)
- Improving Contrastive Learning of Sentence Embeddings with Case-Augmented Positives and Retrieved Negatives [17.90820242798732]
Unsupervised contrastive learning methods still lag far behind their supervised counterparts.
We propose switch-case augmentation to flip the case of the first letter of randomly selected words in a sentence.
For negative samples, we sample hard negatives from the whole dataset based on a pre-trained language model.
arXiv Detail & Related papers (2022-06-06T09:46:12Z)
- DiffCSE: Difference-based Contrastive Learning for Sentence Embeddings [51.274478128525686]
DiffCSE is an unsupervised contrastive learning framework for learning sentence embeddings.
Our experiments show that DiffCSE achieves state-of-the-art results among unsupervised sentence representation learning methods.
arXiv Detail & Related papers (2022-04-21T17:32:01Z)
- S-SimCSE: Sampled Sub-networks for Contrastive Learning of Sentence Embedding [2.9894971434911266]
Contrastive learning has been studied for improving the performance of learning sentence embeddings.
The current state-of-the-art method is SimCSE, which takes dropout as the data augmentation method.
S-SimCSE outperforms the state-of-the-art SimCSE by more than 1% on BERT-base.
arXiv Detail & Related papers (2021-11-23T09:52:45Z)
- Smoothed Contrastive Learning for Unsupervised Sentence Embedding [41.09180639504244]
We introduce a smoothing strategy upon the InfoNCE loss function, termed Gaussian Smoothing InfoNCE (GS-InfoNCE).
GS-InfoNCE outperforms the state-of-the-art unsup-SimCSE by an average Spearman correlation of 1.38%, 0.72%, 1.17% and 0.28% on the base of BERT-base, BERT-large, RoBERTa-base and RoBERTa-large, respectively.
arXiv Detail & Related papers (2021-09-09T14:54:24Z)
- SimCSE: Simple Contrastive Learning of Sentence Embeddings [10.33373737281907]
This paper presents SimCSE, a contrastive learning framework for sentence embeddings.
We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective.
We then incorporate annotated pairs from NLI datasets into contrastive learning by using "entailment" pairs as positives and "contradiction" pairs as hard negatives.
arXiv Detail & Related papers (2021-04-18T11:27:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.