SDA: Simple Discrete Augmentation for Contrastive Sentence Representation Learning
- URL: http://arxiv.org/abs/2210.03963v3
- Date: Fri, 14 Jun 2024 03:42:17 GMT
- Title: SDA: Simple Discrete Augmentation for Contrastive Sentence Representation Learning
- Authors: Dongsheng Zhu, Zhenyu Mao, Jinghui Lu, Rui Zhao, Fei Tan
- Abstract summary: As reported, the dropout-based SimCSE surprisingly dominates discrete augmentations such as cropping, word deletion, and synonym replacement.
We develop three simple yet effective discrete sentence augmentation schemes: punctuation insertion, modal verbs, and double negation.
The results consistently support the superiority of the proposed methods.
- Score: 14.028140579482688
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Contrastive learning has recently achieved compelling performance in unsupervised sentence representation. As an essential element, data augmentation protocols, however, have not been well explored. The pioneering work SimCSE, resorting to a simple dropout mechanism (viewed as continuous augmentation), surprisingly dominates discrete augmentations such as cropping, word deletion, and synonym replacement as reported. To understand the underlying rationales, we revisit existing approaches and attempt to hypothesize the desiderata of reasonable data augmentation methods: balance of semantic consistency and expression diversity. We then develop three simple yet effective discrete sentence augmentation schemes: punctuation insertion, modal verbs, and double negation. They act as minimal noise at the lexical level to produce diverse forms of sentences. Furthermore, standard negation is capitalized on to generate negative samples for alleviating feature suppression involved in contrastive learning. We experimented extensively with semantic textual similarity on diverse datasets. The results consistently support the superiority of the proposed methods. Our key code is available at https://github.com/Zhudongsheng75/SDA
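The three augmentation schemes and the negation-based negatives can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical rendering: the specific punctuation marks, modal frames, and negation templates are assumptions, since the abstract names the schemes but not their exact lexical realization (the authors' implementation is in the linked repository).

```python
import random

# Word lists and sentence frames are illustrative assumptions, not the
# paper's exact choices.
PUNCTUATION_MARKS = [",", ";", ":", "..."]
MODAL_FRAMES = ["Perhaps {s}", "It may be that {s}", "It could be that {s}"]
DOUBLE_NEGATION_FRAME = "It is not untrue that {s}"
NEGATION_FRAME = "It is not true that {s}"  # used to build negative samples


def punctuation_insertion(sentence: str, rng: random.Random) -> str:
    """Insert a single punctuation mark at a random word boundary."""
    words = sentence.split()
    pos = rng.randint(0, len(words))
    words.insert(pos, rng.choice(PUNCTUATION_MARKS))
    return " ".join(words)


def modal_verb(sentence: str, rng: random.Random) -> str:
    """Wrap the sentence in a modal frame that barely changes its meaning."""
    return rng.choice(MODAL_FRAMES).format(s=sentence)


def double_negation(sentence: str) -> str:
    """Two stacked negations cancel out: same semantics, different surface form."""
    return DOUBLE_NEGATION_FRAME.format(s=sentence)


def standard_negation(sentence: str) -> str:
    """A single negation flips the meaning, giving a negative sample."""
    return NEGATION_FRAME.format(s=sentence)


if __name__ == "__main__":
    rng = random.Random(0)
    s = "a man is playing a guitar"
    print(punctuation_insertion(s, rng))
    print(modal_verb(s, rng))
    print(double_negation(s))
    print(standard_negation(s))
```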
Related papers
- Generating Diverse Negations from Affirmative Sentences [0.999726509256195]
Negations are important in real-world applications as they encode negative polarity in verb phrases, clauses, or other expressions.
We propose NegVerse, a method that tackles the lack of negation datasets by producing a diverse range of negation types.
We provide new rules for masking parts of sentences where negations are most likely to occur, based on syntactic structure.
We also propose a filtering mechanism to identify negation cues and remove degenerate examples, producing a diverse range of meaningful perturbations.
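A rough, assumption-laden sketch of the masking and filtering ideas: the heuristic below marks the slot after each auxiliary verb (a crude stand-in for the paper's syntax-based rules) and keeps only candidates that contain a negation cue. The word lists and the `<mask>` token are illustrative, and the generation model that fills the masks is omitted.

```python
import re

# Toy word lists; NegVerse derives its masking rules from syntactic structure
# and fills the masks with a language model, both abstracted away here.
AUXILIARIES = {"is", "are", "was", "were", "do", "does", "did",
               "can", "could", "will", "would", "has", "have", "had"}
NEGATION_CUES = {"not", "no", "never", "none", "nothing", "nobody"}


def mask_negation_slots(sentence: str) -> list[str]:
    """Insert a mask right after each auxiliary, where a negation is most likely."""
    tokens = sentence.split()
    candidates = []
    for i, tok in enumerate(tokens):
        if tok.lower() in AUXILIARIES:
            candidates.append(" ".join(tokens[: i + 1] + ["<mask>"] + tokens[i + 1 :]))
    return candidates


def is_valid_negation(candidate: str, original: str) -> bool:
    """Filter degenerate outputs: require a negation cue and a real change."""
    words = {re.sub(r"\W", "", w.lower()) for w in candidate.split()}
    has_cue = bool(words & NEGATION_CUES) or "n't" in candidate.lower()
    return has_cue and candidate.strip().lower() != original.strip().lower()
```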
arXiv Detail & Related papers (2024-10-30T21:25:02Z) - DenoSent: A Denoising Objective for Self-Supervised Sentence Representation Learning [59.4644086610381]
We propose a novel denoising objective that approaches the problem from another perspective, i.e., the intra-sentence perspective.
By introducing both discrete and continuous noise, we generate noisy sentences and then train our model to restore them to their original form.
Our empirical evaluations demonstrate that this approach delivers competitive results on both semantic textual similarity (STS) and a wide range of transfer tasks.
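A minimal sketch of the two noise types named above, assuming token deletion as the discrete noise and Gaussian perturbation of embeddings as the continuous noise; the deletion probability, noise scale, and the reconstruction decoder are not specified in the summary and are assumed or omitted here.

```python
import random
import torch


def discrete_noise(sentence: str, rng: random.Random, p_del: float = 0.1) -> str:
    """Discrete noise: drop each token independently with probability p_del."""
    kept = [t for t in sentence.split() if rng.random() > p_del]
    return " ".join(kept) if kept else sentence


def continuous_noise(token_embeddings: torch.Tensor, sigma: float = 0.01) -> torch.Tensor:
    """Continuous noise: add Gaussian perturbations to the token embeddings."""
    return token_embeddings + sigma * torch.randn_like(token_embeddings)

# The model is then trained to restore the original sentence from the noisy
# view (the reconstruction decoder and its loss are not shown).
```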
arXiv Detail & Related papers (2024-01-24T17:48:45Z) - RankCSE: Unsupervised Sentence Representations Learning via Learning to Rank [54.854714257687334]
We propose a novel approach, RankCSE, for unsupervised sentence representation learning.
It incorporates ranking consistency and ranking distillation with contrastive learning into a unified framework.
Extensive experiments are conducted on both semantic textual similarity (STS) and transfer (TR) tasks.
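Ranking consistency can be written as a soft listwise agreement between the similarity distributions produced by two views of the batch; the sketch below uses a KL term for this purpose, which is only one plausible instantiation and not necessarily RankCSE's exact listwise consistency or distillation losses.

```python
import torch
import torch.nn.functional as F


def ranking_consistency_loss(sim_a: torch.Tensor, sim_b: torch.Tensor,
                             tau: float = 0.05) -> torch.Tensor:
    """Encourage two views to rank all in-batch candidates the same way.

    sim_a, sim_b: (batch, batch) similarity matrices from two augmented views
    (or from student/teacher encoders in a distillation setting).
    """
    p = F.log_softmax(sim_a / tau, dim=-1)   # log-probabilities of view A
    q = F.softmax(sim_b / tau, dim=-1)       # probabilities of view B
    return F.kl_div(p, q, reduction="batchmean")
```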
arXiv Detail & Related papers (2023-05-26T08:27:07Z) - Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a Simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
arXiv Detail & Related papers (2023-05-09T11:00:02Z) - Generate, Discriminate and Contrast: A Semi-Supervised Sentence Representation Learning Framework [68.04940365847543]
We propose a semi-supervised sentence embedding framework, GenSE, that effectively leverages large-scale unlabeled data.
Our method includes three parts: 1) Generate: a generator/discriminator model is jointly trained to synthesize sentence pairs from an open-domain unlabeled corpus; 2) Discriminate: noisy sentence pairs are filtered out by the discriminator to acquire high-quality positive and negative sentence pairs; 3) Contrast: a prompt-based contrastive approach is presented for sentence representation learning with both annotated and synthesized data.
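The three-stage flow can be sketched as a small pipeline; `generate` and `score` below are placeholder callables standing in for the jointly trained generator/discriminator model, and the threshold value is an assumption.

```python
from typing import Callable


def gense_pipeline(unlabeled: list[str],
                   generate: Callable[[str], tuple[str, str]],
                   score: Callable[[str, str], float],
                   threshold: float = 0.9) -> list[tuple[str, str, str]]:
    """Generate -> Discriminate: build (anchor, positive, negative) triples.

    generate(s) returns a synthesized (positive, negative) candidate pair;
    score(s, c) is the discriminator's confidence that the pair is valid.
    Surviving triples feed the prompt-based contrastive stage (not shown).
    """
    triples = []
    for s in unlabeled:
        pos, neg = generate(s)
        if score(s, pos) >= threshold and score(s, neg) >= threshold:
            triples.append((s, pos, neg))
    return triples
```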
arXiv Detail & Related papers (2022-10-30T10:15:21Z) - Debiased Contrastive Learning of Unsupervised Sentence Representations [88.58117410398759]
Contrastive learning is effective in improving pre-trained language models (PLMs) to derive high-quality sentence representations.
Previous works mostly adopt in-batch negatives or sample from training data at random.
We present a new framework, DCLR, to alleviate the influence of these improper negatives.
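The summary does not detail how improper negatives are handled; one simple way to express the idea (not necessarily DCLR's exact mechanism) is to mask in-batch negatives that are suspiciously similar to the anchor before computing the contrastive loss.

```python
import torch


def negative_weights(sim: torch.Tensor, threshold: float = 0.9) -> torch.Tensor:
    """Zero out likely false negatives in an in-batch similarity matrix.

    sim: (batch, batch) cosine similarities; the diagonal holds the positives.
    Off-diagonal entries above `threshold` are treated as improper negatives
    and excluded from the contrastive denominator.
    """
    weights = (sim < threshold).float()
    weights.fill_diagonal_(1.0)  # always keep the positive pair
    return weights
```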
arXiv Detail & Related papers (2022-05-02T05:07:43Z) - SNCSE: Contrastive Learning for Unsupervised Sentence Embedding with Soft Negative Samples [36.08601841321196]
We propose contrastive learning for unsupervised sentence embedding with soft negative samples.
We show that SNCSE can obtain state-of-the-art performance on the semantic textual similarity (STS) task.
arXiv Detail & Related papers (2022-01-16T06:15:43Z) - SimCSE: Simple Contrastive Learning of Sentence Embeddings [10.33373737281907]
This paper presents SimCSE, a simple contrastive learning framework for sentence embeddings.
We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise.
We then incorporate annotated pairs from NLI datasets into contrastive learning by using "entailment" pairs as positives and "contradiction" pairs as hard negatives.
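The unsupervised objective described above reduces to a standard InfoNCE loss over two dropout-perturbed encodings of the same batch; the sketch below assumes cosine similarity and a temperature of 0.05, which are conventional choices rather than details quoted from this summary.

```python
import torch
import torch.nn.functional as F


def simcse_unsup_loss(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """InfoNCE over two dropout-perturbed encodings of the same sentences.

    z1, z2: (batch, dim) embeddings from two forward passes of the same batch
    with different dropout masks; z2[i] is the positive for z1[i], and every
    other row of z2 serves as an in-batch negative.
    """
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / tau  # (batch, batch)
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(sim, labels)
```

In the supervised variant, z2 would instead hold entailment-pair embeddings, with contradiction embeddings appended as additional hard-negative columns.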
arXiv Detail & Related papers (2021-04-18T11:27:08Z) - CLEAR: Contrastive Learning for Sentence Representation [41.867438597420346]
We propose Contrastive LEArning for sentence Representation (CLEAR), which employs multiple sentence-level augmentation strategies.
These augmentations include word and span deletion, reordering, and substitution.
Our approach is shown to outperform multiple existing methods on both SentEval and GLUE benchmarks.
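The augmentation strategies listed above admit a compact illustration; the sketch below is simplified (single-token reordering instead of span swaps, and a toy synonym table instead of a real substitution resource), so treat it as an assumption-marked approximation rather than CLEAR's implementation.

```python
import random

# Toy synonym table for the substitution example; a real resource is assumed
# in practice.
SYNONYMS = {"quick": ["fast", "rapid"], "lazy": ["idle", "sluggish"]}


def word_deletion(tokens, rng, p=0.1):
    kept = [t for t in tokens if rng.random() > p]
    return kept or tokens  # never return an empty sentence


def span_deletion(tokens, rng, ratio=0.2):
    span = max(1, int(len(tokens) * ratio))
    start = rng.randint(0, max(0, len(tokens) - span))
    return tokens[:start] + tokens[start + span:]


def reordering(tokens, rng):
    # Swap two randomly chosen positions (CLEAR swaps whole spans).
    if len(tokens) < 2:
        return tokens
    i, j = rng.sample(range(len(tokens)), 2)
    tokens = tokens.copy()
    tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def substitution(tokens, rng, p=0.15):
    return [rng.choice(SYNONYMS[t]) if t in SYNONYMS and rng.random() < p else t
            for t in tokens]


if __name__ == "__main__":
    rng = random.Random(0)
    toks = "the quick brown fox jumps over the lazy dog".split()
    for f in (word_deletion, span_deletion, reordering, substitution):
        print(f.__name__, " ".join(f(toks, rng)))
```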
arXiv Detail & Related papers (2020-12-31T06:40:13Z) - On the Sentence Embeddings from Pre-trained Language Models [78.45172445684126]
In this paper, we argue that the semantic information in the BERT embeddings is not fully exploited.
We find that BERT always induces a non-smooth, anisotropic semantic space of sentences, which harms its performance on semantic similarity tasks.
We propose to transform the anisotropic sentence embedding distribution to a smooth and isotropic Gaussian distribution through normalizing flows that are learned with an unsupervised objective.
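The flow-based calibration can be illustrated with the smallest possible invertible transform: a single element-wise affine layer trained by maximizing the Gaussian log-likelihood of the mapped embeddings. The real model stacks deeper Glow-style flows on top of a frozen encoder; the layer choice, dimensions, and optimizer settings below are assumptions.

```python
import math
import torch
from torch import nn


class AffineFlow(nn.Module):
    """Element-wise affine flow z = (x - mu) * exp(-log_sigma).

    The log-determinant of this transform is -sum(log_sigma), so the
    negative log-likelihood under a standard Gaussian prior is tractable.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))
        self.log_sigma = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = (x - self.mu) * torch.exp(-self.log_sigma)
        # log p(x) = log N(z; 0, I) + log |det dz/dx|
        log_pz = -0.5 * (z ** 2 + math.log(2 * math.pi)).sum(dim=-1)
        log_det = -self.log_sigma.sum()
        return -(log_pz + log_det).mean()  # NLL to minimize


# Usage with stand-in sentence embeddings from a frozen encoder
flow = AffineFlow(dim=768)
opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
x = torch.randn(32, 768)  # placeholder for BERT sentence embeddings
loss = flow(x)
loss.backward()
opt.step()
```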
arXiv Detail & Related papers (2020-11-02T13:14:57Z)