Enhancing Unsupervised Sentence Embeddings via Knowledge-Driven Data Augmentation and Gaussian-Decayed Contrastive Learning
- URL: http://arxiv.org/abs/2409.12887v2
- Date: Wed, 2 Oct 2024 03:24:50 GMT
- Title: Enhancing Unsupervised Sentence Embeddings via Knowledge-Driven Data Augmentation and Gaussian-Decayed Contrastive Learning
- Authors: Peichao Lai, Zhengfeng Zhang, Wentao Zhang, Fangcheng Fu, Bin Cui
- Abstract summary: We propose a pipeline-based data augmentation method via large language models (LLMs).
To tackle the issue of low data diversity, our pipeline utilizes knowledge graphs (KGs) to extract entities and quantities.
To address high data noise, the GCSE model uses a Gaussian-decayed function to limit the impact of false hard negative samples.
- Score: 37.54523122932728
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, using large language models (LLMs) for data augmentation has led to considerable improvements in unsupervised sentence embedding models. However, existing methods encounter two primary challenges: limited data diversity and high data noise. Current approaches often neglect fine-grained knowledge, such as entities and quantities, leading to insufficient diversity. Additionally, unsupervised data frequently lacks discriminative information, and the generated synthetic samples may introduce noise. In this paper, we propose a pipeline-based data augmentation method via LLMs and introduce the Gaussian-decayed gradient-assisted Contrastive Sentence Embedding (GCSE) model to enhance unsupervised sentence embeddings. To tackle the issue of low data diversity, our pipeline utilizes knowledge graphs (KGs) to extract entities and quantities, enabling LLMs to generate more diverse, knowledge-enriched samples. To address high data noise, the GCSE model uses a Gaussian-decayed function to limit the impact of false hard negative samples, enhancing the model's discriminative capability. Experimental results show that our approach achieves state-of-the-art performance in semantic textual similarity (STS) tasks, using fewer data samples and smaller LLMs, demonstrating its efficiency and robustness across various models.
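The abstract names two components: a KG-guided LLM augmentation pipeline for diverse, knowledge-enriched samples, and a Gaussian-decayed function that limits the impact of false hard negatives during contrastive training. The sketch below illustrates the second idea in PyTorch as an InfoNCE-style loss; since the abstract does not give the exact formulation, the decay form (a Gaussian on the gap between positive and negative similarity), the temperature, and sigma are illustrative assumptions rather than the paper's actual loss.

```python
# Minimal sketch of a Gaussian-decayed contrastive objective (assumed form,
# not the paper's exact GCSE loss).
import torch
import torch.nn.functional as F


def gaussian_decayed_infonce(anchor, positive, negatives,
                             temperature: float = 0.05,
                             sigma: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss in which each negative's contribution is scaled by
    a Gaussian-shaped weight that decays toward zero as its similarity to the
    anchor approaches the positive similarity (i.e. a likely false negative).

    anchor, positive: (batch, dim) sentence embeddings
    negatives:        (batch, num_neg, dim) hard-negative embeddings
    """
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)                  # (batch,)
    sim_neg = F.cosine_similarity(anchor.unsqueeze(1), negatives, dim=-1)    # (batch, num_neg)

    # Weight -> 0 when a "hard" negative is about as similar as the positive,
    # weight -> 1 when it is clearly dissimilar; sigma controls the decay width.
    gap = sim_pos.unsqueeze(1) - sim_neg
    neg_weight = 1.0 - torch.exp(-gap.pow(2) / (2 * sigma ** 2))

    pos_term = torch.exp(sim_pos / temperature)
    neg_term = (neg_weight * torch.exp(sim_neg / temperature)).sum(dim=1)
    return -torch.log(pos_term / (pos_term + neg_term)).mean()


# Toy usage with random tensors standing in for encoder outputs.
if __name__ == "__main__":
    a, p = torch.randn(8, 768), torch.randn(8, 768)
    n = torch.randn(8, 4, 768)
    print(gaussian_decayed_infonce(a, p, n).item())
```

The point of the decay is that synthetic hard negatives generated by an LLM may actually be paraphrases of the positive; down-weighting negatives whose similarity is close to the positive's reduces the gradient noise such false negatives would otherwise inject.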
Related papers
- Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data [54.934578742209716]
In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets.
LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student.
Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
arXiv Detail & Related papers (2024-11-12T18:57:59Z) - Boosting Disfluency Detection with Large Language Model as Disfluency Generator [8.836888435915077]
We propose a lightweight data augmentation approach for disfluency detection.
We leverage a large language model (LLM) to generate disfluent sentences as augmentation data.
We apply an uncertainty-aware data filtering approach to improve the quality of the generated sentences.
arXiv Detail & Related papers (2024-03-13T04:14:33Z) - Detail Reinforcement Diffusion Model: Augmentation Fine-Grained Visual Categorization in Few-Shot Conditions [11.121652649243119]
Diffusion models have been widely adopted in data augmentation due to their outstanding diversity in data generation.
We propose a novel approach termed the detail reinforcement diffusion model (DRDM).
It leverages the rich knowledge of large models for fine-grained data augmentation and comprises two key components: discriminative semantic recombination (DSR) and spatial knowledge reference (SKR).
arXiv Detail & Related papers (2023-09-15T01:28:59Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis [50.31589712761807]
Large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs.
We investigate the consequences of repeating pre-training data, revealing that the model is susceptible to overfitting.
We then examine the key factors contributing to multi-epoch degradation, finding that significant factors include dataset size, model parameters, and training objectives.
arXiv Detail & Related papers (2023-05-22T17:02:15Z) - A Guide for Practical Use of ADMG Causal Data Augmentation [0.0]
Causal data augmentation strategies have been proposed to address such challenges.
This paper experimentally analyzed the ADMG causal augmentation method considering different settings.
arXiv Detail & Related papers (2023-04-03T09:31:13Z) - Temporal Output Discrepancy for Loss Estimation-based Active Learning [65.93767110342502]
We present a novel deep active learning approach that queries the oracle for data annotation when the unlabeled sample is believed to incorporate high loss.
Our approach outperforms state-of-the-art active learning methods on image classification and semantic segmentation tasks.
arXiv Detail & Related papers (2022-12-20T19:29:37Z) - FakeCLR: Exploring Contrastive Learning for Solving Latent Discontinuity in Data-Efficient GANs [24.18718734850797]
Data-Efficient GANs (DE-GANs) aim to learn generative models with a limited amount of training data.
Contrastive learning has shown great potential for improving the synthesis quality of DE-GANs.
We propose FakeCLR, which only applies contrastive learning on fake samples.
arXiv Detail & Related papers (2022-07-18T14:23:38Z) - Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z) - Generative Data Augmentation for Commonsense Reasoning [75.26876609249197]
G-DAUGC is a novel generative data augmentation method that aims to achieve more accurate and robust learning in the low-resource setting.
G-DAUGC consistently outperforms existing data augmentation methods based on back-translation.
Our analysis demonstrates that G-DAUGC produces a diverse set of fluent training examples, and that its selection and training approaches are important for performance.
arXiv Detail & Related papers (2020-04-24T06:12:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.