CMLM-CSE: Based on Conditional MLM Contrastive Learning for Sentence
Embeddings
- URL: http://arxiv.org/abs/2306.09594v1
- Date: Fri, 16 Jun 2023 02:39:45 GMT
- Title: CMLM-CSE: Based on Conditional MLM Contrastive Learning for Sentence
Embeddings
- Authors: Wei Zhang, Xu Chen
- Abstract summary: We propose CMLM-CSE, an unsupervised contrastive learning framework based on conditional MLM.
An auxiliary network is added that integrates the sentence embedding into an MLM task, forcing the sentence embedding to learn more masked-word information.
With BERT-base as the pretrained language model, CMLM-CSE exceeds SimCSE by 0.55 percentage points on average on textual similarity tasks; with RoBERTa-base, it exceeds SimCSE by 0.3 percentage points on average.
- Score: 16.592691470405683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional contrastive-learning sentence embedding directly uses
the encoder to extract sentence features, which are then passed to a
contrastive loss function for learning. However, this approach pays too much
attention to the sentence as a whole and ignores the influence that
individual words have on the sentence's semantics. To this end, we propose
CMLM-CSE, an unsupervised contrastive learning framework based on conditional
MLM. On top of traditional contrastive learning, an auxiliary network is
added that integrates the sentence embedding into an MLM task, forcing the
sentence embedding to learn more masked-word information. Finally, with
BERT-base as the pretrained language model, we exceed SimCSE by 0.55
percentage points on average on textual similarity tasks, and with
RoBERTa-base we exceed SimCSE by 0.3 percentage points on average.
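The paper above describes the architecture only in prose, so the following is a minimal, hypothetical PyTorch sketch of a CMLM-CSE-style training step: a SimCSE-style dropout-based contrastive loss plus an auxiliary network that must solve MLM conditioned on the sentence embedding. The pooling choice, auxiliary-network shape, masking rate, and loss weight `lambda_mlm` are assumptions, not the authors' exact design.

```python
# Hypothetical sketch of a CMLM-CSE-style objective; details are assumptions.
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
encoder.train()  # dropout must be active: it creates the two contrastive views
hidden, vocab = encoder.config.hidden_size, encoder.config.vocab_size

# Assumed auxiliary network: a shallow Transformer that reads the masked token
# embeddings with the sentence embedding added at every position, so the MLM
# task is only solvable if the sentence embedding is informative.
aux_layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=12, batch_first=True)
aux_net = nn.TransformerEncoder(aux_layer, num_layers=2)
mlm_head = nn.Linear(hidden, vocab)

def embed(batch):
    return encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling, as in SimCSE

def train_step(sentences, lambda_mlm=0.1, temperature=0.05, mask_rate=0.15):
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

    # 1) Unsupervised contrastive loss (SimCSE-style): two forward passes of the
    #    same batch; dropout noise makes the two views of each sentence differ.
    z1, z2 = embed(batch), embed(batch)
    sim = F.cosine_similarity(z1.unsqueeze(1), z2.unsqueeze(0), dim=-1) / temperature
    loss_cl = F.cross_entropy(sim, torch.arange(sim.size(0)))

    # 2) Conditional MLM loss: mask random tokens, then ask the auxiliary network
    #    to recover them from token embeddings plus the sentence embedding z1.
    ids = batch["input_ids"].clone()
    maskable = (batch["attention_mask"].bool()
                & (ids != tokenizer.cls_token_id) & (ids != tokenizer.sep_token_id))
    masked = maskable & (torch.rand_like(ids, dtype=torch.float) < mask_rate)
    mlm_labels = ids.masked_fill(~masked, -100)  # ignore unmasked positions
    ids[masked] = tokenizer.mask_token_id

    tok_emb = encoder.embeddings(input_ids=ids)  # shallow lookup, not full encoder
    logits = mlm_head(aux_net(tok_emb + z1.unsqueeze(1)))
    loss_mlm = (F.cross_entropy(logits.reshape(-1, vocab), mlm_labels.reshape(-1))
                if masked.any() else torch.tensor(0.0))

    return loss_cl + lambda_mlm * loss_mlm

loss = train_step(["a dog runs in the park", "markets fell sharply today"])
loss.backward()
```

The design point this sketch is meant to illustrate: the auxiliary network never sees the full encoder output, only shallow token embeddings plus the sentence embedding, so the MLM gradient has to push masked-word information into the sentence embedding itself.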
Related papers
- Which Syntactic Capabilities Are Statistically Learned by Masked
Language Models for Code? [51.29970742152668]
We highlight that relying on accuracy-based measurements may lead to an overestimation of models' capabilities.
To address these issues, we introduce a technique called SyntaxEval for probing the syntactic capabilities of masked language models.
arXiv Detail & Related papers (2024-01-03T02:44:02Z)
- Token Prediction as Implicit Classification to Identify LLM-Generated
Text [37.89852204279844]
This paper introduces a novel approach for identifying the possible large language models (LLMs) involved in text generation.
Instead of adding an additional classification layer to a base LM, we reframe the classification task as a next-token prediction task.
We utilize the Text-to-Text Transfer Transformer (T5) model as the backbone for our experiments (a sketch of this reframing appears at the end of this page).
arXiv Detail & Related papers (2023-11-15T06:33:52Z)
- Large Language Models can Contrastively Refine their Generation for Better Sentence Representation Learning [57.74233319453229]
Large language models (LLMs) have emerged as a groundbreaking technology and their unparalleled text generation capabilities have sparked interest in their application to the fundamental sentence representation learning task.
We propose MultiCSR, a multi-level contrastive sentence representation learning framework that decomposes the process of prompting LLMs to generate a corpus.
Our experiments reveal that MultiCSR enables a less advanced LLM to surpass the performance of ChatGPT, while applying it to ChatGPT achieves better state-of-the-art results.
arXiv Detail & Related papers (2023-10-17T03:21:43Z)
- Instance Smoothed Contrastive Learning for Unsupervised Sentence
Embedding [16.598732694215137]
We propose IS-CSE (instance smoothing contrastive sentence embedding) to smooth the boundaries of embeddings in the feature space.
We evaluate our method on standard semantic textual similarity (STS) tasks and achieve average Spearman's correlations of 78.30%, 79.47%, 77.73%, and 79.42% (the Spearman-based STS evaluation is sketched after this list).
arXiv Detail & Related papers (2023-05-12T12:46:13Z)
- Alleviating Over-smoothing for Unsupervised Sentence Representation [96.19497378628594]
We present a simple method named Self-Contrastive Learning (SSCL) to alleviate the over-smoothing issue.
Our proposed method is quite simple and can be easily extended to various state-of-the-art models for performance boosting.
arXiv Detail & Related papers (2023-05-09T11:00:02Z)
- Relational Sentence Embedding for Flexible Semantic Matching [86.21393054423355]
We present Relational Sentence Embedding (RSE), a new paradigm for further exploring the potential of sentence embeddings.
RSE is effective and flexible in modeling sentence relations and outperforms a series of state-of-the-art embedding methods.
arXiv Detail & Related papers (2022-12-17T05:25:17Z)
- InfoCSE: Information-aggregated Contrastive Learning of Sentence
Embeddings [61.77760317554826]
This paper proposes an information-aggregated contrastive learning framework for learning unsupervised sentence embeddings, termed InfoCSE.
We evaluate the proposed InfoCSE on several benchmark datasets on the semantic textual similarity (STS) task.
Experimental results show that InfoCSE outperforms SimCSE by an average Spearman correlation of 2.60% on BERT-base, and 1.77% on BERT-large.
arXiv Detail & Related papers (2022-10-08T15:53:19Z)
- Frustratingly Simple Pretraining Alternatives to Masked Language
Modeling [10.732163031244651]
Masked language modeling (MLM) is widely used in natural language processing for learning text representations.
In this paper, we explore five simple pretraining objectives based on token-level classification tasks as replacements of MLM.
arXiv Detail & Related papers (2021-09-04T08:52:37Z)
- Universal Sentence Representation Learning with Conditional Masked
Language Model [7.334766841801749]
We present Conditional Masked Language Modeling (CMLM) to effectively learn sentence representations.
Our English CMLM model achieves state-of-the-art performance on SentEval.
As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains.
arXiv Detail & Related papers (2020-12-28T18:06:37Z)
- Boosting Few-Shot Learning With Adaptive Margin Loss [109.03665126222619]
This paper proposes an adaptive margin principle to improve the generalization ability of metric-based meta-learning approaches for few-shot learning problems.
Extensive experiments demonstrate that the proposed method can boost the performance of current metric-based meta-learning approaches.
arXiv Detail & Related papers (2020-05-28T07:58:41Z)
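Several entries above (IS-CSE, InfoCSE) quote average Spearman's correlation on STS benchmarks. As referenced from the IS-CSE entry, the sketch below shows how that number is typically computed: each sentence pair is scored by the cosine similarity of its embeddings, then rank-correlated against the human gold scores. The `embed` argument is a placeholder for any encoder discussed on this page, and the toy embedder is purely illustrative.

```python
# Typical STS evaluation: cosine similarity per pair, Spearman against gold.
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(pairs, gold_scores, embed):
    """pairs: list of (sent_a, sent_b); gold_scores: human ratings (e.g. 0-5)."""
    preds = []
    for a, b in pairs:
        va, vb = embed(a), embed(b)
        preds.append(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return spearmanr(preds, gold_scores).correlation

# Toy stand-in embedder (bag of characters) just to make the sketch runnable.
toy_embed = lambda s: np.bincount([ord(c) % 64 for c in s], minlength=64).astype(float)
pairs = [("a man is playing guitar", "a person plays a guitar"),
         ("a dog is running", "stock prices fell sharply"),
         ("two kids play soccer", "children are playing football")]
print(sts_spearman(pairs, gold_scores=[4.8, 0.2, 4.5], embed=toy_embed))
```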
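Finally, as promised in the "Token Prediction as Implicit Classification" entry above, here is a hedged sketch of reframing classification as next-token prediction with T5: instead of adding a classification head, the model emits a label word, and at inference only the label-word logits are compared. The prompt and label words below are hypothetical, not the paper's.

```python
# Classification as next-token prediction with T5 (illustrative labels/prompt).
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

label_words = ["human", "machine"]  # hypothetical label vocabulary
label_ids = [tokenizer(w, add_special_tokens=False).input_ids[0] for w in label_words]

def classify(text):
    enc = tokenizer("classify source: " + text, return_tensors="pt")
    # The decoder starts from its start token; read the logits of the first
    # generated position and compare only the label-word entries.
    start = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**enc, decoder_input_ids=start).logits[0, -1]
    return label_words[int(logits[label_ids].argmax())]

print(classify("The quick brown fox jumps over the lazy dog."))
```

At training time the same model would simply be fine-tuned with the label word as the one-token target sequence, so no extra classification layer is ever added.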