Self-supervised Document Clustering Based on BERT with Data Augment
- URL: http://arxiv.org/abs/2011.08523v3
- Date: Fri, 17 Sep 2021 03:18:09 GMT
- Title: Self-supervised Document Clustering Based on BERT with Data Augment
- Authors: Haoxiang Shi and Cen Wang
- Abstract summary: We propose self-supervised contrastive learning (SCL) as well as few-shot contrastive learning (FCL) with unsupervised data augmentation (UDA) for text clustering.
SCL outperforms state-of-the-art unsupervised clustering approaches for short texts and those for long texts in terms of several clustering evaluation measures.
FCL achieves performance close to supervised learning, and FCL with UDA further improves the performance for short texts.
- Score: 1.0152838128195467
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive learning is a promising approach to unsupervised learning, as it
inherits the advantages of well-studied deep models without a dedicated and
complex model design. In this paper, based on bidirectional encoder
representations from transformers, we propose self-supervised contrastive
learning (SCL) as well as few-shot contrastive learning (FCL) with unsupervised
data augmentation (UDA) for text clustering. SCL outperforms state-of-the-art
unsupervised clustering approaches for short texts and those for long texts in
terms of several clustering evaluation measures. FCL achieves performance close
to supervised learning, and FCL with UDA further improves the performance for
short texts.
Related papers
- Semantic Consistency Regularization with Large Language Models for Semi-supervised Sentiment Analysis [20.503153899462323]
We propose a framework for semi-supervised sentiment analysis.
We introduce two prompting strategies to semantically enhance unlabeled text.
Experiments show our method achieves remarkable performance over prior semi-supervised methods.
arXiv Detail & Related papers (2025-01-29T12:03:11Z) - Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction [52.09472099976885]
IAR is an Improved AutoRegressive Visual Generation Method.
We propose a Codebook Rearrangement strategy that uses balanced k-means clustering algorithm.
We also propose a Cluster-oriented Cross-entropy Loss that guides the model to correctly predict the cluster where the token is located.
arXiv Detail & Related papers (2025-01-01T15:58:51Z) - Revisiting Self-Supervised Heterogeneous Graph Learning from Spectral Clustering Perspective [52.662463893268225]
Self-supervised heterogeneous graph learning (SHGL) has shown promising potential in diverse scenarios.
Existing SHGL methods encounter two significant limitations.
We introduce a novel framework enhanced by rank and dual consistency constraints.
arXiv Detail & Related papers (2024-12-01T09:33:20Z) - Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation [21.20806568508201]
We show how to leverage class text information to mitigate distribution drifts encountered by vision-language models (VLMs) during test-time inference.
We propose to generate pseudo-labels for the test-time samples by exploiting generic class text embeddings as fixed centroids of a label assignment problem.
Experiments on multiple popular test-time adaptation benchmarks presenting diverse complexity empirically show the superiority of CLIP-OT.
arXiv Detail & Related papers (2024-11-26T00:15:37Z) - Text Clustering with Large Language Model Embeddings [0.0]
The effectiveness of text clustering largely depends on the selection of textual embeddings and clustering algorithms.
Recent advancements in large language models (LLMs) have the potential to enhance this task.
Findings indicate that LLM embeddings are superior at capturing subtleties in structured language.
arXiv Detail & Related papers (2024-03-22T11:08:48Z) - MA2CL:Masked Attentive Contrastive Learning for Multi-Agent
Reinforcement Learning [128.19212716007794]
We propose an effective framework called textbfMulti-textbfAgent textbfMasked textbfAttentive textbfContrastive textbfLearning (MA2CL)
MA2CL encourages learning representation to be both temporal and agent-level predictive by reconstructing the masked agent observation in latent space.
Our method significantly improves the performance and sample efficiency of different MARL algorithms and outperforms other methods in various vision-based and state-based scenarios.
arXiv Detail & Related papers (2023-06-03T05:32:19Z) - Using Representation Expressiveness and Learnability to Evaluate
Self-Supervised Learning Methods [61.49061000562676]
We introduce Cluster Learnability (CL) to assess learnability.
CL is measured in terms of the performance of a KNN trained to predict labels obtained by clustering the representations with K-means.
We find that CL better correlates with in-distribution model performance than other competing recent evaluation schemes.
arXiv Detail & Related papers (2022-06-02T19:05:13Z) - Revisiting LSTM Networks for Semi-Supervised Text Classification via
Mixed Objective Function [106.69643619725652]
We develop a training strategy that allows even a simple BiLSTM model, when trained with cross-entropy loss, to achieve competitive results.
We report state-of-the-art results for text classification task on several benchmark datasets.
arXiv Detail & Related papers (2020-09-08T21:55:22Z) - Prototypical Contrastive Learning of Unsupervised Representations [171.3046900127166]
Prototypical Contrastive Learning (PCL) is an unsupervised representation learning method.
PCL implicitly encodes semantic structures of the data into the learned embedding space.
PCL outperforms state-of-the-art instance-wise contrastive learning methods on multiple benchmarks.
arXiv Detail & Related papers (2020-05-11T09:53:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.