Robust Representation Learning with Reliable Pseudo-labels Generation
via Self-Adaptive Optimal Transport for Short Text Clustering
- URL: http://arxiv.org/abs/2305.16335v1
- Date: Tue, 23 May 2023 12:43:40 GMT
- Title: Robust Representation Learning with Reliable Pseudo-labels Generation
via Self-Adaptive Optimal Transport for Short Text Clustering
- Authors: Xiaolin Zheng, Mengling Hu, Weiming Liu, Chaochao Chen, and Xinting
Liao
- Abstract summary: We propose a Robust Short Text Clustering model to improve robustness against imbalanced and noisy data.
To improve robustness against the noise in data, we introduce both class-wise and instance-wise contrastive learning.
Our empirical studies on eight short text clustering datasets demonstrate that RSTC significantly outperforms the state-of-the-art models.
- Score: 13.83404821252712
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Short text clustering is challenging since it takes imbalanced and noisy data
as inputs. Existing approaches cannot solve this problem well, since (1) they
are prone to degenerate solutions, especially on heavily imbalanced
datasets, and (2) they are vulnerable to noise. To tackle the above issues, we
propose a Robust Short Text Clustering (RSTC) model to improve robustness
against imbalanced and noisy data. RSTC includes two modules, i.e., a
pseudo-label generation module and a robust representation learning module. The
former generates pseudo-labels to provide supervision for the latter, which
contributes to more robust representations and correctly separated clusters. To
provide robustness against the imbalance in data, we propose self-adaptive
optimal transport in the pseudo-label generation module. To improve robustness
against the noise in data, we further introduce both class-wise and
instance-wise contrastive learning in the robust representation learning
module. Our empirical studies on eight short text clustering datasets
demonstrate that RSTC significantly outperforms the state-of-the-art models.
The code is available at: https://github.com/hmllmh/RSTC.
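As a rough illustration of the pseudo-label generation idea, the sketch below assigns pseudo-labels with plain entropic-regularized optimal transport (Sinkhorn-Knopp iterations). It is not the RSTC implementation: the abstract does not give the self-adaptive formulation, so this version assumes a fixed class marginal (uniform by default). The function name `sinkhorn_pseudo_labels` and the parameters `class_marginal`, `epsilon`, and `n_iters` are hypothetical names chosen for this example.

```python
import numpy as np


def sinkhorn_pseudo_labels(probs, class_marginal=None, epsilon=0.1, n_iters=100):
    """Assign pseudo-labels by entropic-regularized optimal transport.

    probs: (N, K) array of predicted cluster probabilities (rows sum to 1).
    class_marginal: (K,) target share of samples per cluster; uniform if None.
    Returns (labels, plan): hard labels of shape (N,) and the (N, K) transport plan.
    """
    probs = np.asarray(probs, dtype=np.float64)
    n, k = probs.shape
    if class_marginal is None:
        class_marginal = np.full(k, 1.0 / k)  # assumption: balanced clusters
    sample_marginal = np.full(n, 1.0 / n)     # every sample carries equal mass

    # Cost is -log(probs), so the Gibbs kernel exp(-cost / epsilon)
    # equals probs ** (1 / epsilon).
    kernel = np.power(probs + 1e-12, 1.0 / epsilon)

    u = np.ones(n)
    v = np.ones(k)
    for _ in range(n_iters):
        u = sample_marginal / (kernel @ v)    # rescale rows toward the sample marginal
        v = class_marginal / (kernel.T @ u)   # rescale columns toward the class marginal

    plan = u[:, None] * kernel * v[None, :]
    return plan.argmax(axis=1), plan
```

Because the column constraint forces every cluster to receive its share of mass, the assignments cannot all collapse into one cluster, which is the degenerate solution the abstract mentions. With a fixed uniform marginal this sketch would over-balance heavily imbalanced data; passing a non-uniform `class_marginal` is one way to experiment with that, while RSTC's self-adaptive variant is described in the paper and repository.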
Related papers
- Conformal-in-the-Loop for Learning with Imbalanced Noisy Data [5.69777817429044]
Class imbalance and label noise are pervasive in large-scale datasets.
Much of machine learning research assumes well-labeled, balanced data, which rarely reflects real world conditions.
We propose Conformal-in-the-Loop (CitL), a novel training framework that addresses both challenges with a conformal prediction-based approach.
arXiv Detail & Related papers (2024-11-04T17:09:58Z)
- Contrastive and Consistency Learning for Neural Noisy-Channel Model in Spoken Language Understanding [1.07288078404291]
We propose a natural language understanding approach based on Automatic Speech Recognition (ASR).
We improve a noisy-channel model to handle transcription inconsistencies caused by ASR errors.
Experiments on four benchmark datasets show that Contrastive and Consistency Learning (CCL) outperforms existing methods.
arXiv Detail & Related papers (2024-05-23T23:10:23Z)
- RoNID: New Intent Discovery with Generated-Reliable Labels and Cluster-friendly Representations [27.775731666470175]
New Intent Discovery (NID) aims to identify novel intent groups in the open-world scenario.
Current methods face issues with inaccurate pseudo-labels and poor representation learning.
We propose a Robust New Intent Discovery framework optimized by an EM-style method.
arXiv Detail & Related papers (2024-04-13T11:58:28Z)
- Match me if you can: Semi-Supervised Semantic Correspondence Learning with Unpaired Images [76.47980643420375]
This paper builds on the hypothesis that there is an inherent data-hungry matter in learning semantic correspondences.
We demonstrate that a simple machine annotator reliably enriches paired key points via machine supervision.
Our models surpass current state-of-the-art models on semantic correspondence learning benchmarks like SPair-71k, PF-PASCAL, and PF-WILLOW.
arXiv Detail & Related papers (2023-11-30T13:22:15Z)
- Benchmarking the Robustness of LiDAR Semantic Segmentation Models [78.6597530416523]
In this paper, we aim to comprehensively analyze the robustness of LiDAR semantic segmentation models under various corruptions.
We propose a new benchmark called SemanticKITTI-C, which features 16 out-of-domain LiDAR corruptions in three groups, namely adverse weather, measurement noise and cross-device discrepancy.
We design a robust LiDAR segmentation model (RLSeg) which greatly boosts the robustness with simple but effective modifications.
arXiv Detail & Related papers (2023-01-03T06:47:31Z)
- Adversarial Dual-Student with Differentiable Spatial Warping for Semi-Supervised Semantic Segmentation [70.2166826794421]
We propose a differentiable geometric warping to conduct unsupervised data augmentation.
We also propose a novel adversarial dual-student framework to improve the Mean-Teacher.
Our solution significantly improves the performance and state-of-the-art results are achieved on both datasets.
arXiv Detail & Related papers (2022-03-05T17:36:17Z)
- Meta Clustering Learning for Large-scale Unsupervised Person Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL).
MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training.
Our method significantly saves computational cost while achieving a comparable or even better performance compared to prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z)
- Contrastive Self-supervised Sequential Recommendation with Robust Augmentation [101.25762166231904]
Sequential Recommendation describes a set of techniques to model dynamic user behavior in order to predict future interactions in sequential user data.
Old and new issues remain, including data-sparsity and noisy data.
We propose Contrastive Self-Supervised Learning for sequential Recommendation (CoSeRec).
arXiv Detail & Related papers (2021-08-14T07:15:25Z)
- BiSTF: Bilateral-Branch Self-Training Framework for Semi-Supervised Large-scale Fine-Grained Recognition [28.06659482245647]
Semi-supervised Fine-Grained Recognition is a challenging task due to data imbalance, high interclass similarity and domain mismatch.
We propose the Bilateral-Branch Self-Training Framework (BiSTF) to improve existing semi-balanced and domain-shifted fine-grained data.
We show BiSTF outperforms the existing state-of-the-art SSL on the Semi-iNat dataset.
arXiv Detail & Related papers (2021-07-14T15:28:54Z)
- Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition [55.362258027878966]
We present momentum pseudo-labeling (MPL) as a simple yet effective strategy for semi-supervised speech recognition.
MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method.
The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios.
arXiv Detail & Related papers (2021-06-16T16:24:55Z)
- ANL: Anti-Noise Learning for Cross-Domain Person Re-Identification [25.035093667770052]
We propose an Anti-Noise Learning (ANL) approach, which contains two modules.
The FDA module is designed to gather id-related samples and disperse id-unrelated samples through camera-wise contrastive learning and adversarial adaptation.
The Reliable Sample Selection (RSS) module utilizes an Auxiliary Model to correct noisy labels and select reliable samples for the Main Model.
arXiv Detail & Related papers (2020-12-27T02:38:45Z)