Conditional Semi-Supervised Data Augmentation for Spam Message Detection with Low Resource Data
- URL: http://arxiv.org/abs/2407.04990v1
- Date: Sat, 6 Jul 2024 07:51:24 GMT
- Title: Conditional Semi-Supervised Data Augmentation for Spam Message Detection with Low Resource Data
- Authors: Ulin Nuha, Chih-Hsueh Lin,
- Abstract summary: We propose a conditional semi-supervised data augmentation for a spam detection model lacking the availability of data.
We exploit unlabeled data for data augmentation to extend training data.
Latent variables can come from labeled and unlabeled data as the input for the final classifier.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several machine learning schemes have attempted to perform the detection of spam messages. However, those schemes mostly require a huge amount of labeled data. The existing techniques addressing the lack of data availability have issues with effectiveness and robustness. Therefore, this paper proposes a conditional semi-supervised data augmentation (CSSDA) for a spam detection model lacking the availability of data. The main architecture of CSSDA comprises feature extraction and enhanced generative network. Here, we exploit unlabeled data for data augmentation to extend training data. The enhanced generative in our proposed scheme produces latent variables as fake samples from unlabeled data through a conditional scheme. Latent variables can come from labeled and unlabeled data as the input for the final classifier in our spam detection model. The experimental results indicate that our proposed CSSDA achieves excellent results compared to several related methods both exploiting unlabeled data and not. In the experiment stage with various amounts of unlabeled data, CSSDA is the only robust model that obtains a balanced accuracy of about 85% when the availability of labeled data is large. We also conduct several ablation studies to investigate our proposed scheme in detail. The result also shows that several ablation studies strengthen our proposed innovations. These experiments indicate that unlabeled data has a significant contribution to data augmentation using the conditional semi-supervised scheme for spam detection.
Related papers
- Deep Active Learning with Manifold-preserving Trajectory Sampling [2.0717982775472206]
Active learning (AL) is for optimizing the selection of unlabeled data for annotation (labeling)
Existing deep AL methods arguably suffer from bias incurred by clabeled data, which takes a much lower percentage than unlabeled data in AL context.
We propose a novel method, namely Manifold-Preserving Trajectory Sampling (MPTS), aiming to enforce the feature space learned from labeled data to represent a more accurate manifold.
arXiv Detail & Related papers (2024-10-21T03:04:09Z) - Continuous Contrastive Learning for Long-Tailed Semi-Supervised Recognition [50.61991746981703]
Current state-of-the-art LTSSL approaches rely on high-quality pseudo-labels for large-scale unlabeled data.
This paper introduces a novel probabilistic framework that unifies various recent proposals in long-tail learning.
We introduce a continuous contrastive learning method, CCL, extending our framework to unlabeled data using reliable and smoothed pseudo-labels.
arXiv Detail & Related papers (2024-10-08T15:06:10Z) - Empowering HWNs with Efficient Data Labeling: A Clustered Federated
Semi-Supervised Learning Approach [2.046985601687158]
Clustered Federated Multitask Learning (CFL) has gained considerable attention as an effective strategy for overcoming statistical challenges.
We introduce a novel framework, Clustered Federated Semi-Supervised Learning (CFSL), designed for more realistic HWN scenarios.
Our results demonstrate that CFSL significantly improves upon key metrics such as testing accuracy, labeling accuracy, and labeling latency under varying proportions of labeled and unlabeled data.
arXiv Detail & Related papers (2024-01-19T11:47:49Z) - Semi-Supervised Object Detection with Uncurated Unlabeled Data for
Remote Sensing Images [16.660668160785615]
Semi-supervised object detection (SSOD) methods tackle this issue by generating pseudo-labels for the unlabeled data.
However, real-world situations introduce the possibility of out-of-distribution (OOD) samples being mixed with in-distribution (ID) samples within the unlabeled dataset.
We propose Open-Set Semi-Supervised Object Detection (OSSOD) on uncurated unlabeled data.
arXiv Detail & Related papers (2023-10-09T07:59:31Z) - Are labels informative in semi-supervised learning? -- Estimating and
leveraging the missing-data mechanism [4.675583319625962]
Semi-supervised learning is a powerful technique for leveraging unlabeled data to improve machine learning models.
It can be affected by the presence of informative'' labels, which occur when some classes are more likely to be labeled than others.
We propose a novel approach to address this issue by estimating the missing-data mechanism and using inverse propensity weighting to debias any SSL algorithm.
arXiv Detail & Related papers (2023-02-15T09:18:46Z) - Exploiting Mixed Unlabeled Data for Detecting Samples of Seen and Unseen
Out-of-Distribution Classes [5.623232537411766]
Out-of-Distribution (OOD) detection is essential in real-world applications, which has attracted increasing attention in recent years.
Most existing OOD detection methods require many labeled In-Distribution (ID) data, causing a heavy labeling cost.
In this paper, we focus on the more realistic scenario, where limited labeled data and abundant unlabeled data are available.
We propose the Adaptive In-Out-aware Learning (AIOL) method, in which we adaptively select potential ID and OOD samples from the mixed unlabeled data.
arXiv Detail & Related papers (2022-10-13T08:34:25Z) - Prompt-driven efficient Open-set Semi-supervised Learning [52.30303262499391]
Open-set semi-supervised learning (OSSL) has attracted growing interest, which investigates a more practical scenario where out-of-distribution (OOD) samples are only contained in unlabeled data.
We propose a prompt-driven efficient OSSL framework, called OpenPrompt, which can propagate class information from labeled to unlabeled data with only a small number of trainable parameters.
arXiv Detail & Related papers (2022-09-28T16:25:08Z) - ADT-SSL: Adaptive Dual-Threshold for Semi-Supervised Learning [68.53717108812297]
Semi-Supervised Learning (SSL) has advanced classification tasks by inputting both labeled and unlabeled data to train a model jointly.
This paper proposes an Adaptive Dual-Threshold method for Semi-Supervised Learning (ADT-SSL)
Experimental results show that the proposed ADT-SSL achieves state-of-the-art classification accuracy.
arXiv Detail & Related papers (2022-05-21T11:52:08Z) - Self-Trained One-class Classification for Unsupervised Anomaly Detection [56.35424872736276]
Anomaly detection (AD) has various applications across domains, from manufacturing to healthcare.
In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples.
To tackle this problem, we build a robust one-class classification framework via data refinement.
We show that our method outperforms state-of-the-art one-class classification method by 6.3 AUC and 12.5 average precision.
arXiv Detail & Related papers (2021-06-11T01:36:08Z) - Negative Data Augmentation [127.28042046152954]
We show that negative data augmentation samples provide information on the support of the data distribution.
We introduce a new GAN training objective where we use NDA as an additional source of synthetic data for the discriminator.
Empirically, models trained with our method achieve improved conditional/unconditional image generation along with improved anomaly detection capabilities.
arXiv Detail & Related papers (2021-02-09T20:28:35Z) - Multi-Task Curriculum Framework for Open-Set Semi-Supervised Learning [54.85397562961903]
Semi-supervised learning (SSL) has been proposed to leverage unlabeled data for training powerful models when only limited labeled data is available.
We address a more complex novel scenario named open-set SSL, where out-of-distribution (OOD) samples are contained in unlabeled data.
Our method achieves state-of-the-art results by successfully eliminating the effect of OOD samples.
arXiv Detail & Related papers (2020-07-22T10:33:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.