Fake detection in imbalance dataset by Semi-supervised learning with GAN
- URL: http://arxiv.org/abs/2212.01071v5
- Date: Wed, 20 Dec 2023 08:18:14 GMT
- Title: Fake detection in imbalance dataset by Semi-supervised learning with GAN
- Authors: Jinus Bordbar, Saman Ardalan, Mohammadreza Mohammadrezaie, Zahra
Ghasemi
- Abstract summary: Our study contributes to the field by achieving an 81% accuracy in detecting fake accounts using only 100 labeled samples.
This demonstrates the potential of SGAN as a powerful tool for handling minority classes and addressing big data challenges in fake account detection.
- Score: 1.4542411354617986
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As social media continues to grow rapidly, the prevalence of harassment on
these platforms has also increased. This has piqued the interest of researchers
in the field of fake detection. Social media data, often forms complex graphs
with numerous nodes, posing several challenges. These challenges and
limitations include dealing with a significant amount of irrelevant features in
matrices and addressing issues such as high data dispersion and an imbalanced
class distribution within the dataset. To overcome these challenges and
limitations, researchers have employed auto-encoders and a combination of
semi-supervised learning with a GAN algorithm, referred to as SGAN. Our
proposed method utilizes auto-encoders for feature extraction and incorporates
SGAN. By leveraging an unlabeled dataset, the unsupervised layer of SGAN
compensates for the limited availability of labeled data, making efficient use
of the limited number of labeled instances. Multiple evaluation metrics were
employed, including the Confusion Matrix and the ROC curve. The dataset was
divided into training and testing sets, with 100 labeled samples for training
and 1,000 samples for testing. The novelty of our research lies in applying
SGAN to address the issue of imbalanced datasets in fake account detection. By
optimizing the use of a smaller number of labeled instances and reducing the
need for extensive computational power, our method offers a more efficient
solution. Additionally, our study contributes to the field by achieving an 81%
accuracy in detecting fake accounts using only 100 labeled samples. This
demonstrates the potential of SGAN as a powerful tool for handling minority
classes and addressing big data challenges in fake account detection.
Related papers
- Conditional Semi-Supervised Data Augmentation for Spam Message Detection with Low Resource Data [0.0]
We propose a conditional semi-supervised data augmentation for a spam detection model lacking the availability of data.
We exploit unlabeled data for data augmentation to extend training data.
Latent variables can come from labeled and unlabeled data as the input for the final classifier.
arXiv Detail & Related papers (2024-07-06T07:51:24Z) - Advancing Anomaly Detection: Non-Semantic Financial Data Encoding with LLMs [49.57641083688934]
We introduce a novel approach to anomaly detection in financial data using Large Language Models (LLMs) embeddings.
Our experiments demonstrate that LLMs contribute valuable information to anomaly detection as our models outperform the baselines.
arXiv Detail & Related papers (2024-06-05T20:19:09Z) - Empowering HWNs with Efficient Data Labeling: A Clustered Federated
Semi-Supervised Learning Approach [2.046985601687158]
Clustered Federated Multitask Learning (CFL) has gained considerable attention as an effective strategy for overcoming statistical challenges.
We introduce a novel framework, Clustered Federated Semi-Supervised Learning (CFSL), designed for more realistic HWN scenarios.
Our results demonstrate that CFSL significantly improves upon key metrics such as testing accuracy, labeling accuracy, and labeling latency under varying proportions of labeled and unlabeled data.
arXiv Detail & Related papers (2024-01-19T11:47:49Z) - Tackling Diverse Minorities in Imbalanced Classification [80.78227787608714]
Imbalanced datasets are commonly observed in various real-world applications, presenting significant challenges in training classifiers.
We propose generating synthetic samples iteratively by mixing data samples from both minority and majority classes.
We demonstrate the effectiveness of our proposed framework through extensive experiments conducted on seven publicly available benchmark datasets.
arXiv Detail & Related papers (2023-08-28T18:48:34Z) - Semi-supervised binary classification with latent distance learning [0.0]
We propose a new learning representation to solve the binary classification problem using a few labels with a random k-pair cross-distance learning mechanism.
With few labels and without any data augmentation techniques, the proposed method outperformed state-of-the-art semi-supervised and self-supervised learning methods.
arXiv Detail & Related papers (2022-11-28T09:05:26Z) - Is margin all you need? An extensive empirical study of active learning
on tabular data [66.18464006872345]
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including current state-of-art.
arXiv Detail & Related papers (2022-10-07T21:18:24Z) - Collaborative Intelligence Orchestration: Inconsistency-Based Fusion of
Semi-Supervised Learning and Active Learning [60.26659373318915]
Active learning (AL) and semi-supervised learning (SSL) are two effective, but often isolated, means to alleviate the data-hungry problem.
We propose an innovative Inconsistency-based virtual aDvErial algorithm to further investigate SSL-AL's potential superiority.
Two real-world case studies visualize the practical industrial value of applying and deploying the proposed data sampling algorithm.
arXiv Detail & Related papers (2022-06-07T13:28:43Z) - Scale-Equivalent Distillation for Semi-Supervised Object Detection [57.59525453301374]
Recent Semi-Supervised Object Detection (SS-OD) methods are mainly based on self-training, generating hard pseudo-labels by a teacher model on unlabeled data as supervisory signals.
We analyze the challenges these methods meet with the empirical experiment results.
We introduce a novel approach, Scale-Equivalent Distillation (SED), which is a simple yet effective end-to-end knowledge distillation framework robust to large object size variance and class imbalance.
arXiv Detail & Related papers (2022-03-23T07:33:37Z) - Learning to Count in the Crowd from Limited Labeled Data [109.2954525909007]
We focus on reducing the annotation efforts by learning to count in the crowd from limited number of labeled samples.
Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data.
arXiv Detail & Related papers (2020-07-07T04:17:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.