GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks
- URL: http://arxiv.org/abs/2507.14679v2
- Date: Thu, 24 Jul 2025 15:46:28 GMT
- Title: GCC-Spam: Spam Detection via GAN, Contrastive Learning, and Character Similarity Networks
- Authors: Zhijie Wang, Zixin Xu, Zhiyuan Pan,
- Abstract summary: We propose a novel spam-text detection framework, GCC-Spam, which integrates three core innovations. A character similarity network captures orthographic and phonetic features to counter character-obfuscation attacks. Contrastive learning enhances discriminability by optimizing the latent-space distance between spam and normal texts. A Generative Adversarial Network (GAN) generates realistic pseudo-spam samples to alleviate data scarcity.
- Score: 2.184092672461171
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The exponential growth of spam text on the Internet necessitates robust detection mechanisms to mitigate risks such as information leakage and social instability. This work addresses two principal challenges: adversarial strategies employed by spammers and the scarcity of labeled data. We propose a novel spam-text detection framework GCC-Spam, which integrates three core innovations. First, a character similarity network captures orthographic and phonetic features to counter character-obfuscation attacks and furthermore produces sentence embeddings for downstream classification. Second, contrastive learning enhances discriminability by optimizing the latent-space distance between spam and normal texts. Third, a Generative Adversarial Network (GAN) generates realistic pseudo-spam samples to alleviate data scarcity while improving model robustness and classification accuracy. Extensive experiments on real-world datasets demonstrate that our model outperforms baseline approaches, achieving higher detection rates with significantly fewer labeled examples.
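The contrastive-learning component described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the loss used in GCC-Spam, so the margin-based pairwise loss below is an assumption chosen for illustration, and the names `pairwise_contrastive_loss`, `margin`, and the toy embeddings are hypothetical. It pulls same-class embeddings together and pushes spam and normal embeddings at least `margin` apart.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pairwise_contrastive_loss(embeddings, labels, margin=1.0):
    """Mean margin-based contrastive loss over all unordered pairs.

    Assumed form (not taken from the paper):
      same label      -> d^2
      different label -> max(0, margin - d)^2
    """
    losses = []
    for i, j in combinations(range(len(embeddings)), 2):
        d = euclidean(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            losses.append(d ** 2)  # pull same-class pairs together
        else:
            losses.append(max(0.0, margin - d) ** 2)  # push classes apart
    return sum(losses) / len(losses) if losses else 0.0


# Hypothetical 2-D sentence embeddings; label 1 = spam, 0 = normal.
labels = [1, 1, 0, 0]
separated = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0]]
mixed = [[0.0, 0.0], [2.0, 2.0], [0.1, 0.0], [2.1, 2.0]]

loss_separated = pairwise_contrastive_loss(separated, labels)
loss_mixed = pairwise_contrastive_loss(mixed, labels)
```

In this toy setup the well-separated embeddings give a much lower loss than the mixed ones, which is the behavior a contrastive objective rewards during training.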
Related papers
- Boosting Bot Detection via Heterophily-Aware Representation Learning and Prototype-Guided Cluster Discovery [16.548403922027248]
BotHP is a generative Graph Self-Supervised Learning framework tailored to boost graph-based bot detectors.
It uses a dual-encoder architecture, consisting of a graph-aware encoder to capture node commonality and a graph-agnostic encoder to preserve node uniqueness.
It consistently boosts graph-based bot detectors, improving detection performance, alleviating label reliance, and enhancing generalization capability.
arXiv Detail & Related papers (2025-06-01T12:44:53Z)
- Hybrid Machine Learning Model for Detecting Bangla Smishing Text Using BERT and Character-Level CNN [0.0]
Smishing attacks have surged by 328%, posing a major threat to mobile users.
Despite its growing prevalence, the issue remains significantly under-addressed.
This paper presents a novel hybrid machine learning model for detecting Bangla smishing texts.
arXiv Detail & Related papers (2025-02-03T16:51:58Z)
- SpaLLM-Guard: Pairing SMS Spam Detection Using Open-source and Commercial LLMs [1.3198171962008958]
We evaluate the potential of large language models (LLMs), both open-source and commercial, for SMS spam detection.
We compare their performance across zero-shot, few-shot, fine-tuning, and chain-of-thought prompting approaches.
Fine-tuning emerges as the most effective strategy, with Mixtral achieving 98.6% accuracy and a balanced false positive and false negative rate below 2%.
arXiv Detail & Related papers (2025-01-09T06:00:08Z)
- Comprehensive Botnet Detection by Mitigating Adversarial Attacks, Navigating the Subtleties of Perturbation Distances and Fortifying Predictions with Conformal Layers [1.6001193161043425]
Botnets are computer networks controlled by malicious actors that present significant cybersecurity challenges.
This research addresses the sophisticated adversarial manipulations that attackers use to undermine machine learning-based botnet detection systems.
We introduce a flow-based detection approach, leveraging machine learning and deep learning algorithms trained on the ISCX and ISOT datasets.
arXiv Detail & Related papers (2024-09-01T08:53:21Z)
- Detecting, Explaining, and Mitigating Memorization in Diffusion Models [49.438362005962375]
We introduce a straightforward yet effective method for detecting memorized prompts by inspecting the magnitude of text-conditional predictions.
Our proposed method seamlessly integrates without disrupting sampling algorithms, and delivers high accuracy even at the first generation step.
Building on our detection strategy, we unveil an explainable approach that shows the contribution of individual words or tokens to memorization.
arXiv Detail & Related papers (2024-07-31T16:13:29Z)
- Token-Level Adversarial Prompt Detection Based on Perplexity Measures and Contextual Information [67.78183175605761]
Large Language Models are susceptible to adversarial prompt attacks.
This vulnerability underscores a significant concern regarding the robustness and reliability of LLMs.
We introduce a novel approach to detecting adversarial prompts at a token level.
arXiv Detail & Related papers (2023-11-20T03:17:21Z)
- Text generation for dataset augmentation in security classification tasks [55.70844429868403]
This study evaluates the application of natural language text generators to fill this data gap in multiple security-related text classification tasks.
We find substantial benefits for GPT-3 data augmentation strategies in situations with severe limitations on known positive-class samples.
arXiv Detail & Related papers (2023-10-22T22:25:14Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- Learning-based Hybrid Local Search for the Hard-label Textual Attack [53.92227690452377]
We consider a rarely investigated but more rigorous setting, namely hard-label attack, in which the attacker could only access the prediction label.
Based on this observation, we propose a novel hard-label attack, called Learning-based Hybrid Local Search (LHLS) algorithm.
Our LHLS significantly outperforms existing hard-label attacks regarding the attack performance as well as adversary quality.
arXiv Detail & Related papers (2022-01-20T14:16:07Z)
- Deep convolutional forest: a dynamic deep ensemble approach for spam detection in text [219.15486286590016]
This paper introduces a dynamic deep ensemble model for spam detection that adjusts its complexity and extracts features automatically.
As a result, the model achieved high precision, recall, F1-score, and accuracy of 98.38%.
arXiv Detail & Related papers (2021-10-10T17:19:37Z)
- Understanding Self-supervised Learning with Dual Deep Networks [74.92916579635336]
We propose a novel framework to understand contrastive self-supervised learning (SSL) methods that employ dual pairs of deep ReLU networks.
We prove that in each SGD update of SimCLR with various loss functions, the weights at each layer are updated by a covariance operator.
To further study what role the covariance operator plays and which features are learned in such a process, we model data generation and augmentation processes through a hierarchical latent tree model (HLTM).
arXiv Detail & Related papers (2020-10-01T17:51:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.