Related papers: Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach

URL: http://arxiv.org/abs/2512.17367v2
Date: Thu, 25 Dec 2025 01:43:38 GMT
Title: Adversarially Robust Detection of Harmful Online Content: A Computational Design Science Approach
Authors: Yidong Chai, Yi Liu, Mohammadreza Ebrahimi, Weifeng Li, Balaji Padmanabhan,
Abstract summary: Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning models are widely adopted to detect such content. Enhancing adversarial robustness is therefore essential.
Score: 7.918330138372274
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Social media platforms are plagued by harmful content such as hate speech, misinformation, and extremist rhetoric. Machine learning (ML) models are widely adopted to detect such content; however, they remain highly vulnerable to adversarial attacks, wherein malicious users subtly modify text to evade detection. Enhancing adversarial robustness is therefore essential, requiring detectors that can defend against diverse attacks (generalizability) while maintaining high overall accuracy. However, simultaneously achieving both optimal generalizability and accuracy is challenging. Following the computational design science paradigm, this study takes a sequential approach that first proposes a novel framework (Large Language Model-based Sample Generation and Aggregation, LLM-SGA) by identifying the key invariances of textual adversarial attacks and leveraging them to ensure that a detector instantiated within the framework has strong generalizability. Second, we instantiate our detector (Adversarially Robust Harmful Online Content Detector, ARHOCD) with three novel design components to improve detection accuracy: (1) an ensemble of multiple base detectors that exploits their complementary strengths; (2) a novel weight assignment method that dynamically adjusts weights based on each sample's predictability and each base detector's capability, with weights initialized using domain knowledge and updated via Bayesian inference; and (3) a novel adversarial training strategy that iteratively optimizes both the base detectors and the weight assignor. We addressed several limitations of existing adversarial robustness enhancement research and empirically evaluated ARHOCD across three datasets spanning hate speech, rumor, and extremist content. Results show that ARHOCD offers strong generalizability and improves detection accuracy under adversarial conditions.
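
The abstract describes ensemble weights that are initialized from domain knowledge and updated via Bayesian inference, but it gives no formulas. The following is only a minimal sketch of that general idea, not the authors' method: it assumes each base detector's reliability is modeled with a Beta distribution and updated with a conjugate Bernoulli update. The names `BetaWeight` and `ensemble_score`, the prior pseudo-counts, and the example probabilities are all hypothetical. It covers only the per-detector capability part and ignores the per-sample predictability that the paper also uses.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BetaWeight:
    """Beta(alpha, beta) belief about how often one base detector is correct."""
    alpha: float  # pseudo-count of correct predictions (assumed prior)
    beta: float   # pseudo-count of incorrect predictions (assumed prior)

    def mean(self) -> float:
        # Posterior mean, used here as the detector's ensemble weight.
        return self.alpha / (self.alpha + self.beta)

    def update(self, correct: bool) -> None:
        # Conjugate Bayesian update for a single Bernoulli observation.
        if correct:
            self.alpha += 1.0
        else:
            self.beta += 1.0


def ensemble_score(probs: List[float], beliefs: List[BetaWeight]) -> float:
    """Weighted average of the base detectors' harmful-class probabilities."""
    weights = [b.mean() for b in beliefs]
    return sum(w * p for w, p in zip(weights, probs)) / sum(weights)


# Hypothetical priors standing in for "domain knowledge".
beliefs = [BetaWeight(8.0, 2.0), BetaWeight(5.0, 5.0), BetaWeight(2.0, 8.0)]
# Hypothetical harmful-class probabilities from three base detectors.
probs = [0.9, 0.6, 0.2]

before = ensemble_score(probs, beliefs)

# Suppose the true label is "harmful": detectors with p >= 0.5 were correct.
for belief, p in zip(beliefs, probs):
    belief.update(correct=(p >= 0.5))

after = ensemble_score(probs, beliefs)
print(f"score before update: {before:.4f}, after update: {after:.4f}")
```

In this toy run the two detectors that were right gain weight and the one that was wrong loses weight, so the ensemble score for the same sample moves toward the correct label after the update.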

Related papers

Benchmarking Adversarial Robustness and Adversarial Training Strategies for Object Detection [24.70528833663651]
Object detection models are critical components of automated systems, such as autonomous vehicles and perception-based robots. Progress in defending these models lags behind classification, hindered by a lack of standardized evaluation. It is nearly impossible to thoroughly compare attack or defense methods, as existing work uses different datasets, inconsistent efficiency metrics, and varied measures of perturbation cost.
arXiv Detail & Related papers (2026-02-18T14:33:58Z)
Elevating Intrusion Detection and Security Fortification in Intelligent Networks through Cutting-Edge Machine Learning Paradigms [5.706727902661187]
This study proposes a robust multiclass machine learning based intrusion detection framework. It integrates advanced feature selection techniques to identify critical attributes, mitigating redundancy and enhancing detection accuracy. The proposed ensemble architecture achieves superior performance, with an accuracy of 98%, precision of 98%, recall of 98%, and a false positive rate of just 2%.
arXiv Detail & Related papers (2025-12-22T05:14:26Z)
Spoofing-aware Prompt Learning for Unified Physical-Digital Facial Attack Detection [28.74960061024677]
Real-world face recognition systems are vulnerable to both physical presentation attacks (PAs) and digital forgery attacks (DFs). We propose a Spoofing-aware Prompt Learning for Unified Attack Detection (SPL-UAD) framework, which decouples optimization branches for physical and digital attacks in the prompt space. Experiments on the large-scale UniAttackDataPlus dataset demonstrate that the proposed method achieves significant performance improvements in unified attack detection tasks.
arXiv Detail & Related papers (2025-12-06T09:34:39Z)
Modeling the Attack: Detecting AI-Generated Text by Quantifying Adversarial Perturbations [2.7620215077666557]
Modern detectors are notoriously vulnerable to adversarial attacks, with paraphrasing standing out as an effective evasion technique. This paper presents a comparative study of adversarial robustness, first by quantifying the limitations of standard adversarial training. We then introduce a novel, significantly more resilient detection framework: Perturbation-Invariant Feature Engineering.
arXiv Detail & Related papers (2025-09-22T13:03:53Z)
Lie Detector: Unified Backdoor Detection via Cross-Examination Framework [68.45399098884364]
We propose a unified backdoor detection framework in the semi-honest setting. Our method achieves superior detection performance, improving accuracy by 5.4%, 1.6%, and 11.9% over SoTA baselines. Notably, it is the first to effectively detect backdoors in multimodal large language models.
arXiv Detail & Related papers (2025-03-21T06:12:06Z)
Leveraging Mixture of Experts for Improved Speech Deepfake Detection [53.69740463004446]
Speech deepfakes pose a significant threat to personal security and content authenticity. We introduce a novel approach for enhancing speech deepfake detection performance using a Mixture of Experts architecture.
arXiv Detail & Related papers (2024-09-24T13:24:03Z)
Meta Invariance Defense Towards Generalizable Robustness to Unknown Adversarial Attacks [62.036798488144306]
Current defense mainly focuses on the known attacks, but the adversarial robustness to the unknown attacks is seriously overlooked. We propose an attack-agnostic defense method named Meta Invariance Defense (MID). We show that MID simultaneously achieves robustness to the imperceptible adversarial perturbations in high-level image classification and attack-suppression in low-level robust image regeneration.
arXiv Detail & Related papers (2024-04-04T10:10:38Z)
Humanizing Machine-Generated Content: Evading AI-Text Detection through Adversarial Attack [24.954755569786396]
We propose a framework for a broader class of adversarial attacks, designed to perform minor perturbations in machine-generated content to evade detection. We consider two attack settings: white-box and black-box, and employ adversarial learning in dynamic scenarios to assess the potential enhancement of the current detection model's robustness. The empirical results reveal that the current detection models can be compromised in as little as 10 seconds, leading to the misclassification of machine-generated text as human-written content.
arXiv Detail & Related papers (2024-04-02T12:49:22Z)
Enhancing Infrared Small Target Detection Robustness with Bi-Level Adversarial Framework [61.34862133870934]
We propose a bi-level adversarial framework to promote the robustness of detection in the presence of distinct corruptions. Our scheme improves IOU by 21.96% across a wide array of corruptions and by 4.97% on the general benchmark.
arXiv Detail & Related papers (2023-09-03T06:35:07Z)
Doubly Robust Instance-Reweighted Adversarial Training [107.40683655362285]
We propose a novel doubly-robust instance reweighted adversarial framework. Our importance weights are obtained by optimizing the KL-divergence regularized loss function. Our proposed approach outperforms related state-of-the-art baseline methods in terms of average robust performance.
arXiv Detail & Related papers (2023-08-01T06:16:18Z)
Spatial-Frequency Discriminability for Revealing Adversarial Perturbations [53.279716307171604]
Vulnerability of deep neural networks to adversarial perturbations has been widely perceived in the computer vision community. Current algorithms typically detect adversarial patterns through discriminative decomposition for natural and adversarial data. We propose a discriminative detector relying on a spatial-frequency Krawtchouk decomposition.
arXiv Detail & Related papers (2023-05-18T10:18:59Z)
A Comprehensive Study of the Robustness for LiDAR-based 3D Object Detectors against Adversarial Attacks [84.10546708708554]
3D object detectors are increasingly crucial for security-critical tasks. It is imperative to understand their robustness against adversarial attacks. This paper presents the first comprehensive evaluation and analysis of the robustness of LiDAR-based 3D detectors under adversarial attacks.
arXiv Detail & Related papers (2022-12-20T13:09:58Z)

This list is automatically generated from the titles and abstracts of the papers on this site.