Not-in-Perspective: Towards Shielding Google's Perspective API Against Adversarial Negation Attacks
- URL: http://arxiv.org/abs/2602.09343v1
- Date: Tue, 10 Feb 2026 02:27:28 GMT
- Title: Not-in-Perspective: Towards Shielding Google's Perspective API Against Adversarial Negation Attacks
- Authors: Michail S. Alexiou, J. Sukarno Mertoguno
- Abstract summary: The rise of cyberbullying has escalated the need for effective ways to monitor and moderate online interactions. Existing automated toxicity detection systems are based on machine or deep learning algorithms. We present a set of formal reasoning-based methodologies that wrap around existing machine learning toxicity detection systems.
- Score: 1.675857332621569
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rise of cyberbullying through toxic comments on social media platforms has escalated the need for effective ways to monitor and moderate online interactions. Existing automated toxicity detection systems are based on machine or deep learning algorithms. However, statistics-based solutions are generally prone to adversarial attacks that contain logic-based modifications, such as negation in phrases and sentences. In that regard, we present a set of formal reasoning-based methodologies that wrap around existing machine learning toxicity detection systems. Acting as both pre-processing and post-processing steps, our formal reasoning wrapper helps alleviate the negation attack problem and significantly improves the accuracy and efficacy of toxicity scoring. We evaluate different variations of our wrapper on multiple machine learning models against a negation adversarial dataset. Experimental results highlight the improvement of hybrid (formal reasoning and machine learning) methods over various purely statistical solutions.
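To make the pre-/post-processing wrapper pattern concrete, below is a minimal Python sketch of a negation-aware shield around a black-box toxicity scorer. It is an illustration, not the paper's method: the cue list NEGATION_CUES, the fixed three-token negation scope, the 50/50 score blend in shielded_score, and the toy_scorer stand-in are all hypothetical choices; the paper's wrapper uses formal reasoning rather than these heuristics.

```python
import re
from typing import Callable

# Hypothetical negation cues; the paper's formal-reasoning rules are
# richer than this illustrative list.
NEGATION_CUES = {"not", "never", "no", "cannot", "nobody", "nothing"}


def negation_flags(text: str) -> list:
    """Flag each token as negated if a cue appears within a short window before it."""
    tokens = re.findall(r"\w+'?\w*", text.lower())
    flags, scope = [], 0
    for tok in tokens:
        if tok in NEGATION_CUES or tok.endswith("n't"):
            scope = 3  # naive fixed-width negation scope (assumption)
            flags.append(False)
        else:
            flags.append(scope > 0)
            scope = max(0, scope - 1)
    return flags


def shielded_score(text: str, score_fn: Callable[[str], float]) -> float:
    """Pre-process: drop tokens inside a negation scope; post-process: blend scores."""
    tokens = re.findall(r"\w+'?\w*", text)
    flags = negation_flags(text)
    affirmative = " ".join(t for t, neg in zip(tokens, flags) if not neg)
    raw = score_fn(text)
    if affirmative != " ".join(tokens):
        # Negation detected: average the raw score with the score of the
        # text minus the negated span (the 50/50 weight is an assumption).
        return 0.5 * raw + 0.5 * score_fn(affirmative)
    return raw


# Toy lexicon scorer standing in for an ML model or a Perspective API call.
TOXIC_WORDS = {"stupid", "idiot", "trash"}

def toy_scorer(text: str) -> float:
    hits = sum(tok in TOXIC_WORDS for tok in text.lower().split())
    return min(1.0, hits / 2)

print(shielded_score("you are stupid", toy_scorer))      # 0.5 (no negation)
print(shielded_score("you are not stupid", toy_scorer))  # 0.25 (attenuated)
```

In a real deployment, score_fn would wrap an actual classifier (for Perspective, a POST to its comments:analyze endpoint requesting the TOXICITY attribute); the sketch only shows the shape of the wrapper, which reasons about negation before scoring and reconciles the two scores afterwards.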
Related papers
- Self-Supervised Learning Strategies for a Platform to Test the Toxicity of New Chemicals and Materials [1.2197883665266451]
We demonstrate how representations learned via self-supervised learning can effectively identify toxicant-induced changes. Our analysis shows that the learned representations using self-supervised learning are suitable for effectively distinguishing between the modes-of-action of different compounds.
arXiv Detail & Related papers (2025-10-09T06:51:12Z)
- U-GIFT: Uncertainty-Guided Firewall for Toxic Speech in Few-Shot Scenario [13.954929026841413]
We propose an uncertainty-guided firewall for toxic speech in few-shot scenarios, U-GIFT. U-GIFT combines active learning with Bayesian Neural Networks (BNNs) to automatically identify high-quality samples from unlabeled data. In the 5-shot setting, it achieves a 14.92% performance improvement over the basic model.
arXiv Detail & Related papers (2025-01-01T17:47:22Z)
- A Hybrid Framework for Statistical Feature Selection and Image-Based Noise-Defect Detection [55.2480439325792]
This paper presents a hybrid framework that integrates both statistical feature selection and classification techniques to improve defect detection accuracy. We present around 55 distinguishing features that are extracted from industrial images, which are then analyzed using statistical methods. By integrating these methods with flexible machine learning applications, the proposed framework improves detection accuracy and reduces false positives and misclassifications.
arXiv Detail & Related papers (2024-12-11T22:12:21Z)
- Indiscriminate Disruption of Conditional Inference on Multivariate Gaussians [60.22542847840578]
Despite advances in adversarial machine learning, inference for Gaussian models in the presence of an adversary is notably understudied.
We consider a self-interested attacker who wishes to disrupt a decision-maker's conditional inference and subsequent actions by corrupting a set of evidentiary variables.
To avoid detection, the attacker also desires the attack to appear plausible, where plausibility is determined by the density of the corrupted evidence.
arXiv Detail & Related papers (2024-11-21T17:46:55Z)
- Empirical Perturbation Analysis of Linear System Solvers from a Data Poisoning Perspective [16.569765598914152]
We investigate how errors in the input data will affect the fitting error and accuracy of the solution from a linear system-solving algorithm under perturbations common in adversarial attacks.
We propose data perturbation through two distinct knowledge levels, developing a poisoning optimization and studying two methods of perturbation: Label-guided Perturbation (LP) and Unconditioning Perturbation (UP).
Under the circumstance that the data is intentionally perturbed -- as is the case with data poisoning -- we seek to understand how different kinds of solvers react to these perturbations, identifying those algorithms most impacted by different types of adversarial attacks.
arXiv Detail & Related papers (2024-10-01T17:14:05Z)
- Multi-agent Reinforcement Learning-based Network Intrusion Detection System [3.4636217357968904]
Intrusion Detection Systems (IDS) play a crucial role in ensuring the security of computer networks.
We propose a novel multi-agent reinforcement learning (RL) architecture, enabling automatic, efficient, and robust network intrusion detection.
Our solution introduces a resilient architecture designed to accommodate the addition of new attacks and effectively adapt to changes in existing attack patterns.
arXiv Detail & Related papers (2024-07-08T09:18:59Z)
- Analyzing Adversarial Inputs in Deep Reinforcement Learning [53.3760591018817]
We present a comprehensive analysis of the characterization of adversarial inputs, through the lens of formal verification.
We introduce a novel metric, the Adversarial Rate, to classify models based on their susceptibility to such perturbations.
Our analysis empirically demonstrates how adversarial inputs can affect the safety of a given DRL system with respect to such perturbations.
arXiv Detail & Related papers (2024-02-07T21:58:40Z)
- Interactive System-wise Anomaly Detection [66.3766756452743]
Anomaly detection plays a fundamental role in various applications.
It is challenging for existing methods to handle the scenarios where the instances are systems whose characteristics are not readily observed as data.
We develop an end-to-end approach which includes an encoder-decoder module that learns system embeddings.
arXiv Detail & Related papers (2023-04-21T02:20:24Z)
- NoisyHate: Mining Online Human-Written Perturbations for Realistic Robustness Benchmarking of Content Moderation Models [13.887401380190335]
We introduce NoisyHate, a novel, high-quality dataset of human-written perturbations. We show that the perturbations in NoisyHate have different characteristics than those in prior algorithm-generated toxic datasets.
arXiv Detail & Related papers (2023-03-18T14:54:57Z)
- On the Robustness of Random Forest Against Untargeted Data Poisoning: An Ensemble-Based Approach [42.81632484264218]
In machine learning models, perturbations of fractions of the training set (poisoning) can seriously undermine the model accuracy.
This paper aims to implement a novel hash-based ensemble approach that protects random forest against untargeted, random poisoning attacks.
arXiv Detail & Related papers (2022-09-28T11:41:38Z)
- Improving robustness of jet tagging algorithms with adversarial training [56.79800815519762]
We investigate the vulnerability of flavor tagging algorithms via application of adversarial attacks.
We present an adversarial training strategy that mitigates the impact of such simulated attacks.
arXiv Detail & Related papers (2022-03-25T19:57:19Z)
- A Hamiltonian Monte Carlo Method for Probabilistic Adversarial Attack and Learning [122.49765136434353]
We present an effective method, called Hamiltonian Monte Carlo with Accumulated Momentum (HMCAM), aiming to generate a sequence of adversarial examples.
We also propose a new generative method called Contrastive Adversarial Training (CAT), which approaches equilibrium distribution of adversarial examples.
Both quantitative and qualitative analysis on several natural image datasets and practical systems have confirmed the superiority of the proposed algorithm.
arXiv Detail & Related papers (2020-10-15T16:07:26Z)
- Adversarial Machine Learning in Network Intrusion Detection Systems [6.18778092044887]
We study the nature of the adversarial problem in Network Intrusion Detection Systems.
We use evolutionary computation (particle swarm optimization and genetic algorithm) and deep learning (generative adversarial networks) as tools for adversarial example generation.
Our work highlights the vulnerability of machine learning based NIDS in the face of adversarial perturbation.
arXiv Detail & Related papers (2020-04-23T19:47:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.