PETGEN: Personalized Text Generation Attack on Deep Sequence
Embedding-based Classification Models
- URL: http://arxiv.org/abs/2109.06777v1
- Date: Tue, 14 Sep 2021 15:48:07 GMT
- Title: PETGEN: Personalized Text Generation Attack on Deep Sequence
Embedding-based Classification Models
- Authors: Bing He, Mustaque Ahamad, Srijan Kumar
- Abstract summary: Malicious users can evade deep detection models by manipulating their behavior.
Here we create a novel adversarial attack model against deep user sequence embedding-based classification models.
In the attack, the adversary generates a new post to fool the classifier.
- Score: 9.630961791758168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: \textit{What should a malicious user write next to fool a detection model?}
Identifying malicious users is critical to ensure the safety and integrity of
internet platforms. Several deep learning based detection models have been
created. However, malicious users can evade deep detection models by
manipulating their behavior, rendering these models of little use. The
vulnerability of such deep detection models against adversarial attacks is
unknown. Here we create a novel adversarial attack model against deep user
sequence embedding-based classification models, which use the sequence of user
posts to generate user embeddings and detect malicious users. In the attack,
the adversary generates a new post to fool the classifier. We propose a novel
end-to-end Personalized Text Generation Attack model, called \texttt{PETGEN},
that simultaneously reduces the efficacy of the detection model and generates
posts that have several key desirable properties. Specifically, \texttt{PETGEN}
generates posts that are personalized to the user's writing style, have
knowledge about a given target context, are aware of the user's historical
posts on the target context, and encapsulate the user's recent topical
interests. We conduct extensive experiments on two real-world datasets (Yelp
and Wikipedia, both with ground-truth of malicious users) to show that
\texttt{PETGEN} significantly reduces the performance of popular deep user
sequence embedding-based classification models. \texttt{PETGEN} outperforms
five attack baselines in terms of text quality and attack efficacy in both
white-box and black-box classifier settings. Overall, this work paves the path
towards the next generation of adversary-aware sequence classification models.
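To make the attack setting concrete, below is a minimal sketch (not the authors' code) of a toy user-sequence-embedding detector and a simple black-box attack that appends one new post to the user's history and keeps whichever candidate most lowers the malicious score. The architecture, the hashing tokenizer, and the fixed candidate pool are illustrative assumptions; PETGEN itself trains an end-to-end generator that produces the new post conditioned on the user's writing style, historical posts, and the target context, and is evaluated in both white-box and black-box settings.

```python
# Minimal sketch of the attack setting, under the assumptions stated above.
import torch
import torch.nn as nn


class SequenceClassifier(nn.Module):
    """Toy stand-in for a deep user-sequence-embedding detector (not PETGEN's target models)."""

    def __init__(self, vocab_size=5000, emb_dim=64, hid_dim=64):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, emb_dim)        # post tokens -> post embedding (mean pool)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)  # post sequence -> user embedding
        self.head = nn.Linear(hid_dim, 2)                      # user embedding -> {benign, malicious}

    def forward(self, posts):
        # posts: list of LongTensors, one tensor of token ids per post
        post_vecs = torch.stack([self.emb(p.unsqueeze(0)).squeeze(0) for p in posts])
        _, h = self.rnn(post_vecs.unsqueeze(0))                # encode the whole post history
        return self.head(h.squeeze(0)).squeeze(0)              # logits for this user


def tokenize(text, vocab_size=5000):
    # Hashing "tokenizer" used only to keep the sketch self-contained.
    return torch.tensor([hash(w) % vocab_size for w in text.lower().split()], dtype=torch.long)


@torch.no_grad()
def black_box_attack(model, history, candidate_posts):
    """Append the candidate post that most reduces P(malicious).

    The attacker only queries the model's outputs (black-box setting); PETGEN instead
    generates the appended post rather than selecting it from a fixed pool.
    """
    best_post, best_score = None, float("inf")
    for text in candidate_posts:
        posts = history + [tokenize(text)]
        p_malicious = torch.softmax(model(posts), dim=-1)[1].item()
        if p_malicious < best_score:
            best_post, best_score = text, p_malicious
    return best_post, best_score


if __name__ == "__main__":
    model = SequenceClassifier().eval()
    history = [tokenize("great food would come again"),
               tokenize("terrible service avoid this place")]
    candidates = ["lovely staff and cozy atmosphere",
                  "the pasta here is genuinely excellent"]
    post, score = black_box_attack(model, history, candidates)
    print(f"appended post: {post!r}  ->  P(malicious) = {score:.3f}")
```

In the white-box setting described in the abstract, the attacker would additionally have access to the classifier's internals, which allows gradient-based objectives rather than the query-only search sketched here.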
Related papers
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z) - Are aligned neural networks adversarially aligned? [93.91072860401856]
Adversarial users can construct inputs that circumvent attempts at alignment.
We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models.
We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z) - Deconstructing Classifiers: Towards A Data Reconstruction Attack Against
Text Classification Models [2.9735729003555345]
We propose a new targeted data reconstruction attack called the Mix And Match attack.
This work highlights the importance of considering the privacy risks associated with data reconstruction attacks in classification models.
arXiv Detail & Related papers (2023-06-23T21:25:38Z) - Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases, insignificant changes in the input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - MSDT: Masked Language Model Scoring Defense in Text Domain [16.182765935007254]
We introduce a novel textual backdoor defense method, named MSDT, that outperforms existing defensive algorithms on specific datasets.
Experimental results illustrate that our method is effective in defending against backdoor attacks in the text domain.
arXiv Detail & Related papers (2022-11-10T06:46:47Z) - Neural network fragile watermarking with no model performance
degradation [28.68910526223425]
We propose a novel fragile watermarking method for neural networks that causes no model performance degradation.
Experiments show that the proposed method can effectively detect model malicious fine-tuning with no model performance degradation.
arXiv Detail & Related papers (2022-08-16T07:55:20Z) - Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method achieves a $33.18$ BLEU score on IWSLT14 German-English translation, an improvement of $1.47$ over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z) - Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The \emph{backdoor} attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the \emph{fine-grained} attack, where we treat the target label at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z) - Robust and Verifiable Information Embedding Attacks to Deep Neural
Networks via Error-Correcting Codes [81.85509264573948]
In the era of deep learning, a user often leverages a third-party machine learning tool to train a deep neural network (DNN) classifier.
In an information embedding attack, an attacker is the provider of a malicious third-party machine learning tool.
In this work, we aim to design information embedding attacks that are verifiable and robust against popular post-processing methods.
arXiv Detail & Related papers (2020-10-26T17:42:42Z) - TextDecepter: Hard Label Black Box Attack on Text Classifiers [0.0]
We present a novel approach for hard-label black-box attacks against Natural Language Processing (NLP) classifiers.
Such an attack scenario applies to real-world black-box models being used for security-sensitive applications such as sentiment analysis and toxic content detection.
arXiv Detail & Related papers (2020-08-16T08:57:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.