PETGEN: Personalized Text Generation Attack on Deep Sequence
Embedding-based Classification Models
- URL: http://arxiv.org/abs/2109.06777v1
- Date: Tue, 14 Sep 2021 15:48:07 GMT
- Title: PETGEN: Personalized Text Generation Attack on Deep Sequence
Embedding-based Classification Models
- Authors: Bing He, Mustaque Ahamad, Srijan Kumar
- Abstract summary: Malicious users can evade deep detection models by manipulating their behavior.
Here we create a novel adversarial attack model against deep user sequence embedding-based classification models.
In the attack, the adversary generates a new post to fool the classifier.
- Score: 9.630961791758168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: \textit{What should a malicious user write next to fool a detection model?}
Identifying malicious users is critical to ensure the safety and integrity of
internet platforms. Several deep learning based detection models have been
created. However, malicious users can evade deep detection models by
manipulating their behavior, rendering these models of little use. The
vulnerability of such deep detection models against adversarial attacks is
unknown. Here we create a novel adversarial attack model against deep user
sequence embedding-based classification models, which use the sequence of user
posts to generate user embeddings and detect malicious users. In the attack,
the adversary generates a new post to fool the classifier. We propose a novel
end-to-end Personalized Text Generation Attack model, called \texttt{PETGEN},
that simultaneously reduces the efficacy of the detection model and generates
posts that have several key desirable properties. Specifically, \texttt{PETGEN}
generates posts that are personalized to the user's writing style, have
knowledge about a given target context, are aware of the user's historical
posts on the target context, and encapsulate the user's recent topical
interests. We conduct extensive experiments on two real-world datasets (Yelp
and Wikipedia, both with ground-truth of malicious users) to show that
\texttt{PETGEN} significantly reduces the performance of popular deep user
sequence embedding-based classification models. \texttt{PETGEN} outperforms
five attack baselines in terms of text quality and attack efficacy in both
white-box and black-box classifier settings. Overall, this work paves the path
towards the next generation of adversary-aware sequence classification models.
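To make the attack setting concrete, below is a minimal sketch (not the authors' code) of a toy user-sequence-embedding detector and a simple black-box attack that appends one new post to the user's history and keeps whichever candidate most lowers the malicious score. The architecture, the hashing tokenizer, and the fixed candidate pool are illustrative assumptions; PETGEN itself trains an end-to-end generator that produces the new post conditioned on the user's writing style, historical posts, and the target context, and is evaluated in both white-box and black-box settings.

```python
# Minimal sketch of the attack setting, under the assumptions stated above.
import torch
import torch.nn as nn


class SequenceClassifier(nn.Module):
    """Toy stand-in for a deep user-sequence-embedding detector (not PETGEN's target models)."""

    def __init__(self, vocab_size=5000, emb_dim=64, hid_dim=64):
        super().__init__()
        self.emb = nn.EmbeddingBag(vocab_size, emb_dim)        # post tokens -> post embedding (mean pool)
        self.rnn = nn.GRU(emb_dim, hid_dim, batch_first=True)  # post sequence -> user embedding
        self.head = nn.Linear(hid_dim, 2)                      # user embedding -> {benign, malicious}

    def forward(self, posts):
        # posts: list of LongTensors, one tensor of token ids per post
        post_vecs = torch.stack([self.emb(p.unsqueeze(0)).squeeze(0) for p in posts])
        _, h = self.rnn(post_vecs.unsqueeze(0))                # encode the whole post history
        return self.head(h.squeeze(0)).squeeze(0)              # logits for this user


def tokenize(text, vocab_size=5000):
    # Hashing "tokenizer" used only to keep the sketch self-contained.
    return torch.tensor([hash(w) % vocab_size for w in text.lower().split()], dtype=torch.long)


@torch.no_grad()
def black_box_attack(model, history, candidate_posts):
    """Append the candidate post that most reduces P(malicious).

    The attacker only queries the model's outputs (black-box setting); PETGEN instead
    generates the appended post rather than selecting it from a fixed pool.
    """
    best_post, best_score = None, float("inf")
    for text in candidate_posts:
        posts = history + [tokenize(text)]
        p_malicious = torch.softmax(model(posts), dim=-1)[1].item()
        if p_malicious < best_score:
            best_post, best_score = text, p_malicious
    return best_post, best_score


if __name__ == "__main__":
    model = SequenceClassifier().eval()
    history = [tokenize("great food would come again"),
               tokenize("terrible service avoid this place")]
    candidates = ["lovely staff and cozy atmosphere",
                  "the pasta here is genuinely excellent"]
    post, score = black_box_attack(model, history, candidates)
    print(f"appended post: {post!r}  ->  P(malicious) = {score:.3f}")
```

In the white-box setting described in the abstract, the attacker would additionally have access to the classifier's internals, which allows gradient-based objectives rather than the query-only search sketched here.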
Related papers
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z) - Are aligned neural networks adversarially aligned? [93.91072860401856]
Adversarial users can construct inputs that circumvent attempts at alignment.
We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models.
We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
arXiv Detail & Related papers (2023-06-26T17:18:44Z) - Deconstructing Classifiers: Towards A Data Reconstruction Attack Against
Text Classification Models [2.9735729003555345]
We propose a new targeted data reconstruction attack called the Mix And Match attack.
This work highlights the importance of considering the privacy risks associated with data reconstruction attacks in classification models.
arXiv Detail & Related papers (2023-06-23T21:25:38Z) - Verifying the Robustness of Automatic Credibility Assessment [79.08422736721764]
Text classification methods have been widely investigated as a way to detect content of low credibility.
In some cases, insignificant changes in the input text can mislead the models.
We introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
arXiv Detail & Related papers (2023-03-14T16:11:47Z) - MSDT: Masked Language Model Scoring Defense in Text Domain [16.182765935007254]
We introduce a novel textual backdoor defense method, named MSDT, that outperforms existing defensive algorithms on specific datasets.
Experimental results illustrate that our method is effective in defending against backdoor attacks in the text domain.
arXiv Detail & Related papers (2022-11-10T06:46:47Z) - Neural network fragile watermarking with no model performance
degradation [28.68910526223425]
We propose a novel fragile watermarking method for neural networks that causes no model performance degradation.
Experiments show that the proposed method can effectively detect model malicious fine-tuning with no model performance degradation.
arXiv Detail & Related papers (2022-08-16T07:55:20Z) - Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method achieves a $33.18$ BLEU score on IWSLT14 German-English translation, an improvement of $1.47$ over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z) - Hidden Backdoor Attack against Semantic Segmentation Models [60.0327238844584]
The \emph{backdoor} attack intends to embed hidden backdoors in deep neural networks (DNNs) by poisoning training data.
We propose a novel attack paradigm, the \emph{fine-grained} attack, where we treat the target label at the object level instead of the image level.
Experiments show that the proposed methods can successfully attack semantic segmentation models by poisoning only a small proportion of training data.
arXiv Detail & Related papers (2021-03-06T05:50:29Z) - Robust and Verifiable Information Embedding Attacks to Deep Neural
Networks via Error-Correcting Codes [81.85509264573948]
In the era of deep learning, a user often leverages a third-party machine learning tool to train a deep neural network (DNN) classifier.
In an information embedding attack, an attacker is the provider of a malicious third-party machine learning tool.
In this work, we aim to design information embedding attacks that are verifiable and robust against popular post-processing methods.
arXiv Detail & Related papers (2020-10-26T17:42:42Z) - TextDecepter: Hard Label Black Box Attack on Text Classifiers [0.0]
We present a novel approach for hard-label black-box attacks against Natural Language Processing (NLP) classifiers.
Such an attack scenario applies to real-world black-box models being used for security-sensitive applications such as sentiment analysis and toxic content detection.
arXiv Detail & Related papers (2020-08-16T08:57:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.