UPTON: Preventing Authorship Leakage from Public Text Release via Data Poisoning
- URL: http://arxiv.org/abs/2211.09717v3
- Date: Wed, 25 Oct 2023 01:48:21 GMT
- Title: UPTON: Preventing Authorship Leakage from Public Text Release via Data Poisoning
- Authors: Ziyao Wang, Thai Le and Dongwon Lee
- Abstract summary: We present a novel solution, UPTON, that exploits black-box data poisoning methods to weaken the authorship features in training samples.
We present empirical validation where UPTON successfully downgrades the accuracy of AA models to an impractical level.
UPTON remains effective against AA models that are already trained on available clean writings of the authors.
- Score: 17.956089294338984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Consider a scenario where an author (e.g., an activist or whistle-blower) with
many public writings wishes to write "anonymously" when attackers may have
already built an authorship attribution (AA) model based on public writings,
including those of the author. To enable her wish, we ask the question: "Can
one make the publicly released writings, T, unattributable so that AA models
trained on T cannot attribute their authorship well?" Toward this question, we
present a novel solution, UPTON, that exploits black-box data poisoning methods
to weaken the authorship features in training samples and make the released
texts unlearnable. It differs from previous obfuscation work, e.g., adversarial
attacks that modify test samples, or backdoor attacks that only change model
outputs when trigger words occur. Using four authorship datasets (IMDb10,
IMDb64, Enron, and WJO), we present empirical validation where UPTON
successfully downgrades the accuracy of AA models to an impractical level
(~35%) while keeping the texts readable (semantic similarity > 0.9). UPTON
remains effective against AA models that have already been trained on
available clean writings of the authors.
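The abstract evaluates UPTON along two axes: the attribution accuracy of AA models trained on the released texts, and the semantic similarity between the original and poisoned releases. Below is a minimal evaluation sketch, not the authors' implementation: it assumes a simple TF-IDF attribution model and uses TF-IDF cosine similarity as a rough stand-in for the semantic-similarity metric; the dataset variables in the usage comment are hypothetical.

```python
# Minimal evaluation sketch (assumption: NOT the UPTON pipeline itself).
# It measures (1) how well a simple attribution model trained on a released
# corpus identifies authors, and (2) a rough similarity proxy between the
# original and poisoned versions of each released text.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import cosine_similarity


def attribution_accuracy(train_texts, train_authors, test_texts, test_authors):
    """Train a TF-IDF + logistic-regression AA model and report test accuracy."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2))
    clf = LogisticRegression(max_iter=1000)
    clf.fit(vectorizer.fit_transform(train_texts), train_authors)
    return accuracy_score(test_authors, clf.predict(vectorizer.transform(test_texts)))


def mean_pairwise_similarity(originals, poisoned):
    """TF-IDF cosine similarity as a cheap stand-in for a semantic-similarity model."""
    vectorizer = TfidfVectorizer().fit(originals + poisoned)
    sims = cosine_similarity(
        vectorizer.transform(originals), vectorizer.transform(poisoned)
    )
    return float(sims.diagonal().mean())


# Hypothetical usage: compare AA accuracy when training on the clean release
# versus the poisoned release, and check that the poisoned texts stay close
# to the originals.
# acc_clean = attribution_accuracy(clean_release, authors, held_out, held_out_authors)
# acc_poisoned = attribution_accuracy(poisoned_release, authors, held_out, held_out_authors)
# readability_proxy = mean_pairwise_similarity(clean_release, poisoned_release)
```

Under the abstract's reported numbers, a successful poisoning run would show the poisoned-training accuracy dropping toward ~35% while the similarity proxy stays high; the paper reports semantic similarity > 0.9 with a proper semantic model, which this TF-IDF proxy only approximates.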
Related papers
- Forging the Forger: An Attempt to Improve Authorship Verification via Data Augmentation [52.72682366640554]
Authorship Verification (AV) is a text classification task concerned with inferring whether a candidate text has been written by one specific author or by someone else.
It has been shown that many AV systems are vulnerable to adversarial attacks, where a malicious author actively tries to fool the classifier by either concealing their writing style, or by imitating the style of another author.
arXiv Detail & Related papers (2024-03-17T16:36:26Z)
- ALISON: Fast and Effective Stylometric Authorship Obfuscation [14.297046770461264]
Authorship Attribution (AA) and Authorship Obfuscation (AO) are two competing tasks of increasing importance in privacy research.
We propose a practical AO method, ALISON, that dramatically reduces training/obfuscation time.
We also demonstrate that ALISON can effectively prevent four SOTA AA methods from accurately determining the authorship of ChatGPT-generated texts.
arXiv Detail & Related papers (2024-02-01T18:22:32Z) - Punctuation Matters! Stealthy Backdoor Attack for Language Models [36.91297828347229]
A backdoored model produces normal outputs on the clean samples while performing improperly on the texts.
Some attack methods even cause grammatical issues or change the semantic meaning of the original texts.
We propose a novel stealthy backdoor attack method against textual models, which is called textbfPuncAttack.
arXiv Detail & Related papers (2023-12-26T03:26:20Z)
- Understanding writing style in social media with a supervised contrastively pre-trained transformer [57.48690310135374]
Online Social Networks serve as fertile ground for harmful behavior, ranging from hate speech to the dissemination of disinformation.
We introduce the Style Transformer for Authorship Representations (STAR), trained on a large corpus derived from public sources of 4.5 x 10^6 authored texts.
Using a support base of 8 documents of 512 tokens, we can discern authors from sets of up to 1616 authors with at least 80% accuracy (a toy sketch of this support-set setup appears after this list).
arXiv Detail & Related papers (2023-10-17T09:01:17Z)
- Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark [58.60940048748815]
Companies have begun to offer Embedding as a Service (EaaS) based on large language models (LLMs).
EaaS is vulnerable to model extraction attacks, which can cause significant losses for the owners of LLMs.
We propose an Embedding Watermark method called EmbMarker that implants backdoors on embeddings.
arXiv Detail & Related papers (2023-05-17T08:28:54Z)
- Verifying the Robustness of Automatic Credibility Assessment [50.55687778699995]
We show that meaning-preserving changes in input text can mislead the models.
We also introduce BODEGA: a benchmark for testing both victim models and attack methods on misinformation detection tasks.
Our experimental results show that modern large language models are often more vulnerable to attacks than previous, smaller solutions.
arXiv Detail & Related papers (2023-03-14T16:11:47Z)
- MSDT: Masked Language Model Scoring Defense in Text Domain [16.182765935007254]
We introduce a novel textual backdoor defense method, named MSDT, that outperforms existing defense algorithms on specific datasets.
Experimental results illustrate that our method is effective at defending against backdoor attacks in the text domain.
arXiv Detail & Related papers (2022-11-10T06:46:47Z)
- Towards Variable-Length Textual Adversarial Attacks [68.27995111870712]
It is non-trivial to conduct textual adversarial attacks on natural language processing tasks due to the discreteness of data.
In this paper, we propose variable-length textual adversarial attacks (VL-Attack).
Our method can achieve a 33.18 BLEU score on IWSLT14 German-English translation, an improvement of 1.47 over the baseline model.
arXiv Detail & Related papers (2021-04-16T14:37:27Z)
- Be Careful about Poisoned Word Embeddings: Exploring the Vulnerability of the Embedding Layers in NLP Models [27.100909068228813]
Recent studies have revealed a security threat to natural language processing (NLP) models, called the Backdoor Attack.
In this paper, we find that it is possible to hack the model in a data-free way by modifying one single word embedding vector.
Experimental results on sentiment analysis and sentence-pair classification tasks show that our method is more efficient and stealthier.
arXiv Detail & Related papers (2021-03-29T12:19:45Z)
- Concealed Data Poisoning Attacks on NLP Models [56.794857982509455]
Adversarial attacks alter NLP model predictions by perturbing test-time inputs.
We develop a new data poisoning attack that allows an adversary to control model predictions whenever a desired trigger phrase is present in the input.
arXiv Detail & Related papers (2020-10-23T17:47:06Z)
- Natural Backdoor Attack on Text Data [15.35163515187413]
In this paper, we propose backdoor attacks on NLP models.
We exploit various attack strategies to generate triggers on text data and investigate different types of triggers based on modification scope, human recognition, and special cases.
The results show excellent performance, with a 100% backdoor attack success rate while sacrificing only 0.83% on the text classification task.
arXiv Detail & Related papers (2020-06-29T16:40:14Z)
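As a companion to the STAR entry above, the following toy sketch shows support-set author identification with a fixed number of support documents per candidate author. Character n-gram TF-IDF features serve here as a classic stylometric stand-in for STAR's contrastively learned representations; the feature choice, function names, and data layout are assumptions for illustration, not the paper's method.

```python
# Toy support-set attribution sketch (assumption: character n-gram TF-IDF as a
# stand-in for learned style representations such as STAR).
# Each candidate author contributes a small "support" set of documents; a query
# document is attributed to the author whose averaged support vector is closest.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def build_author_profiles(support_docs):
    """support_docs: dict mapping author -> list of support documents."""
    all_docs = [doc for docs in support_docs.values() for doc in docs]
    vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)).fit(all_docs)
    profiles = {
        author: np.asarray(vectorizer.transform(docs).mean(axis=0))
        for author, docs in support_docs.items()
    }
    return vectorizer, profiles


def attribute(query_doc, vectorizer, profiles):
    """Return the candidate author whose profile is most similar to the query."""
    query_vec = vectorizer.transform([query_doc])
    scores = {
        author: float(cosine_similarity(query_vec, profile)[0, 0])
        for author, profile in profiles.items()
    }
    return max(scores, key=scores.get), scores
```

With a handful of support documents per author, this nearest-profile rule mirrors the evaluation setup sketched in the STAR summary; actual accuracy depends on the quality of the representation, which STAR learns with supervised contrastive pre-training.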