Just Rewrite It Again: A Post-Processing Method for Enhanced Semantic Similarity and Privacy Preservation of Differentially Private Rewritten Text
- URL: http://arxiv.org/abs/2405.19831v2
- Date: Fri, 31 May 2024 07:24:55 GMT
- Title: Just Rewrite It Again: A Post-Processing Method for Enhanced Semantic Similarity and Privacy Preservation of Differentially Private Rewritten Text
- Authors: Stephen Meisenbacher, Florian Matthes
- Abstract summary: We propose a simple post-processing method based on the goal of aligning rewritten texts with their original counterparts.
Our results show that such an approach not only produces outputs that are more semantically reminiscent of the original inputs, but also texts which score on average better in empirical privacy evaluations.
- Score: 3.3916160303055567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The study of Differential Privacy (DP) in Natural Language Processing often views the task of text privatization as a $\textit{rewriting}$ task, in which sensitive input texts are rewritten to hide explicit or implicit private information. In order to evaluate the privacy-preserving capabilities of a DP text rewriting mechanism, $\textit{empirical privacy}$ tests are frequently employed. In these tests, an adversary is modeled, who aims to infer sensitive information (e.g., gender) about the author behind a (privatized) text. Looking to improve the empirical protections provided by DP rewriting methods, we propose a simple post-processing method based on the goal of aligning rewritten texts with their original counterparts, where DP rewritten texts are rewritten $\textit{again}$. Our results show that such an approach not only produces outputs that are more semantically reminiscent of the original inputs, but also texts which score on average better in empirical privacy evaluations. Therefore, our approach raises the bar for DP rewriting methods in their empirical privacy evaluations, providing an extra layer of protection against malicious adversaries.
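As a concrete illustration of the two-step pipeline described in the abstract, the hedged Python sketch below first applies a (stubbed) differentially private rewriting mechanism and then rewrites its output again with a public sequence-to-sequence model. The `dp_rewrite` stub, the prompt wording, and the choice of `google/flan-t5-small` as the second-stage rewriter are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of the "rewrite it again" post-processing pipeline.
# The DP step is a stub and the second-stage model is an arbitrary public
# seq2seq model; the paper's actual alignment model differs.
from transformers import pipeline

def dp_rewrite(text: str, epsilon: float) -> str:
    """Stand-in for any DP text rewriting mechanism (e.g., DP-BART, DP-MLM).
    A real mechanism would inject calibrated noise to satisfy epsilon-DP."""
    return text  # placeholder only

def rewrite_again(dp_text: str, rewriter) -> str:
    """Second rewriting pass. It sees only the DP output, so the DP guarantee
    is preserved by the post-processing property of differential privacy."""
    prompt = f"Paraphrase this sentence: {dp_text}"
    return rewriter(prompt, max_length=128)[0]["generated_text"]

if __name__ == "__main__":
    rewriter = pipeline("text2text-generation", model="google/flan-t5-small")
    original = "I grew up in a small coastal town and still visit every summer."
    private = dp_rewrite(original, epsilon=10.0)   # step 1: DP rewriting
    final = rewrite_again(private, rewriter)       # step 2: rewrite it again
    print(final)
```

In an empirical privacy evaluation of the kind mentioned above, an adversarial classifier (e.g., for author gender) would then be trained and tested on such post-processed outputs, with lower inference accuracy indicating better empirical protection.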
Related papers
- TextDestroyer: A Training- and Annotation-Free Diffusion Method for Destroying Anomal Text from Images [84.08181780666698]
TextDestroyer is the first training- and annotation-free method for scene text destruction.
Our method scrambles text areas in the latent start code using a Gaussian distribution before reconstruction.
The advantages of TextDestroyer include: (1) it eliminates labor-intensive data annotation and resource-intensive training; (2) it achieves more thorough text destruction, preventing recognizable traces; and (3) it demonstrates better generalization capabilities, performing well on both real-world scenes and generated images.
arXiv Detail & Related papers (2024-11-01T04:41:00Z) - DP-MLM: Differentially Private Text Rewriting Using Masked Language Models [4.637328271312331]
We propose a new method for differentially private text rewriting based on leveraging masked language models (MLMs).
We accomplish this with a simple contextualization technique, whereby we rewrite a text one token at a time.
We find that utilizing encoder-only models provides better utility preservation at lower $\varepsilon$ levels, as compared to previous methods.
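A rough sketch of the one-token-at-a-time idea follows: mask each position in turn, query a masked language model, and sample a replacement from a clipped, sharpened distribution. The noise calibration (scaling logits by $\varepsilon$) and the `bert-base-uncased` checkpoint are illustrative stand-ins, not the calibrated mechanism from the paper.

```python
# Illustrative token-by-token MLM rewriting in the spirit of DP-MLM.
# The selection step below is NOT a proper DP mechanism; it only mimics the
# effect that higher epsilon yields outputs closer to the original text.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

def rewrite_one_token_at_a_time(text: str, epsilon: float = 10.0) -> str:
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    new_ids = ids.clone()
    for pos in range(1, len(ids) - 1):              # skip [CLS] and [SEP]
        context = ids.clone()
        context[pos] = tokenizer.mask_token_id      # contextualize: mask only this position
        with torch.no_grad():
            logits = model(context.unsqueeze(0)).logits[0, pos]
        # Clip logits and sample from a sharpened distribution; higher epsilon
        # means less randomness, i.e. replacements closer to the MLM's top choice.
        clipped = torch.clamp(logits, -10.0, 10.0)
        probs = torch.softmax(clipped * epsilon / 20.0, dim=-1)
        new_ids[pos] = torch.multinomial(probs, num_samples=1).item()
    return tokenizer.decode(new_ids[1:-1])

print(rewrite_one_token_at_a_time("My weekend trip to Berlin was wonderful."))
```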
arXiv Detail & Related papers (2024-06-30T09:31:01Z) - IDT: Dual-Task Adversarial Attacks for Privacy Protection [8.312362092693377]
Methods to protect privacy can involve using representations inside models that do not detect sensitive attributes.
We propose IDT, a method that analyses predictions made by auxiliary and interpretable models to identify which tokens are important to change.
We evaluate on different NLP datasets suitable for different tasks.
arXiv Detail & Related papers (2024-06-28T04:14:35Z) - NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human [55.20137833039499]
We suggest sanitizing sensitive text using two common strategies used by humans.
We curate the first corpus, coined NAP2, through both crowdsourcing and the use of large language models.
arXiv Detail & Related papers (2024-06-06T05:07:44Z) - Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z) - Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models [63.91178922306669]
We introduce Silent Guardian, a text protection mechanism against large language models (LLMs).
By carefully modifying the text to be protected, Truncation Protection Examples (TPE) can induce LLMs to first sample the end token, thus directly terminating the interaction.
We show that SG can effectively protect the target text under various configurations and achieve almost 100% protection success rate in some cases.
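The objective behind this can be illustrated by measuring how likely a causal LLM is to emit its end-of-sequence token as the very first generated token when conditioned on the protected text; a TPE tries to push this probability towards 1. The sketch below uses GPT-2 purely as an example model and does not show how a TPE is actually constructed.

```python
# Illustrative only: probability that a causal LM's first sampled token is
# the end-of-sequence token, which a TPE aims to drive towards 1.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2")

def end_token_probability(protected_text: str) -> float:
    ids = tok(protected_text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        next_token_logits = lm(ids).logits[0, -1]   # logits for the next token
    probs = torch.softmax(next_token_logits, dim=-1)
    return probs[tok.eos_token_id].item()

print(end_token_probability("This article may not be used to train language models."))
```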
arXiv Detail & Related papers (2023-12-15T10:30:36Z) - Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z) - DP-BART for Privatized Text Rewriting under Local Differential Privacy [2.45626162429986]
We propose a new system 'DP-BART' that largely outperforms existing LDP systems.
Our approach uses a novel clipping method, iterative pruning, and further training of internal representations which drastically reduces the amount of noise required for DP guarantees.
arXiv Detail & Related papers (2023-02-15T13:07:34Z) - DP-Rewrite: Towards Reproducibility and Transparency in Differentially Private Text Rewriting [2.465904360857451]
We introduce DP-Rewrite, an open-source framework for differentially private text rewriting.
Our system incorporates a variety of downstream datasets, models, pre-training procedures, and evaluation metrics.
We provide a set of experiments as a case study on the ADePT DP text rewriting system, detecting a privacy leak in its pre-training approach.
arXiv Detail & Related papers (2022-08-22T15:38:16Z) - Adversarial Watermarking Transformer: Towards Tracing Text Provenance with Data Hiding [80.3811072650087]
We study natural language watermarking as a defense to help better mark and trace the provenance of text.
We introduce the Adversarial Watermarking Transformer (AWT) with a jointly trained encoder-decoder and adversarial training.
AWT is the first end-to-end model to hide data in text by automatically learning -- without ground truth -- word substitutions along with their locations.
arXiv Detail & Related papers (2020-09-07T11:01:24Z) - Privacy Guarantees for De-identifying Text Transformations [17.636430224292866]
We derive formal privacy guarantees for text transformation-based de-identification methods on the basis of Differential Privacy.
We compare a simple redaction approach with more sophisticated word-by-word replacement using deep learning models on multiple natural language understanding tasks.
We find that only word-by-word replacement is robust against performance drops in various tasks.
arXiv Detail & Related papers (2020-08-07T12:06:42Z)