DP-MLM: Differentially Private Text Rewriting Using Masked Language Models
- URL: http://arxiv.org/abs/2407.00637v1
- Date: Sun, 30 Jun 2024 09:31:01 GMT
- Title: DP-MLM: Differentially Private Text Rewriting Using Masked Language Models
- Authors: Stephen Meisenbacher, Maulik Chevli, Juraj Vladika, Florian Matthes
- Abstract summary: We propose a new method for differentially private text rewriting based on leveraging masked language models (MLMs).
We accomplish this with a simple contextualization technique, whereby we rewrite a text one token at a time.
We find that utilizing encoder-only MLMs provides better utility preservation at lower $\varepsilon$ levels, as compared to previous methods.
- Score: 4.637328271312331
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of text privatization using Differential Privacy has recently taken the form of $\textit{text rewriting}$, in which an input text is obfuscated via the use of generative (large) language models. While these methods have shown promising results in their ability to preserve privacy, they rely on autoregressive models which lack a mechanism to contextualize the private rewriting process. In response to this, we propose $\textbf{DP-MLM}$, a new method for differentially private text rewriting based on leveraging masked language models (MLMs) to rewrite text in a semantically similar $\textit{and}$ obfuscated manner. We accomplish this with a simple contextualization technique, whereby we rewrite a text one token at a time. We find that utilizing encoder-only MLMs provides better utility preservation at lower $\varepsilon$ levels, as compared to previous methods relying on larger models with a decoder. In addition, MLMs allow for greater customization of the rewriting mechanism, as opposed to generative approaches. We make the code for $\textbf{DP-MLM}$ public and reusable, found at https://github.com/sjmeis/DPMLM.
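To make the one-token-at-a-time rewriting concrete, the sketch below masks each position in turn, scores candidate replacements with an encoder-only MLM, and selects a replacement via the exponential mechanism (implemented as temperature-scaled softmax sampling over clipped logits). This is a minimal illustration under stated assumptions, not the authors' exact mechanism: the model choice (bert-base-uncased), the clipping bound, and the use of the already-rewritten left context are all illustrative choices; the reference implementation is at https://github.com/sjmeis/DPMLM.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Sketch of token-by-token private rewriting with an encoder-only MLM.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def dp_rewrite(text: str, epsilon: float, clip: float = 5.0) -> str:
    """Rewrite `text` one token at a time; `epsilon` is the per-token budget."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    new_ids = ids.clone()
    for i in range(1, len(ids) - 1):            # skip [CLS] and [SEP]
        masked = new_ids.clone()
        masked[i] = tokenizer.mask_token_id     # contextualize: mask only position i
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        # Clip scores so the exponential mechanism's utility sensitivity is
        # bounded by 2 * clip (the clipping bound is an illustrative choice).
        scores = logits.clamp(-clip, clip)
        probs = torch.softmax(scores * epsilon / (2 * clip), dim=-1)
        new_ids[i] = torch.multinomial(probs, num_samples=1).item()
    return tokenizer.decode(new_ids[1:-1])

print(dp_rewrite("the patient was admitted to the hospital", epsilon=50.0))
```

In this per-token formulation each position consumes its own $\varepsilon$; how budgets compose over a full document depends on the chosen privacy unit, which the sketch leaves open.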
Related papers
- Just Rewrite It Again: A Post-Processing Method for Enhanced Semantic Similarity and Privacy Preservation of Differentially Private Rewritten Text [3.3916160303055567]
We propose a simple post-processing method based on the goal of aligning rewritten texts with their original counterparts.
Our results show that such an approach not only produces outputs that are more semantically reminiscent of the original inputs, but also texts which score on average better in empirical privacy evaluations.
arXiv Detail & Related papers (2024-05-30T08:41:33Z) - Generative Text Steganography with Large Language Model [10.572149957139736]
We propose LLM-Stega, a black-box generative text steganography method based on the user interfaces of large language models.
We first construct a keyword set and design a new encrypted steganographic mapping to embed secret messages.
Comprehensive experiments demonstrate that the proposed LLM-Stega outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-04-16T02:19:28Z) - HU at SemEval-2024 Task 8A: Can Contrastive Learning Learn Embeddings to Detect Machine-Generated Text? [0.0]
This paper describes our system developed for SemEval-2024 Task 8, "Multigenerator, Multidomain, and Multilingual Black-Box Machine-Generated Text Detection".
Our key finding is that even without an ensemble of multiple models, a single base model can have comparable performance with the help of data augmentation and contrastive learning.
arXiv Detail & Related papers (2024-02-19T04:11:34Z) - Silent Guardian: Protecting Text from Malicious Exploitation by Large Language Models [63.91178922306669]
We introduce Silent Guardian (SG), a text protection mechanism against large language models (LLMs) built on Truncation Protection Examples (TPE).
By carefully modifying the text to be protected, TPE can induce LLMs to first sample the end token, thus directly terminating the interaction.
We show that SG can effectively protect the target text under various configurations and achieve almost 100% protection success rate in some cases.
arXiv Detail & Related papers (2023-12-15T10:30:36Z) - TextDiffuser-2: Unleashing the Power of Language Models for Text Rendering [118.30923824681642]
TextDiffuser-2 aims to unleash the power of language models for text rendering.
We utilize the language model within the diffusion model to encode the position and texts at the line level.
We conduct extensive experiments and incorporate user studies involving human participants as well as GPT-4V.
arXiv Detail & Related papers (2023-11-28T04:02:40Z) - Text Embeddings Reveal (Almost) As Much As Text [86.5822042193058]
We investigate the problem of embedding $\textit{inversion}$, reconstructing the full text represented in dense text embeddings.
We find that although a naïve model conditioned on the embedding performs poorly, a multi-step method that iteratively corrects and re-embeds text is able to recover $92\%$ of $32$-token text inputs exactly.
arXiv Detail & Related papers (2023-10-10T17:39:03Z) - TOPFORMER: Topology-Aware Authorship Attribution of Deepfake Texts with Diverse Writing Styles [14.205559299967423]
Recent advances in Large Language Models (LLMs) have enabled the generation of open-ended, high-quality texts that are non-trivial to distinguish from human-written texts.
Users with malicious intent can easily use these open-sourced LLMs to generate harmful texts and dis/misinformation at scale.
To mitigate this problem, a computational method to determine if a given text is a deepfake text or not is desired.
We propose TopFormer to improve existing authorship attribution (AA) solutions by capturing more linguistic patterns in deepfake texts.
arXiv Detail & Related papers (2023-09-22T15:32:49Z) - Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z) - Representation Deficiency in Masked Language Modeling [107.39136254013042]
We propose MAE-LM, which pretrains the Masked Autoencoder architecture, where $\texttt{[MASK]}$ tokens are excluded from the encoder.
We show that MAE-LM consistently outperforms MLM-pretrained models across different pretraining settings and model sizes when fine-tuned on the GLUE and SQuAD benchmarks.
arXiv Detail & Related papers (2023-02-04T01:54:17Z) - Unsupervised Text Style Transfer with Padded Masked Language Models [25.397832729384064]
Masker is an unsupervised text-editing method for style transfer.
It performs competitively in a fully unsupervised setting.
It improves supervised methods' accuracy by over 10 percentage points in low-resource settings.
arXiv Detail & Related papers (2020-10-02T15:33:42Z) - UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks, using a novel training procedure referred to as a pseudo-masked language model (PMLM).
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)