InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy
- URL: http://arxiv.org/abs/2507.02974v1
- Date: Mon, 30 Jun 2025 18:00:41 GMT
- Title: InvisibleInk: High-Utility and Low-Cost Text Generation with Differential Privacy
- Authors: Vishnu Vinod, Krishna Pillutla, Abhradeep Guha Thakurta
- Abstract summary: InvisibleInk is a scalable long-form text generation framework satisfying rigorous differential privacy guarantees. We reduce the privacy cost by isolating and clipping only the sensitive information in the model logits. We improve text quality by sampling from a small superset of the top-$k$ private tokens.
- Score: 7.006059299522521
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: As major progress in LLM-based long-form text generation enables paradigms such as retrieval-augmented generation (RAG) and inference-time scaling, safely incorporating private information into the generation remains a critical open question. We present InvisibleInk, a highly scalable long-form text generation framework satisfying rigorous differential privacy guarantees with respect to the sensitive references. It interprets sampling from the LLM's next-token distribution as the exponential mechanism over the LLM logits and introduces two innovations. First, we reduce the privacy cost by isolating and clipping only the sensitive information in the model logits (relative to the public logits). Second, we improve text quality by sampling from a small superset of the top-$k$ private tokens. Empirical evaluations demonstrate a consistent $8\times$ reduction in computation cost over state-of-the-art baselines to generate long-form private text of the same utility across privacy levels. In summary, InvisibleInk is able to generate private long-form text at less than $10\times$ the computation cost of non-private generation.
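The abstract's two ideas (clipping only the sensitive logit difference, then applying the exponential mechanism over a small top-$k$ candidate set) can be illustrated with a minimal sketch. This is not the paper's implementation; the function name, the clipping bound `clip`, and the exact privacy accounting are assumptions for illustration only.

```python
import numpy as np

def private_next_token(private_logits, public_logits, clip=1.0, k=50, eps=1.0):
    """Hedged sketch of InvisibleInk-style private token sampling.

    Only the sensitive part of the logits (their difference from the public
    logits) is clipped, bounding sensitivity; sampling is then restricted to
    a small candidate set and performed via the exponential mechanism.
    """
    # Isolate and clip the sensitive information relative to the public logits.
    delta = np.clip(private_logits - public_logits, -clip, clip)
    clipped = public_logits + delta

    # Restrict to the top-k tokens of the clipped (private) distribution.
    candidates = np.argsort(clipped)[-k:]

    # Exponential mechanism: clipped logits vary by at most 2*clip across
    # neighboring datasets, so scale scores by eps / (2 * clip).
    scores = eps * clipped[candidates] / (2.0 * clip)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return int(np.random.choice(candidates, p=probs))
```

In this sketch the public logits act as a free baseline: only the private model's deviation from them consumes privacy budget, which is what makes the per-token cost small.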
Related papers
- DP-Fusion: Token-Level Differentially Private Inference for Large Language Models [37.73455762168357]
Large language models (LLMs) can leak sensitive information from their context through generated outputs, either accidentally or when prompted adversarially. We propose DP-Fusion, a token-level Differentially Private Inference (DPI) mechanism that provably bounds how much an LLM's outputs reveal about sensitive tokens in its context.
arXiv Detail & Related papers (2025-07-06T20:49:39Z)
- Urania: Differentially Private Insights into AI Use [104.7449031243196]
Urania provides end-to-end privacy protection by leveraging DP tools such as clustering, partition selection, and histogram-based summarization. Results show the framework's ability to extract meaningful conversational insights while maintaining stringent user privacy.
arXiv Detail & Related papers (2025-06-05T07:00:31Z)
- Spend Your Budget Wisely: Towards an Intelligent Distribution of the Privacy Budget in Differentially Private Text Rewriting [3.0177210416625124]
We construct and evaluate a toolkit of linguistics- and NLP-based methods for allocating a privacy budget to constituent tokens in a text document. Our work highlights the intricacies of text privatization with DP and calls for further work on more efficient ways to maximize the privatization benefits offered by DP in text rewriting.
arXiv Detail & Related papers (2025-03-28T12:33:46Z)
- Investigating User Perspectives on Differentially Private Text Privatization [81.59631769859004]
This work investigates how the factors of scenario, data sensitivity, mechanism type, and reason for data collection impact user preferences for text privatization. We learn that while all these factors play a role in influencing privacy decisions, users are highly sensitive to the utility and coherence of the private output texts.
arXiv Detail & Related papers (2025-03-12T12:33:20Z)
- Privacy-Preserving Retrieval-Augmented Generation with Differential Privacy [25.896416088293908]
Retrieval-augmented generation (RAG) is particularly effective in assisting large language models (LLMs). However, RAG outputs risk leaking sensitive information from the external data source. We propose an algorithm that spends the privacy budget only on the tokens that require the sensitive information.
arXiv Detail & Related papers (2024-12-06T01:20:16Z)
- Private prediction for large-scale synthetic text generation [28.488459921169905]
We present an approach for generating differentially private synthetic text using large language models (LLMs).
In the private prediction framework, we only require the output synthetic data to satisfy differential privacy guarantees.
arXiv Detail & Related papers (2024-07-16T18:28:40Z)
- Differentially Private Synthetic Data via Foundation Model APIs 2: Text [56.13240830670327]
A lot of high-quality text data generated in the real world is private and cannot be shared or used freely due to privacy concerns.
We propose an augmented PE algorithm, named Aug-PE, that applies to the complex setting of text.
Our results demonstrate that Aug-PE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines.
arXiv Detail & Related papers (2024-03-04T05:57:50Z)
- InferDPT: Privacy-Preserving Inference for Black-box Large Language Model [66.07752875835506]
InferDPT is the first practical framework for privacy-preserving inference of black-box LLMs. RANTEXT is a novel differential privacy mechanism integrated into the perturbation module of InferDPT.
arXiv Detail & Related papers (2023-10-18T18:00:11Z)
- Smooth Anonymity for Sparse Graphs [69.1048938123063]
Differential privacy has emerged as the gold standard of privacy; however, it faces challenges when it comes to sharing sparse datasets.
In this work, we consider a variation of $k$-anonymity, which we call smooth-$k$-anonymity, and design simple large-scale algorithms that efficiently provide smooth-$k$-anonymity.
arXiv Detail & Related papers (2022-07-13T17:09:25Z)
- Just Fine-tune Twice: Selective Differential Privacy for Large Language Models [69.66654761324702]
We propose a simple yet effective just-fine-tune-twice privacy mechanism to achieve SDP for large Transformer-based language models.
Experiments show that our models achieve strong performance while staying robust to the canary insertion attack.
arXiv Detail & Related papers (2022-04-15T22:36:55Z)
- Differentially Private n-gram Extraction [19.401898070938593]
We revisit the problem of $n$-gram extraction in the differential privacy setting.
In this problem, given a corpus of private text data, the goal is to release as many $n$-grams as possible while preserving user-level privacy.
We develop a new differentially private algorithm for this problem which, in our experiments, significantly outperforms the state-of-the-art.
arXiv Detail & Related papers (2021-08-05T19:53:16Z)
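The n-gram extraction entry above describes the general problem; a standard way to release frequent items under user-level DP is to bound each user's contribution, add calibrated noise to counts, and release only items whose noisy count clears a threshold. The sketch below shows that generic recipe, not the paper's specific algorithm; the function name, `max_contrib`, and `threshold` are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def dp_release_ngrams(user_docs, n=2, eps=1.0, max_contrib=10, threshold=20.0):
    """Generic DP n-gram release sketch (noisy counts + thresholding).

    Each user contributes at most `max_contrib` distinct n-grams, which
    bounds the user-level sensitivity of every count at `max_contrib`.
    Counts receive Laplace noise scaled to that sensitivity, and only
    n-grams whose noisy count clears `threshold` are released.
    """
    counts = Counter()
    for doc in user_docs:  # one document per user in this toy setup
        tokens = doc.split()
        grams = {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        for g in list(grams)[:max_contrib]:  # cap per-user contribution
            counts[g] += 1
    released = []
    for g, c in counts.items():
        noisy = c + np.random.laplace(scale=max_contrib / eps)
        if noisy >= threshold:
            released.append((" ".join(g), noisy))
    return released
```

Thresholding matters here because the set of candidate n-grams itself depends on the private data: rare n-grams contributed by a single user must not survive the release.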
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all of the above) and is not responsible for any consequences of its use.