SynBench: A Benchmark for Differentially Private Text Generation
- URL: http://arxiv.org/abs/2509.14594v1
- Date: Thu, 18 Sep 2025 03:57:50 GMT
- Title: SynBench: A Benchmark for Differentially Private Text Generation
- Authors: Yidan Sun, Viktor Schlegel, Srinivasan Nandakumar, Iqra Zahid, Yuping Wu, Yulong Wu, Hao Li, Jie Zhang, Warren Del-Pinto, Goran Nenadic, Siew Kei Lam, Anil Anthony Bharath
- Abstract summary: Data-driven decision support in high-stakes domains like healthcare and finance faces significant barriers to data sharing. Recent generative AI models, such as large language models, have shown impressive performance in open-domain tasks. But their adoption in sensitive environments remains limited by unpredictable behaviors and insufficient privacy-preserving datasets.
- Score: 35.908455649647784
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data-driven decision support in high-stakes domains like healthcare and finance faces significant barriers to data sharing due to regulatory, institutional, and privacy concerns. While recent generative AI models, such as large language models, have shown impressive performance in open-domain tasks, their adoption in sensitive environments remains limited by unpredictable behaviors and insufficient privacy-preserving datasets for benchmarking. Existing anonymization methods are often inadequate, especially for unstructured text, as redaction and masking can still allow re-identification. Differential Privacy (DP) offers a principled alternative, enabling the generation of synthetic data with formal privacy assurances. In this work, we address these challenges through three key contributions. First, we introduce a comprehensive evaluation framework with standardized utility and fidelity metrics, encompassing nine curated datasets that capture domain-specific complexities such as technical jargon, long-context dependencies, and specialized document structures. Second, we conduct a large-scale empirical study benchmarking state-of-the-art DP text generation methods and LLMs of varying sizes and different fine-tuning strategies, revealing that high-quality domain-specific synthetic data generation under DP constraints remains an unsolved challenge, with performance degrading as domain complexity increases. Third, we develop a membership inference attack (MIA) methodology tailored for synthetic text, providing the first empirical evidence that the use of public datasets - potentially present in pre-training corpora - can invalidate claimed privacy guarantees. Our findings underscore the urgent need for rigorous privacy auditing and highlight persistent gaps between open-domain and specialist evaluations, informing responsible deployment of generative AI in privacy-sensitive, high-stakes settings.
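The abstract does not spell out the MIA methodology itself. As a purely hypothetical sketch of the general idea behind membership inference on synthetic text (not the paper's actual method), one can score a candidate record by its best lexical overlap with any released synthetic sample; unusually high overlap is a leakage signal. All function names and the trigram/Jaccard choices below are illustrative assumptions.

```python
def ngrams(text, n=3):
    """Return the set of word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def membership_score(candidate, synthetic_corpus, n=3):
    """Score a candidate record by its best n-gram overlap with any
    synthetic sample. A high score suggests the record may have
    influenced generation; it is a leakage signal, not proof of
    membership."""
    cand = ngrams(candidate, n)
    return max((jaccard(cand, ngrams(s, n)) for s in synthetic_corpus),
               default=0.0)

# Toy example: a near-duplicate of a training record should score
# higher than an unrelated record.
synthetic = [
    "patient presented with acute chest pain and shortness of breath",
    "routine follow up visit no new complaints reported today",
]
member_like = "patient presented with acute chest pain and dizziness"
unrelated = "quarterly earnings exceeded analyst expectations this year"

print(membership_score(member_like, synthetic))
print(membership_score(unrelated, synthetic))
```

Real audits would replace the lexical score with model-aware signals (e.g. likelihoods or embeddings) and calibrate against a reference population, but the thresholding logic is the same.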
Related papers
- A Late-Fusion Multimodal AI Framework for Privacy-Preserving Deduplication in National Healthcare Data Environments [0.0]
I propose a novel, scalable, multimodal AI framework for detecting duplicates without depending on sensitive information. This approach offers a privacy-compliant solution to entity resolution, supports secure digital infrastructure, and enhances the reliability of public health analytics. It is well-suited for integration into national health data modernization efforts, aligning with broader goals of privacy-first innovation.
arXiv Detail & Related papers (2026-03-04T20:46:26Z) - Agentic Adversarial QA for Improving Domain-Specific LLMs [53.00642389531106]
Large Language Models (LLMs) often struggle to adapt effectively to specialized domains. We propose an adversarial question-generation framework that produces a compact set of semantically challenging questions.
arXiv Detail & Related papers (2026-02-20T10:53:09Z) - Rethinking Anonymity Claims in Synthetic Data Generation: A Model-Centric Privacy Attack Perspective [18.404146545866812]
Training generative machine learning models to produce synthetic data has become a popular approach for enhancing privacy in data sharing. As this typically involves processing sensitive personal information, releasing either the trained model or the generated synthetic data can still pose privacy risks. We argue that meaningful assessments must account for the capabilities and properties of the underlying generative model and be grounded in state-of-the-art privacy attacks.
arXiv Detail & Related papers (2026-01-30T00:57:41Z) - How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy [52.00934156883483]
Differential Privacy (DP) is a framework for reasoning about and limiting information leakage. Differentially private synthetic data refers to synthetic data that preserves the overall trends of the source data.
arXiv Detail & Related papers (2025-12-02T21:14:39Z) - On the MIA Vulnerability Gap Between Private GANs and Diffusion Models [51.53790101362898]
Generative Adversarial Networks (GANs) and diffusion models have emerged as leading approaches for high-quality image synthesis. We present the first unified theoretical and empirical analysis of the privacy risks faced by differentially private generative models.
arXiv Detail & Related papers (2025-09-03T14:18:22Z) - Evaluating Differentially Private Generation of Domain-Specific Text [33.72321050465059]
We introduce a unified benchmark to systematically evaluate the utility and fidelity of text datasets generated under Differential Privacy guarantees. We assess state-of-the-art privacy-preserving generation methods across five domain-specific datasets.
arXiv Detail & Related papers (2025-08-28T05:57:47Z) - RL-Finetuned LLMs for Privacy-Preserving Synthetic Rewriting [17.294176570269]
We propose a reinforcement learning framework that fine-tunes a large language model (LLM) using a composite reward function. The privacy reward combines semantic cues with structural patterns derived from a minimum spanning tree (MST) over latent representations. Empirical results show that the proposed method significantly enhances author obfuscation and privacy metrics without degrading semantic quality.
arXiv Detail & Related papers (2025-08-25T04:38:19Z) - Generating Synthetic Data with Formal Privacy Guarantees: State of the Art and the Road Ahead [7.410975558116122]
Privacy-preserving synthetic data offers a promising solution to harness segregated data in high-stakes domains. This survey presents the theoretical foundations of generative models and differential privacy, followed by a review of state-of-the-art methods. Our findings highlight key challenges including unaccounted privacy leakage, insufficient empirical verification of formal guarantees, and a critical deficit of realistic benchmarks.
arXiv Detail & Related papers (2025-03-26T16:06:33Z) - Privacy in Fine-tuning Large Language Models: Attacks, Defenses, and Future Directions [11.338466798715906]
Fine-tuning Large Language Models (LLMs) can achieve state-of-the-art performance across various domains. This paper provides a comprehensive survey of privacy challenges associated with fine-tuning LLMs. We highlight vulnerabilities to various privacy attacks, including membership inference, data extraction, and backdoor attacks.
arXiv Detail & Related papers (2024-12-21T06:41:29Z) - Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Anonymizing text that contains sensitive information is crucial for a wide range of applications. Existing techniques face the emerging challenges of the re-identification ability of large language models. We propose a framework composed of three key components: a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z) - Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources. RAG systems may face severe privacy risks when retrieving private data. We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z) - PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z) - A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
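Several entries above invoke Differential Privacy without defining the mechanism. As a minimal textbook illustration (not the method of any listed paper), the classic Laplace mechanism releases a numeric query with epsilon-DP by adding noise scaled to sensitivity/epsilon; the function name and defaults below are illustrative.

```python
import random

def dp_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with epsilon-DP via the Laplace mechanism.
    Noise ~ Laplace(0, sensitivity / epsilon), sampled here as the
    difference of two i.i.d. exponentials with mean `scale`."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

# Example: privately release the number of records matching a query.
# Smaller epsilon means more noise and a stronger privacy guarantee.
rng = random.Random(42)
noisy = dp_count(100, epsilon=1.0, rng=rng)
```

The same budget-accounting idea underlies DP fine-tuning of generative models (e.g. DP-SGD), where gradients rather than counts are noised.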
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.