DP-DocLDM: Differentially Private Document Image Generation using Latent Diffusion Models
- URL: http://arxiv.org/abs/2508.04208v1
- Date: Wed, 06 Aug 2025 08:43:08 GMT
- Title: DP-DocLDM: Differentially Private Document Image Generation using Latent Diffusion Models
- Authors: Saifullah Saifullah, Stefan Agne, Andreas Dengel, Sheraz Ahmed
- Abstract summary: We aim to address the challenges within the context of document image classification by substituting real private data with a synthetic counterpart. In particular, we propose to use conditional latent diffusion models (LDMs) in combination with differential privacy (DP) to generate class-specific synthetic document images. We show that our approach achieves substantial performance improvements in downstream evaluations on small-scale datasets.
- Score: 5.247930659596986
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As deep learning-based, data-driven information extraction systems become increasingly integrated into modern document processing workflows, one primary concern is the risk of malicious leakage of sensitive private data from these systems. While some recent works have explored Differential Privacy (DP) to mitigate these privacy risks, DP-based training is known to cause significant performance degradation and impose several limitations on standard training procedures, making its direct application to downstream tasks both difficult and costly. In this work, we aim to address the above challenges within the context of document image classification by substituting real private data with a synthetic counterpart. In particular, we propose to use conditional latent diffusion models (LDMs) in combination with differential privacy (DP) to generate class-specific synthetic document images under strict privacy constraints, which can then be utilized to train a downstream classifier following standard training procedures. We investigate our approach under various pretraining setups, including unconditional, class-conditional, and layout-conditional pretraining, in combination with multiple private training strategies such as class-conditional and per-label private fine-tuning with DPDM and DP-Promise algorithms. Additionally, we evaluate it on two well-known document benchmark datasets, RVL-CDIP and Tobacco3482, and show that it can generate useful and realistic document samples across various document types and privacy levels ($\varepsilon \in \{1, 5, 10\}$). Lastly, we show that our approach achieves substantial performance improvements in downstream evaluations on small-scale datasets, compared to the direct application of DP-Adam.
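The abstract contrasts synthetic-data generation with direct DP training such as DP-Adam. As a rough illustration of why DP training constrains the standard procedure, the sketch below shows the generic DP-SGD gradient step (per-example clipping followed by Gaussian noise) that DP-Adam-style optimizers build on. It is not the authors' implementation: the function names, the toy gradients, and the parameter values `max_norm` and `noise_multiplier` are assumptions made for illustration, and mapping a noise multiplier to a privacy level $\varepsilon$ requires a privacy accountant that is not shown here.

```python
import math
import random

def clip_gradient(grad, max_norm):
    """Scale a gradient so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]

def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        clipped = clip_gradient(grad, max_norm)
        total = [t + c for t, c in zip(total, clipped)]
    # The noise standard deviation is tied to the clipping bound (hypothetical values).
    std = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, std) for t in total]
    return [n / len(per_example_grads) for n in noisy]

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the toy run is repeatable
    toy_grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]  # made-up per-example gradients
    print(private_mean_gradient(toy_grads, max_norm=1.0, noise_multiplier=1.0, rng=rng))
```

Because every example's gradient must be clipped individually before aggregation, this step needs per-example gradients, which is one reason DP training departs from standard batched training. A classifier trained on DP-generated synthetic images avoids this step in the downstream task.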
Related papers
- Private Training & Data Generation by Clustering Embeddings [74.00687214400021]
Differential privacy (DP) provides a robust framework for protecting individual data. We introduce a novel principled method for DP synthetic image embedding generation. Empirically, a simple two-layer neural network trained on synthetically generated embeddings achieves state-of-the-art (SOTA) classification accuracy.
arXiv Detail & Related papers (2025-06-20T00:17:14Z)
- Differentially Private Relational Learning with Entity-level Privacy Guarantees [17.567309430451616]
This work presents a principled framework for relational learning with formal entity-level DP guarantees. We introduce an adaptive gradient clipping scheme that modulates clipping thresholds based on entity occurrence frequency. These contributions lead to a tailored DP-SGD variant for relational data with provable privacy guarantees.
arXiv Detail & Related papers (2025-06-10T02:03:43Z)
- Activity Recognition on Avatar-Anonymized Datasets with Masked Differential Privacy [64.32494202656801]
Privacy-preserving computer vision is an important emerging problem in machine learning and artificial intelligence. We present an anonymization pipeline that replaces sensitive human subjects in video datasets with synthetic avatars within context. We also propose MaskDP to protect non-anonymized but privacy-sensitive background information.
arXiv Detail & Related papers (2024-10-22T15:22:53Z)
- LLM-based Privacy Data Augmentation Guided by Knowledge Distillation with a Distribution Tutor for Medical Text Classification [67.92145284679623]
We propose a DP-based tutor that models the noised private distribution and controls samples' generation with a low privacy cost.
We theoretically analyze our model's privacy protection and empirically verify our model.
arXiv Detail & Related papers (2024-02-26T11:52:55Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Synthetic Query Generation for Privacy-Preserving Deep Retrieval Systems using Differentially Private Language Models [21.66239227367523]
We propose an approach that prioritizes ensuring query privacy prior to training a deep retrieval system.
Our method employs DP language models (LMs) to generate private synthetic queries representative of the original data.
arXiv Detail & Related papers (2023-05-10T08:30:31Z)
- On the Efficacy of Differentially Private Few-shot Image Classification [40.49270725252068]
In many applications including personalization and federated learning, it is crucial to perform well in the few-shot setting.
We show how the accuracy and vulnerability to attack of few-shot DP image classification models are affected as the number of shots per class, privacy level, model architecture, downstream dataset, and subset of learnable parameters in the model vary.
arXiv Detail & Related papers (2023-02-02T16:16:25Z)
- Private Set Generation with Discriminative Information [63.851085173614]
Differentially private data generation is a promising solution to the data privacy challenge.
Existing private generative models are struggling with the utility of synthetic samples.
We introduce a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
arXiv Detail & Related papers (2022-11-07T10:02:55Z)
- Differentially Private Diffusion Models [46.46256537222917]
We build on the recent success of diffusion models (DMs) and introduce Differentially Private Diffusion Models (DPDMs).
We propose noise multiplicity, a powerful modification of DP-SGD tailored to the training of DMs.
We validate our novel DPDMs on image generation benchmarks and achieve state-of-the-art performance in all experiments.
arXiv Detail & Related papers (2022-10-18T15:20:47Z)
- Differentially private federated deep learning for multi-site medical image segmentation [56.30543374146002]
Collaborative machine learning techniques such as federated learning (FL) enable the training of models on effectively larger datasets without data transfer.
Recent initiatives have demonstrated that segmentation models trained with FL can achieve performance similar to locally trained models.
However, FL is not a fully privacy-preserving technique and privacy-centred attacks can disclose confidential patient data.
arXiv Detail & Related papers (2021-07-06T12:57:32Z)
- On Deep Learning with Label Differential Privacy [54.45348348861426]
We study the multi-class classification setting where the labels are considered sensitive and ought to be protected.
We propose a new algorithm for training deep neural networks with label differential privacy, and run evaluations on several datasets.
arXiv Detail & Related papers (2021-02-11T15:09:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.