Related papers: Optimizing the Privacy-Utility Balance using Synthetic Data and Configurable Perturbation Pipelines

Optimizing the Privacy-Utility Balance using Synthetic Data and Configurable Perturbation Pipelines

URL: http://arxiv.org/abs/2504.18596v1
Date: Thu, 24 Apr 2025 15:52:53 GMT
Title: Optimizing the Privacy-Utility Balance using Synthetic Data and Configurable Perturbation Pipelines
Authors: Anantha Sharma, Swetha Devabhaktuni, Eklove Mohan,
Abstract summary: This paper explores the strategic use of modern synthetic data generation and advanced data perturbation techniques to enhance security, maintain analytical utility, and improve operational efficiency when managing large datasets.<n>The goal is to create realistic, privacy-preserving datasets that retain high utility for complex machine learning tasks and analytics, a critical need in the data-sensitive industries like BFSI, Healthcare, Retail, and Telecommunications.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: This paper explores the strategic use of modern synthetic data generation and advanced data perturbation techniques to enhance security, maintain analytical utility, and improve operational efficiency when managing large datasets, with a particular focus on the Banking, Financial Services, and Insurance (BFSI) sector. We contrast these advanced methods encompassing generative models like GANs, sophisticated context-aware PII transformation, configurable statistical perturbation, and differential privacy with traditional anonymization approaches. The goal is to create realistic, privacy-preserving datasets that retain high utility for complex machine learning tasks and analytics, a critical need in the data-sensitive industries like BFSI, Healthcare, Retail, and Telecommunications. We discuss how these modern techniques potentially offer significant improvements in balancing privacy preservation while maintaining data utility compared to older methods. Furthermore, we examine the potential for operational gains, such as reduced overhead and accelerated analytics, by using these privacy-enhanced datasets. We also explore key use cases where these methods can mitigate regulatory risks and enable scalable, data-driven innovation without compromising sensitive customer information.

Related papers

Learning More with Less: A Generalizable, Self-Supervised Framework for Privacy-Preserving Capacity Estimation with EV Charging Data [84.37348569981307]
We propose a first-of-its-kind capacity estimation model based on self-supervised pre-training.<n>Our model consistently outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2025-10-05T08:58:35Z)
Privacy-Utility Trade-off in Data Publication: A Bilevel Optimization Framework with Curvature-Guided Perturbation [22.727580097886747]
We introduce a novel bilevel optimization framework for the publication of private datasets.<n>In the upper-level task, a discriminator guides the generation process to ensure that latent variables are mapped to high-quality samples.<n>In the lower-level task, our framework employs local extrinsic curvature on the data manifold as a quantitative measure of individual vulnerability to MIA.
arXiv Detail & Related papers (2025-09-02T07:44:21Z)
Rethinking Data Protection in the (Generative) Artificial Intelligence Era [115.71019708491386]
We propose a four-level taxonomy that captures the diverse protection needs arising in modern (generative) AI models and systems.<n>Our framework offers a structured understanding of the trade-offs between data utility and control, spanning the entire AI pipeline.
arXiv Detail & Related papers (2025-07-03T02:45:51Z)
Privacy-Preserving Analytics for Smart Meter (AMI) Data: A Hybrid Approach to Comply with CPUC Privacy Regulations [0.0]
In California, regulatory decisions mandate strict privacy protections for customer energy usage data.<n>We explore solutions drawn from data anonymization, privacy-preserving machine learning, synthetic data generation, and cryptographic techniques.<n>This allows advanced analytics, including machine learning models, statistical and econometric analysis on energy consumption data, to be performed without compromising individual privacy.
arXiv Detail & Related papers (2025-05-13T05:30:35Z)
Privacy-Preserving Federated Embedding Learning for Localized Retrieval-Augmented Generation [60.81109086640437]
We propose a novel framework called Federated Retrieval-Augmented Generation (FedE4RAG)<n>FedE4RAG facilitates collaborative training of client-side RAG retrieval models.<n>We apply homomorphic encryption within federated learning to safeguard model parameters.
arXiv Detail & Related papers (2025-04-27T04:26:02Z)
Adaptive Clipping for Privacy-Preserving Few-Shot Learning: Enhancing Generalization with Limited Data [12.614480013684759]
We introduce a novel approach called Meta-Clip to enhance the utility of privacy-preserving few-shot learning methods.<n>By dynamically adjusting clipping thresholds during the training process, our Adaptive Clipping method provides fine-grained control over the disclosure of sensitive information.<n>We demonstrate the effectiveness of our approach in minimizing utility degradation, showcasing a superior privacy-preserving trade-off compared to existing privacy-preserving techniques.
arXiv Detail & Related papers (2025-03-27T05:14:18Z)
Differentially Private Federated Learning of Diffusion Models for Synthetic Tabular Data Generation [5.182014186927255]
We introduce DP-Fed-FinDiff framework, a novel integration of Differential Privacy, Federated Learning and Denoising Diffusion Probabilistic Models. We demonstrate the effectiveness of DP-Fed-FinDiff on multiple real-world financial datasets. The results affirm the potential of DP-Fed-FinDiff to enable secure data sharing and robust analytics in highly regulated domains.
arXiv Detail & Related papers (2024-12-20T17:30:58Z)
Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy. Existing techniques face the emerging challenges of re-identification attack ability of Large Language Models. This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
Collection, usage and privacy of mobility data in the enterprise and public administrations [55.2480439325792]
Security measures such as anonymization are needed to protect individuals' privacy. Within our study, we conducted expert interviews to gain insights into practices in the field. We survey privacy-enhancing methods in use, which generally do not comply with state-of-the-art standards of differential privacy.
arXiv Detail & Related papers (2024-07-04T08:29:27Z)
Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.<n>RAG systems may face severe privacy risks when retrieving private data.<n>We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
MaSS: Multi-attribute Selective Suppression for Utility-preserving Data Transformation from an Information-theoretic Perspective [10.009178591853058]
We propose a formal information-theoretic definition for this utility-preserving privacy protection problem. We design a data-driven learnable data transformation framework that is capable of suppressing sensitive attributes from target datasets. Results demonstrate the effectiveness and generalizability of our method under various configurations.
arXiv Detail & Related papers (2024-05-23T18:35:46Z)
Synergizing Privacy and Utility in Data Analytics Through Advanced Information Theorization [2.28438857884398]
We introduce three sophisticated algorithms: a Noise-Infusion Technique tailored for high-dimensional image data, a Variational Autoencoder (VAE) for robust feature extraction and an Expectation Maximization (EM) approach optimized for structured data privacy. Our methods significantly reduce mutual information between sensitive attributes and transformed data, thereby enhancing privacy. The research contributes to the field by providing a flexible and effective strategy for deploying privacy-preserving algorithms across various data types.
arXiv Detail & Related papers (2024-04-24T22:58:42Z)
FewFedPIT: Towards Privacy-preserving and Few-shot Federated Instruction Tuning [54.26614091429253]
Federated instruction tuning (FedIT) is a promising solution, by consolidating collaborative training across multiple data owners. FedIT encounters limitations such as scarcity of instructional data and risk of exposure to training data extraction attacks. We propose FewFedPIT, designed to simultaneously enhance privacy protection and model performance of federated few-shot learning.
arXiv Detail & Related papers (2024-03-10T08:41:22Z)
Data Collaboration Analysis with Orthonormal Basis Selection and Alignment [2.928964540437144]
We propose textbfOrthonormal DC (ODC), a novel framework that enforces orthonormal constraints during the basis selection and alignment phases.<n>Unlike conventional DC -- which allows arbitrary target bases -- ODC restricts the target to orthonormal bases, rendering the specific choice of basis negligible concerning model performance.
arXiv Detail & Related papers (2024-03-05T08:52:16Z)
TranDRL: A Transformer-Driven Deep Reinforcement Learning Enabled Prescriptive Maintenance Framework [58.474610046294856]
Industrial systems demand reliable predictive maintenance strategies to enhance operational efficiency and reduce downtime. This paper introduces an integrated framework that leverages the capabilities of the Transformer model-based neural networks and deep reinforcement learning (DRL) algorithms to optimize system maintenance actions.
arXiv Detail & Related papers (2023-09-29T02:27:54Z)
Distributed data analytics [8.415530878975751]
Recommendation systems are a key component of online service providers. Financial industry has adopted ML to harness large volumes of data in areas such as fraud detection, risk-management, and compliance.
arXiv Detail & Related papers (2022-03-26T14:10:51Z)
Efficient Logistic Regression with Local Differential Privacy [0.0]
Internet of Things devices are expanding rapidly and generating huge amount of data. There is an increasing need to explore data collected from these devices. Collaborative learning provides a strategic solution for the Internet of Things settings but also raises public concern over data privacy.
arXiv Detail & Related papers (2022-02-05T22:44:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.