Augmenting Anonymized Data with AI: Exploring the Feasibility and Limitations of Large Language Models in Data Enrichment
- URL: http://arxiv.org/abs/2504.03778v1
- Date: Thu, 03 Apr 2025 13:26:59 GMT
- Title: Augmenting Anonymized Data with AI: Exploring the Feasibility and Limitations of Large Language Models in Data Enrichment
- Authors: Stefano Cirillo, Domenico Desiato, Giuseppe Polese, Monica Maria Lucia Sebillo, Giandomenico Solimando
- Abstract summary: Large Language Models (LLMs) have demonstrated advanced capabilities in both text generation and comprehension. Their application to data archives might facilitate the privacy protection of sensitive information about the data subjects. This data, if not safeguarded, may bring privacy risks in terms of both disclosure and identification.
- Score: 3.459382629188014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated advanced capabilities in both text generation and comprehension, and their application to data archives might facilitate the privacy protection of sensitive information about the data subjects. In fact, such data often includes sensitive and personally identifiable details which, if not safeguarded, may bring privacy risks in terms of both disclosure and identification. Furthermore, anonymization techniques such as k-anonymity can significantly reduce the amount of data within data sources, which may reduce the efficacy of predictive processes. In our study, we investigate the capabilities offered by LLMs to enrich anonymized data sources without affecting their anonymity. To this end, we designed new ad-hoc prompt template engineering strategies to perform anonymized Data Augmentation and assess the effectiveness of LLM-based approaches in providing anonymized data. To validate the anonymization guarantees provided by LLMs, we exploited the pyCanon library, which assesses the parameter values (e.g., k for k-anonymity) associated with the most common privacy-preserving anonymization techniques. Our experiments conducted on real-world datasets demonstrate that LLMs yield promising results for this goal.
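To make the described workflow concrete, the following is a minimal sketch, assuming hypothetical quasi-identifier columns and an illustrative prompt template (not the authors' actual prompt engineering strategies): an LLM is prompted to generate additional records for an anonymized table, and an augmented batch is accepted only if k-anonymity over the quasi-identifiers is preserved. The pyCanon call in the closing comment follows that library's documented interface; verify it against your installed version.

```python
# Minimal sketch; QUASI_IDENTIFIERS, PROMPT_TEMPLATE, and the helper
# functions below are illustrative assumptions, not the paper's artifacts.
import pandas as pd

QUASI_IDENTIFIERS = ["age_range", "zip_prefix", "gender"]  # hypothetical QIs

PROMPT_TEMPLATE = (
    "Given the anonymized records below, generate {n} additional plausible "
    "records with the same schema. Keep every quasi-identifier inside its "
    "generalized range and do not reconstruct identities.\n\n{records}"
)

def k_anonymity(df: pd.DataFrame, qis: list) -> int:
    """k is the size of the smallest equivalence class over the QIs."""
    return int(df.groupby(qis).size().min())

def augment_and_validate(original: pd.DataFrame,
                         synthetic: pd.DataFrame,
                         k_min: int) -> pd.DataFrame:
    """Accept LLM-generated rows only if the merged table stays k-anonymous."""
    merged = pd.concat([original, synthetic], ignore_index=True)
    if k_anonymity(merged, QUASI_IDENTIFIERS) < k_min:
        raise ValueError("augmentation weakens k-anonymity; rejecting batch")
    return merged

# The paper validates guarantees with pyCanon; per that library's docs the
# equivalent check is roughly:
#   from pycanon import anonymity
#   k = anonymity.k_anonymity(merged, QUASI_IDENTIFIERS)
```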
Related papers
- Information-Guided Identification of Training Data Imprint in (Proprietary) Large Language Models [52.439289085318634]
We show how to identify training data known to proprietary large language models (LLMs) by using information-guided probes.
Our work builds on a key observation: text passages with high surprisal are good search material for memorization probes.
arXiv Detail & Related papers (2025-03-15T10:19:15Z)
- SafeSynthDP: Leveraging Large Language Models for Privacy-Preserving Synthetic Data Generation Using Differential Privacy [0.0]
We investigate the capability of Large Language Models (LLMs) to generate synthetic datasets with Differential Privacy (DP) mechanisms. Our approach incorporates DP-based noise injection methods, including Laplace and Gaussian distributions, into the data generation process (see the Laplace-mechanism sketch after this list). We then evaluate the utility of these DP-enhanced synthetic datasets by comparing the performance of ML models trained on them against models trained on the original data.
arXiv Detail & Related papers (2024-12-30T01:10:10Z)
- Evaluating Differentially Private Synthetic Data Generation in High-Stakes Domains [9.123834467375532]
We explore the feasibility of using synthetic data generated from differentially private language models in place of real data to facilitate the development of NLP in high-stakes domains.
Our results show that prior simplistic evaluations have failed to highlight utility, privacy, and fairness issues in the synthetic data.
arXiv Detail & Related papers (2024-10-10T19:31:02Z)
- Generated Data with Fake Privacy: Hidden Dangers of Fine-tuning Large Language Models on Generated Data [18.984529269623135]
This study investigates whether fine-tuning with generated data truly enhances privacy or introduces additional privacy risks. We use the Pythia Model Suite and Open Pre-trained Transformer to measure privacy risks.
arXiv Detail & Related papers (2024-09-12T10:14:12Z)
- LLM-PBE: Assessing Data Privacy in Large Language Models [111.58198436835036]
Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis.
Despite the critical nature of this issue, there has been no existing literature to offer a comprehensive assessment of data privacy risks in LLMs.
Our paper introduces LLM-PBE, a toolkit crafted specifically for the systematic evaluation of data privacy risks in LLMs.
arXiv Detail & Related papers (2024-08-23T01:37:29Z)
- Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenge of re-identification attacks enabled by Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
arXiv Detail & Related papers (2024-06-20T22:53:09Z)
- Large Language Models are Advanced Anonymizers [2.9373912230684565]
Recent privacy research on large language models (LLMs) has shown that they achieve near-human-level performance at inferring personal data from online texts. Existing text anonymization methods currently lag behind regulatory requirements and adversarial threats. We present a new setting for evaluating anonymization in the face of adversarial LLM inferences.
arXiv Detail & Related papers (2024-02-21T14:44:00Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Releasing survey microdata with exact cluster locations and additional privacy safeguards [77.34726150561087]
We propose an alternative microdata dissemination strategy that leverages the utility of the original microdata with additional privacy safeguards.
Our strategy reduces respondents' re-identification risk by 60-80% for any number of disclosed attributes, even under dedicated re-identification attempts.
arXiv Detail & Related papers (2022-05-24T19:37:11Z)
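As referenced in the SafeSynthDP entry above, DP-based noise injection can use the Laplace mechanism. The following is a textbook sketch under stated assumptions (a numeric query with known L1 sensitivity), not that paper's implementation.

```python
# Textbook Laplace mechanism: adds Laplace(0, sensitivity/epsilon) noise,
# yielding epsilon-differential privacy for a query with L1 sensitivity.
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    rng = np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privatize a count query (sensitivity 1) at epsilon = 0.5.
noisy_count = laplace_mechanism(true_value=1234.0, sensitivity=1.0, epsilon=0.5)
```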