A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
- URL: http://arxiv.org/abs/2504.21035v1
- Date: Mon, 28 Apr 2025 01:16:27 GMT
- Title: A False Sense of Privacy: Evaluating Textual Data Sanitization Beyond Surface-level Privacy Leakage
- Authors: Rui Xin, Niloofar Mireshghallah, Shuyue Stella Li, Michael Duan, Hyunwoo Kim, Yejin Choi, Yulia Tsvetkov, Sewoong Oh, Pang Wei Koh
- Abstract summary: We propose a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information can be used to infer sensitive attributes like age or substance use history from sanitized data.
- Score: 77.83757117924995
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sanitizing sensitive text data typically involves removing personally identifiable information (PII) or generating synthetic data under the assumption that these methods adequately protect privacy; however, their effectiveness is often assessed only by measuring the leakage of explicit identifiers, ignoring nuanced textual markers that can lead to re-identification. We challenge this illusion of privacy by proposing a new framework that evaluates re-identification attacks to quantify individual privacy risks upon data release. Our approach shows that seemingly innocuous auxiliary information -- such as routine social activities -- can be used to infer sensitive attributes like age or substance use history from sanitized data. For instance, we demonstrate that Azure's commercial PII removal tool fails to protect 74% of information in the MedQA dataset. Although differential privacy mitigates these risks to some extent, it significantly reduces the utility of the sanitized text for downstream tasks. Our findings indicate that current sanitization techniques offer a *false sense of privacy*, highlighting the need for more robust methods that protect against semantic-level information leakage.
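As a concrete illustration of the linkage attack the abstract describes, the sketch below matches auxiliary facts about known individuals against residual context in "sanitized" records; the records, names, and overlap scoring are all invented for illustration and are not the authors' framework:

```python
# Hypothetical illustration: linking sanitized records back to identities
# using only "innocuous" auxiliary facts (hobbies, routines), then reading
# off a sensitive attribute. Not the paper's actual framework or data.

SANITIZED = [  # explicit PII removed, but contextual markers survive
    {"id": 0, "text": "patient jogs daily, volunteers at the food bank",
     "sensitive": "history of opioid use"},
    {"id": 1, "text": "patient plays bridge weekly and gardens",
     "sensitive": "type 2 diabetes"},
]

AUXILIARY = {  # publicly observable facts about known individuals
    "Alice": {"jogs", "volunteers", "food", "bank"},
    "Bob": {"bridge", "gardens", "chess"},
}

def link(record, aux_profiles):
    """Rank candidate identities by token overlap with the sanitized text."""
    tokens = set(record["text"].replace(",", "").split())
    scores = {name: len(tokens & prof) for name, prof in aux_profiles.items()}
    return max(scores, key=scores.get), scores

for rec in SANITIZED:
    name, scores = link(rec, AUXILIARY)
    print(f"record {rec['id']} -> {name} {scores}; leaks: {rec['sensitive']}")
```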
Related papers
- Investigating Vulnerabilities of GPS Trip Data to Trajectory-User Linking Attacks [49.1574468325115]
We propose a novel attack to reconstruct user identifiers in GPS trip datasets consisting of single trips. We show that the risk of re-identification is significant even when personal identifiers have been removed. Further investigations indicate that users who frequently visit locations that are only visited by a small number of others tend to be more vulnerable to re-identification.
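A minimal sketch of such a trajectory-user linking attack, with rare locations weighted more heavily, reflecting the vulnerability finding above (hypothetical data and scoring, not the paper's method):

```python
# Hypothetical sketch of a trajectory-user linking (TUL) attack: a single
# anonymized trip is matched to user histories by shared locations, with
# rarely visited locations weighted more heavily. Illustrative only.
from collections import Counter

HISTORIES = {  # locations each (known) user has visited before
    "u1": ["cafe_A", "gym_B", "office_C"],
    "u2": ["cafe_A", "park_D", "clinic_E"],  # clinic_E is rarely visited
}

def link_trip(trip, histories):
    visits = Counter(loc for h in histories.values() for loc in h)
    def score(history):
        # rare locations (low global visit count) dominate the match score
        return sum(1.0 / visits[loc] for loc in trip if loc in history)
    return max(histories, key=lambda u: score(histories[u]))

anonymous_trip = ["cafe_A", "clinic_E"]      # identifiers already removed
print(link_trip(anonymous_trip, HISTORIES))  # -> 'u2' via the rare clinic
```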
arXiv Detail & Related papers (2025-02-12T08:54:49Z) - Enforcing Demographic Coherence: A Harms Aware Framework for Reasoning about Private Data Release [14.939460540040459]
We introduce demographic coherence, a condition inspired by privacy attacks that we argue is necessary for data privacy. Our framework focuses on confidence-rated predictors, which can in turn be distilled from almost any data-informed process. We prove that every differentially private data release is also demographically coherent, and that there are demographically coherent algorithms which are not differentially private.
arXiv Detail & Related papers (2025-02-04T20:42:30Z) - Activity Recognition on Avatar-Anonymized Datasets with Masked Differential Privacy [64.32494202656801]
Privacy-preserving computer vision is an important emerging problem in machine learning and artificial intelligence. We present an anonymization pipeline that replaces sensitive human subjects in video datasets with synthetic avatars within context. We also propose MaskDP to protect non-anonymized but privacy-sensitive background information.
arXiv Detail & Related papers (2024-10-22T15:22:53Z) - A Summary of Privacy-Preserving Data Publishing in the Local Setting [0.6749750044497732]
Statistical Disclosure Control aims to minimize the risk of exposing confidential information by de-identifying it.
We outline the current privacy-preserving techniques employed in microdata de-identification, delve into privacy measures tailored for various disclosure scenarios, and assess metrics for information loss and predictive performance.
arXiv Detail & Related papers (2023-12-19T04:23:23Z) - Diff-Privacy: Diffusion-based Face Privacy Protection [58.1021066224765]
In this paper, we propose a novel face privacy protection method based on diffusion models, dubbed Diff-Privacy.
Specifically, we train our proposed multi-scale image inversion module (MSI) to obtain a set of SDM format conditional embeddings of the original image.
Based on the conditional embeddings, we design corresponding embedding scheduling strategies and construct different energy functions during the denoising process to achieve anonymization and visual identity information hiding.
arXiv Detail & Related papers (2023-09-11T09:26:07Z) - Slice it up: Unmasking User Identities in Smartwatch Health Data [1.4797368693230672]
We introduce a novel framework for similarity-based Dynamic Time Warping (DTW) re-identification attacks on time series health data. Our attack is independent of training data and computes similarity rankings in about 2 minutes for 10,000 subjects on a single CPU core.
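The ranking step can be illustrated with a self-contained DTW sketch; the traces and subject names below are invented, and the paper's pipeline is far more optimized:

```python
# Hypothetical sketch of similarity-based DTW re-identification on time
# series: an unlabeled trace is assigned to whichever enrolled subject's
# trace it is closest to under Dynamic Time Warping. Illustrative only.
import math

def dtw(a, b):
    """Classic O(len(a)*len(b)) dynamic-time-warping distance."""
    n, m = len(a), len(b)
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

ENROLLED = {  # e.g., heart-rate traces with known owners
    "subject_1": [60, 62, 65, 90, 88, 64],
    "subject_2": [75, 76, 74, 75, 77, 76],
}
unlabeled = [61, 64, 89, 87, 65, 63]  # an "anonymized" released trace
print(min(ENROLLED, key=lambda s: dtw(ENROLLED[s], unlabeled)))  # subject_1
```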
arXiv Detail & Related papers (2023-08-16T12:14:50Z) - How Do Input Attributes Impact the Privacy Loss in Differential Privacy? [55.492422758737575]
We study the connection between the per-subject gradient norm in DP neural networks and individual privacy loss.
We introduce a novel metric termed the Privacy Loss-Input Susceptibility (PLIS) which allows one to apportion the subject's privacy loss to their input attributes.
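A rough numpy sketch in this spirit: in DP-SGD a subject's influence is bounded through their per-sample gradient norm, so computing that norm and its per-attribute breakdown gives a crude attribution. This is an assumption-laden illustration, not the paper's PLIS definition:

```python
# Minimal numpy sketch in the spirit of attributing per-subject privacy
# loss: compute each subject's per-sample gradient norm (the quantity
# DP-SGD clips) and the share each input attribute contributes to it.
# Hypothetical illustration only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # 4 subjects, 3 input attributes
y = np.array([0, 1, 1, 0])
w = rng.normal(size=3)               # logistic-regression weights

p = 1.0 / (1.0 + np.exp(-X @ w))     # predicted probabilities
per_sample_grad = (p - y)[:, None] * X   # d(log-loss)/dw, one row per subject

for i, g in enumerate(per_sample_grad):
    share = np.abs(g) / np.abs(g).sum()  # attribute-wise share of the norm
    print(f"subject {i}: grad norm {np.linalg.norm(g):.3f}, "
          f"attribute shares {np.round(share, 2)}")
```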
arXiv Detail & Related papers (2022-11-18T11:39:03Z) - Hiding Visual Information via Obfuscating Adversarial Perturbations [47.315523613407244]
We propose an adversarial visual information hiding method to protect the visual privacy of data.
Specifically, the method generates obfuscating adversarial perturbations to obscure the visual information of the data.
Experimental results on the recognition and classification tasks demonstrate that the proposed method can effectively hide visual information.
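A toy sketch of the idea, using a linear recognizer and FGSM-style loss-ascent steps (all models and parameters invented for illustration, not the paper's method):

```python
# Hypothetical numpy sketch of obfuscating adversarial perturbations:
# nudge an input along the gradient that *increases* a recognizer's loss,
# so the released data no longer supports reliable recognition.
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(5, 10))    # toy linear recognizer: 5 identities, 10-dim input
x = rng.normal(size=10)         # original (private) data point

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

label = int((W @ x).argmax())   # identity the recognizer currently outputs
eps, steps = 0.05, 20
x_adv = x.copy()
for _ in range(steps):
    p = softmax(W @ x_adv)
    grad_x = W.T @ (p - np.eye(5)[label])  # d(cross-entropy)/dx for `label`
    x_adv += eps * np.sign(grad_x)         # ascend the loss (FGSM-style steps)

print("recognizer confidence in the original identity:",
      softmax(W @ x)[label].round(3), "->", softmax(W @ x_adv)[label].round(3))
```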
arXiv Detail & Related papers (2022-09-30T08:23:26Z) - Decouple-and-Sample: Protecting sensitive information in task agnostic data release [17.398889291769986]
The proposed sanitizer is a framework for secure and task-agnostic data release.
We show that a better privacy-utility trade-off is achieved if sensitive information can be synthesized privately.
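One minimal way to read "synthesized privately" is to release a sensitive column re-sampled from a differentially private histogram; the sketch below assumes Laplace noise on sensitivity-1 counts and is not the paper's sanitizer:

```python
# Hypothetical sketch: keep non-sensitive fields as-is, but replace the
# sensitive column with samples drawn from a Laplace-noised (epsilon-DP)
# histogram of its values. Illustrative only.
import numpy as np

rng = np.random.default_rng(2)
values = ["none", "past", "current"]
sensitive = rng.choice(values, size=1000, p=[0.7, 0.2, 0.1])

epsilon = 1.0
counts = np.array([(sensitive == v).sum() for v in values])
noisy = counts + rng.laplace(scale=1.0 / epsilon, size=3)  # sensitivity-1 counts
probs = np.clip(noisy, 0, None)
probs /= probs.sum()

synthetic = rng.choice(values, size=1000, p=probs)  # privately synthesized column
syn_counts = np.array([(synthetic == v).sum() for v in values])
print("true mix:     ", np.round(counts / 1000, 2))
print("synthetic mix:", np.round(syn_counts / 1000, 2))
```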
arXiv Detail & Related papers (2022-03-17T19:15:33Z) - Semantics-Preserved Distortion for Personal Privacy Protection in Information Management [65.08939490413037]
This paper suggests a linguistically-grounded approach to distort texts while maintaining semantic integrity.
We present two distinct frameworks for semantic-preserving distortion: a generative approach and a substitutive approach.
We also explore privacy protection in a specific medical information management scenario, showing our method effectively limits sensitive data memorization.
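A bare-bones sketch of the substitutive idea: swap privacy-bearing surface forms for same-category stand-ins so the text stays fluent. The lexicon and example below are invented for illustration and are not the paper's frameworks:

```python
# Hypothetical sketch of substitutive semantic-preserving distortion:
# each sensitive surface form maps to a coarser, same-category stand-in,
# so the distorted text remains readable. Invented lexicon.
import re

SUBSTITUTIONS = [
    (r"Ms\. Chen", "the patient"),
    (r"78-year-old", "senior"),
    (r"Seattle", "a large city"),
    (r"Green Lake", "a local park"),
]

def distort(text):
    for pattern, stand_in in SUBSTITUTIONS:
        text = re.sub(pattern, stand_in, text)
    return text

note = "Ms. Chen, a 78-year-old from Seattle, jogs at Green Lake daily."
print(distort(note))
# -> "the patient, a senior from a large city, jogs at a local park daily."
```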
arXiv Detail & Related papers (2022-01-04T04:01:05Z) - Subverting Privacy-Preserving GANs: Hiding Secrets in Sanitized Images [13.690485523871855]
State-of-the-art approaches use privacy-preserving generative adversarial networks (PP-GANs) to enable reliable facial expression recognition without leaking users' identity.
We show that it is possible to hide the sensitive identification data in the sanitized output images of such PP-GANs for later extraction.
arXiv Detail & Related papers (2020-09-19T19:02:17Z)
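The subversion can be illustrated with plain least-significant-bit steganography; the paper's actual channel inside PP-GAN outputs is more sophisticated, and everything below is a hypothetical stand-in:

```python
# Hypothetical sketch: a malicious "sanitizer" hides identity bits in the
# least-significant bits of its output image, where a colluding party can
# later recover them. Plain LSB steganography, illustrative only.
import numpy as np

rng = np.random.default_rng(3)
sanitized = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)  # stand-in output

secret = "id42"  # identification data to smuggle out
bits = [int(b) for byte in secret.encode() for b in f"{byte:08b}"]

stego = sanitized.copy().ravel()
stego[: len(bits)] = (stego[: len(bits)] & 0xFE) | np.array(bits, dtype=np.uint8)
stego = stego.reshape(sanitized.shape)  # visually indistinguishable output

recovered_bits = (stego.ravel()[: len(bits)] & 1).tolist()
recovered = bytes(
    int("".join(map(str, recovered_bits[i : i + 8])), 2)
    for i in range(0, len(recovered_bits), 8)
)
print(recovered.decode())  # -> "id42"
```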