Crowdsourcing on Sensitive Data with Privacy-Preserving Text Rewriting
- URL: http://arxiv.org/abs/2303.03053v1
- Date: Mon, 6 Mar 2023 11:54:58 GMT
- Title: Crowdsourcing on Sensitive Data with Privacy-Preserving Text Rewriting
- Authors: Nina Mouhammad, Johannes Daxenberger, Benjamin Schiller, Ivan Habernal
- Abstract summary: Data labeling is often done on crowdsourcing platforms for scalability reasons.
However, publishing data on public platforms is only possible if no privacy-relevant information is included.
We investigate how removing personally identifiable information (PII) as well as applying differential privacy (DP) rewriting can enable text with privacy-relevant information to be used for crowdsourcing.
- Score: 9.409281517596396
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Most tasks in NLP require labeled data. Data labeling is often done on
crowdsourcing platforms for scalability reasons. However, data can only be published
on public platforms if no privacy-relevant information is included. Textual data often
contains sensitive information like person names or locations. In this work, we
investigate how removing personally identifiable information (PII) as well as applying
differential privacy (DP) rewriting can enable text with privacy-relevant information
to be used for crowdsourcing. We find that DP rewriting before crowdsourcing can
preserve privacy while still leading to good label quality for certain tasks and data.
PII removal led to good label quality in all examined tasks; however, it provides no
privacy guarantees.
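For illustration, the PII-removal baseline discussed in the abstract amounts to detecting and masking named entities (person names, locations, organizations) before the text is released to crowdworkers. The sketch below shows one way such a step could look; it is not the authors' pipeline, spaCy and its en_core_web_sm model are assumed choices here, and, as the abstract notes, this approach carries no formal privacy guarantee, unlike DP rewriting.

```python
# Minimal PII-removal sketch (illustrative only, not the authors' pipeline).
# Assumes spaCy and the "en_core_web_sm" model are installed:
#   pip install spacy && python -m spacy download en_core_web_sm
import spacy

PII_LABELS = {"PERSON", "GPE", "LOC", "ORG"}  # entity types treated as PII in this sketch

nlp = spacy.load("en_core_web_sm")

def remove_pii(text: str) -> str:
    """Replace detected PII entities with placeholder tags like [PERSON]."""
    doc = nlp(text)
    pieces, last = [], 0
    for ent in doc.ents:
        if ent.label_ in PII_LABELS:
            pieces.append(text[last:ent.start_char])  # keep text before the entity
            pieces.append(f"[{ent.label_}]")          # mask the entity itself
            last = ent.end_char
    pieces.append(text[last:])                        # keep the trailing text
    return "".join(pieces)

print(remove_pii("Alice Miller met her doctor in Berlin last Tuesday."))
# typically prints: [PERSON] met her doctor in [GPE] last Tuesday.
```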
Related papers
- PrivacyLens: Evaluating Privacy Norm Awareness of Language Models in Action [54.11479432110771]
PrivacyLens is a novel framework designed to extend privacy-sensitive seeds into expressive vignettes and further into agent trajectories.
We instantiate PrivacyLens with a collection of privacy norms grounded in privacy literature and crowdsourced seeds.
State-of-the-art LMs, like GPT-4 and Llama-3-70B, leak sensitive information in 25.68% and 38.69% of cases, even when prompted with privacy-enhancing instructions.
arXiv Detail & Related papers (2024-08-29T17:58:38Z)
- Mind the Privacy Unit! User-Level Differential Privacy for Language Model Fine-Tuning [62.224804688233]
Differential privacy (DP) offers a promising solution by ensuring models are 'almost indistinguishable' with or without any particular privacy unit.
We study user-level DP, motivated by applications where it is necessary to ensure uniform privacy protection across users.
arXiv Detail & Related papers (2024-06-20T13:54:32Z)
- NAP^2: A Benchmark for Naturalness and Privacy-Preserving Text Rewriting by Learning from Human [55.20137833039499]
We suggest sanitizing sensitive text using two common strategies used by humans.
We curate the first corpus, coined NAP2, through both crowdsourcing and the use of large language models.
arXiv Detail & Related papers (2024-06-06T05:07:44Z)
- Personalized Differential Privacy for Ridge Regression [3.4751583941317166]
We introduce our novel Personalized-DP Output Perturbation method (PDP-OP) that enables training Ridge regression models with individual per-data-point privacy levels.
We provide rigorous privacy proofs for PDP-OP as well as accuracy guarantees for the resulting model.
We show that PDP-OP outperforms the personalized privacy techniques of Jorgensen et al.
arXiv Detail & Related papers (2024-01-30T16:00:14Z)
- Honesty is the Best Policy: On the Accuracy of Apple Privacy Labels Compared to Apps' Privacy Policies [13.771909487087793]
Apple introduced privacy labels in Dec. 2020 as a way for developers to report the privacy behaviors of their apps.
While Apple does not validate labels, it also requires developers to provide a privacy policy, which offers an important comparison point.
We fine-tuned BERT-based language models to extract privacy policy features for 474,669 apps on the iOS App Store.
arXiv Detail & Related papers (2023-06-29T16:10:18Z)
- The Overview of Privacy Labels and their Compatibility with Privacy Policies [24.871967983289117]
Privacy nutrition labels provide a way to understand an app's key data practices without reading the long and hard-to-read privacy policies.
Apple and Google have implemented mandates requiring app developers to fill in privacy nutrition labels highlighting their privacy practices.
arXiv Detail & Related papers (2023-03-14T20:10:28Z)
- How Do Input Attributes Impact the Privacy Loss in Differential Privacy? [55.492422758737575]
We study the connection between the per-subject norm in DP neural networks and individual privacy loss.
We introduce a novel metric termed the Privacy Loss-Input Susceptibility (PLIS), which allows one to apportion the subject's privacy loss to their input attributes.
arXiv Detail & Related papers (2022-11-18T11:39:03Z)
- Privacy-Aware Crowd Labelling for Machine Learning Tasks [3.6930948691311007]
We propose a privacy-preserving text labelling method for varying applications, based on crowdsourcing.
We transform text with different levels of privacy and analyse the effectiveness of the transformation with regard to label correlation and consistency.
arXiv Detail & Related papers (2022-02-03T18:14:45Z)
- Privacy Amplification via Shuffling for Linear Contextual Bandits [51.94904361874446]
We study the contextual linear bandit problem with differential privacy (DP).
We show that it is possible to achieve a privacy/utility trade-off between joint DP (JDP) and local DP (LDP) by leveraging the shuffle model of privacy while preserving local privacy.
arXiv Detail & Related papers (2021-12-11T15:23:28Z)
- BeeTrace: A Unified Platform for Secure Contact Tracing that Breaks Data Silos [73.84437456144994]
Contact tracing is an important method to control the spread of an infectious disease such as COVID-19.
Current solutions do not utilize the huge volume of data stored in business databases and individual digital devices.
We propose BeeTrace, a unified platform that breaks data silos and deploys state-of-the-art cryptographic protocols to guarantee privacy goals.
arXiv Detail & Related papers (2020-07-05T10:33:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.