Related papers: A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset

A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset

URL: http://arxiv.org/abs/2506.17185v1
Date: Fri, 20 Jun 2025 17:40:05 GMT
Title: A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset
Authors: Rachel Hong, Jevan Hutson, William Agnew, Imaad Huda, Tadayoshi Kohno, Jamie Morgenstern,
Abstract summary: We ask: What are the legal privacy implications of web-scraped machine learning datasets?<n>In an empirical study of a popular training dataset, we find significant presence of personally identifiable information despite sanitization efforts.
Score: 12.094673476388639
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We investigate the contents of web-scraped data for training AI systems, at sizes where human dataset curators and compilers no longer manually annotate every sample. Building off of prior privacy concerns in machine learning models, we ask: What are the legal privacy implications of web-scraped machine learning datasets? In an empirical study of a popular training dataset, we find significant presence of personally identifiable information despite sanitization efforts. Our audit provides concrete evidence to support the concern that any large-scale web-scraped dataset may contain personal data. We use these findings of a real-world dataset to inform our legal analysis with respect to existing privacy and data protection laws. We surface various privacy risks of current data curation practices that may propagate personal information to downstream models. From our findings, we argue for reorientation of current frameworks of "publicly available" information to meaningfully limit the development of AI built upon indiscriminate scraping of the internet.

Related papers

How to DP-fy Your Data: A Practical Guide to Generating Synthetic Data With Differential Privacy [52.00934156883483]
Differential Privacy (DP) is a framework for reasoning about and limiting information leakage.<n>Differentially Private Synthetic data refers to synthetic data that preserves the overall trends of source data.
arXiv Detail & Related papers (2025-12-02T21:14:39Z)
FT-PrivacyScore: Personalized Privacy Scoring Service for Machine Learning Participation [4.772368796656325]
In practice, controlled data access remains a mainstream method for protecting data privacy in many industrial and research environments. We developed the demo prototype FT-PrivacyScore to show that it's possible to efficiently and quantitatively estimate the privacy risk of participating in a model fine-tuning task.
arXiv Detail & Related papers (2024-10-30T02:41:26Z)
Collection, usage and privacy of mobility data in the enterprise and public administrations [55.2480439325792]
Security measures such as anonymization are needed to protect individuals' privacy. Within our study, we conducted expert interviews to gain insights into practices in the field. We survey privacy-enhancing methods in use, which generally do not comply with state-of-the-art standards of differential privacy.
arXiv Detail & Related papers (2024-07-04T08:29:27Z)
Where you go is who you are -- A study on machine learning based semantic privacy attacks [3.259843027596329]
We present a systematic analysis of two attack scenarios, namely location categorization and user profiling. Experiments on the Foursquare dataset and tracking data demonstrate the potential for abuse of high-quality spatial information. Our findings point out the risks of ever-growing databases of tracking data and spatial context data.
arXiv Detail & Related papers (2023-10-26T17:56:50Z)
A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibited data access and data sharing. Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data. Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
arXiv Detail & Related papers (2023-09-27T14:38:16Z)
Protecting User Privacy in Online Settings via Supervised Learning [69.38374877559423]
We design an intelligent approach to online privacy protection that leverages supervised learning. By detecting and blocking data collection that might infringe on a user's privacy, we can restore a degree of digital privacy to the user.
arXiv Detail & Related papers (2023-04-06T05:20:16Z)
Position: Considerations for Differentially Private Learning with Large-Scale Public Pretraining [75.25943383604266]
We question whether the use of large Web-scraped datasets should be viewed as differential-privacy-preserving. We caution that publicizing these models pretrained on Web data as "private" could lead to harm and erode the public's trust in differential privacy as a meaningful definition of privacy. We conclude by discussing potential paths forward for the field of private learning, as public pretraining becomes more popular and powerful.
arXiv Detail & Related papers (2022-12-13T10:41:12Z)
Certified Data Removal in Sum-Product Networks [78.27542864367821]
Deleting the collected data is often insufficient to guarantee data privacy. UnlearnSPN is an algorithm that removes the influence of single data points from a trained sum-product network.
arXiv Detail & Related papers (2022-10-04T08:22:37Z)
PRIVEE: A Visual Analytic Workflow for Proactive Privacy Risk Inspection of Open Data [3.2136309934080867]
Open data sets that contain personal information are susceptible to adversarial attacks even when anonymized. We develop a visual analytic solution that enables data defenders to gain awareness about the disclosure risks in local, joinable data neighborhoods. We use this problem and domain characterization to develop a set of visual analytic interventions as a defense mechanism.
arXiv Detail & Related papers (2022-08-12T19:57:09Z)
Survey: Leakage and Privacy at Inference Time [59.957056214792665]
Leakage of data from publicly available Machine Learning (ML) models is an area of growing significance. We focus on inference-time leakage, as the most likely scenario for publicly available models. We propose a taxonomy across involuntary and malevolent leakage, available defences, followed by the currently available assessment metrics and applications.
arXiv Detail & Related papers (2021-07-04T12:59:16Z)
Security and Privacy Preserving Deep Learning [2.322461721824713]
Massive data collection required for deep learning presents obvious privacy issues. Users personal, highly sensitive data such as photos and voice recordings are kept indefinitely by the companies that collect it. Deep neural networks are susceptible to various inference attacks as they remember information about their training data.
arXiv Detail & Related papers (2020-06-23T01:53:46Z)
Privacy in Deep Learning: A Survey [16.278779275923448]
The ever-growing advances of deep learning in many areas have led to the adoption of Deep Neural Networks (DNNs) in production systems. The availability of large datasets and high computational power are the main contributors to these advances. This poses serious privacy concerns as this data can be misused or leaked through various vulnerabilities.
arXiv Detail & Related papers (2020-04-25T23:47:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.