SoK: Data Minimization in Machine Learning
- URL: http://arxiv.org/abs/2508.10836v1
- Date: Thu, 14 Aug 2025 17:00:13 GMT
- Title: SoK: Data Minimization in Machine Learning
- Authors: Robin Staab, Nikola Jovanović, Kimberly Mai, Prakhar Ganesh, Martin Vechev, Ferdinando Fioretto, Matthew Jagielski
- Abstract summary: Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. The relevance of data minimization is particularly pronounced in machine learning (ML) applications. Existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This work introduces a comprehensive framework for DMML, including a unified data pipeline, adversaries, and points of minimization.
- Score: 49.60064304454055
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data minimization (DM) describes the principle of collecting only the data strictly necessary for a given task. It is a foundational principle across major data protection regulations like GDPR and CPRA. Violations of this principle have substantial real-world consequences, with regulatory actions resulting in fines reaching hundreds of millions of dollars. Notably, the relevance of data minimization is particularly pronounced in machine learning (ML) applications, which typically rely on large datasets, resulting in an emerging research area known as Data Minimization in Machine Learning (DMML). At the same time, existing work on other ML privacy and security topics often addresses concerns relevant to DMML without explicitly acknowledging the connection. This disconnect leads to confusion among practitioners, complicating their efforts to implement DM principles and interpret the terminology, metrics, and evaluation criteria used across different research communities. To address this gap, our work introduces a comprehensive framework for DMML, including a unified data pipeline, adversaries, and points of minimization. This framework allows us to systematically review the literature on data minimization and DM-adjacent methodologies, for the first time presenting a structured overview designed to help practitioners and researchers effectively apply DM principles. Our work facilitates a unified DM-centric understanding and broader adoption of data minimization strategies in AI/ML.
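The abstract names the framework's components (a unified data pipeline, adversaries, and points of minimization) without detailing them. As a purely illustrative sketch of how points of minimization might attach to stages of an ML data pipeline, all names and stages below are our own assumptions, not the paper's framework:

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Callable, List

class MinimizationPoint(Enum):
    """Hypothetical stages at which data can be minimized."""
    COLLECTION = auto()     # collect fewer records or attributes up front
    PREPROCESSING = auto()  # generalize, coarsen, or drop features
    TRAINING = auto()       # e.g., select a subset before fitting the model
    RETENTION = auto()      # delete raw data once it is no longer needed

@dataclass
class MinimizationStep:
    point: MinimizationPoint
    transform: Callable     # maps a dataset to a reduced dataset
    rationale: str          # documented necessity, useful for audits

@dataclass
class DMPipeline:
    steps: List[MinimizationStep] = field(default_factory=list)

    def run(self, dataset):
        for step in self.steps:
            dataset = step.transform(dataset)
        return dataset

# Example: drop a sensitive column that is unnecessary for the task.
pipeline = DMPipeline(steps=[
    MinimizationStep(
        point=MinimizationPoint.PREPROCESSING,
        transform=lambda rows: [{k: v for k, v in r.items() if k != "ssn"}
                                for r in rows],
        rationale="SSN is not necessary for the prediction task",
    )
])
print(pipeline.run([{"age": 41, "ssn": "000-00-0000"}]))  # [{'age': 41}]
```

In a sketch like this, the `rationale` field is what a practitioner would point to when documenting necessity under regulations such as GDPR or CPRA.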
Related papers
- Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs [66.63911043019294]
Data preparation aims to denoise raw datasets, uncover cross-dataset relationships, and extract valuable insights from them. This paper focuses on the use of LLM techniques to prepare data for diverse downstream tasks. We introduce a task-centric taxonomy that organizes the field into major tasks, including data cleaning, standardization, error processing, imputation, data integration, and data enrichment.
arXiv Detail & Related papers (2026-01-22T12:02:45Z)
- PerProb: Indirectly Evaluating Memorization in Large Language Models [13.905375956316632]
We propose PerProb, a label-free framework for indirectly assessing LLM vulnerabilities. PerProb evaluates changes in perplexity and average log probability between data generated by victim and adversary models. We evaluate PerProb's effectiveness across five datasets, revealing varying memorization behaviors and privacy risks.
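The comparison above relies on two standard quantities, perplexity and average token log probability. A minimal sketch of how they are computed and compared; the victim-vs-adversary gap logic is our reading of the general idea, not PerProb's exact scoring rule:

```python
import math

def avg_logprob_and_perplexity(token_logprobs):
    """Aggregate per-token log probabilities (natural log) into the
    average log probability and the corresponding perplexity."""
    avg_lp = sum(token_logprobs) / len(token_logprobs)
    return avg_lp, math.exp(-avg_lp)

# Hypothetical per-token log-probs for the same text scored by two models.
victim_lp, victim_ppl = avg_logprob_and_perplexity([-0.8, -1.1, -0.6])
adversary_lp, adversary_ppl = avg_logprob_and_perplexity([-2.3, -1.9, -2.7])

# Intuition: text the victim model has memorized tends to get a much
# lower perplexity from the victim than from an unrelated adversary model.
gap = adversary_ppl - victim_ppl
print(f"perplexity gap: {gap:.2f}")
```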
arXiv Detail & Related papers (2025-12-16T17:10:01Z)
- Towards Human-Guided, Data-Centric LLM Co-Pilots [53.35493881390917]
CliMB-DC is a human-guided, data-centric framework for machine learning co-pilots. It combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. We show how CliMB-DC can transform uncurated datasets into ML-ready formats.
arXiv Detail & Related papers (2025-01-17T17:51:22Z)
- Detecting Training Data of Large Language Models via Expectation Maximization [62.28028046993391]
We introduce EM-MIA, a novel membership inference method that iteratively refines membership scores and prefix scores via an expectation-maximization algorithm. EM-MIA achieves state-of-the-art results on WikiMIA.
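The summary does not give EM-MIA's exact score and prefix-score updates. As a generic illustration of refining raw membership scores with expectation-maximization, one can fit a two-component Gaussian mixture to the scores and read off posterior responsibilities; this is a stand-in for the idea, not EM-MIA's algorithm:

```python
import numpy as np

def em_refine_scores(scores, n_iters=50):
    """Fit a two-component 1-D Gaussian mixture to raw membership scores
    with EM; the posterior of the high-score component serves as a
    refined membership score per example."""
    x = np.asarray(scores, dtype=float)
    mu = np.array([x.min(), x.max()])          # init means at the extremes
    sigma = np.array([x.std() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])                  # mixing weights
    for _ in range(n_iters):
        # E-step: responsibilities of each component for each score
        pdf = np.stack([
            pi[k] * np.exp(-0.5 * ((x - mu[k]) / sigma[k]) ** 2) / sigma[k]
            for k in range(2)
        ])
        resp = pdf / pdf.sum(axis=0)
        # M-step: re-estimate means, variances, and mixing weights
        for k in range(2):
            w = resp[k]
            mu[k] = (w * x).sum() / w.sum()
            sigma[k] = np.sqrt((w * (x - mu[k]) ** 2).sum() / w.sum()) + 1e-6
            pi[k] = w.mean()
    return resp[1]  # P(member-like component | score)

print(em_refine_scores([0.1, 0.2, 1.9, 2.1, 0.15]))
```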
arXiv Detail & Related papers (2024-10-10T03:31:16Z)
- The trade-off between data minimization and fairness in collaborative filtering [1.8936798735951967]
General Data Protection Regulations aim to safeguard individuals' personal information from harm.
While full compliance is mandatory in the EU, it is not elsewhere.
This paper studies the relationship between principles of data minimization and fairness in recommender systems.
arXiv Detail & Related papers (2024-09-21T02:32:26Z)
- The Data Minimization Principle in Machine Learning [61.17813282782266]
Data minimization aims to reduce the amount of data collected, processed or retained.
It has been endorsed by various global data protection regulations.
However, its practical implementation remains a challenge due to the lack of a rigorous formulation.
arXiv Detail & Related papers (2024-05-29T19:40:27Z)
- From Principle to Practice: Vertical Data Minimization for Machine Learning [15.880586296169687]
Policymakers increasingly demand compliance with the data minimization (DM) principle.
Despite regulatory pressure, the problem of deploying machine learning models that obey DM has so far received little attention.
We propose a novel vertical DM (vDM) workflow based on data generalization.
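Data generalization replaces exact attribute values with coarser ones so that less precise data is processed downstream. A toy sketch of column-wise generalization; the paper's vDM workflow additionally selects generalization levels against utility and privacy targets, which is not reproduced here:

```python
def generalize_age(age: int, bin_width: int = 10) -> str:
    """Map an exact age to a coarse interval, e.g. 37 -> '30-39'."""
    lo = (age // bin_width) * bin_width
    return f"{lo}-{lo + bin_width - 1}"

def generalize_records(records, bin_width=10):
    """Coarsen the 'age' column of every record, leaving the rest intact."""
    return [{**r, "age": generalize_age(r["age"], bin_width)} for r in records]

print(generalize_records([{"age": 37, "zip": "10001"}]))
# [{'age': '30-39', 'zip': '10001'}]
```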
arXiv Detail & Related papers (2023-11-17T13:01:09Z)
- Learning to Limit Data Collection via Scaling Laws: Data Minimization Compliance in Practice [62.44110411199835]
We build on literature in machine learning and law to propose a framework for limiting data collection, based on an interpretation of data minimization that ties collected data to system performance.
We formalize a data minimization criterion based on performance curve derivatives and provide an effective and interpretable piecewise power law technique.
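A minimal sketch of that criterion under our own simplifications (a single power-law fit rather than the paper's piecewise technique, and invented learning-curve numbers): fit error(n) ~= a * n^(-b) + c to observed (size, error) pairs and stop collecting once the fitted curve's derivative drops below a threshold.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Assumed learning curve: validation error as a function of data size."""
    return a * n ** (-b) + c

def marginal_gain(n, a, b, c):
    """|d error / d n|: expected error reduction per additional example."""
    return a * b * n ** (-b - 1)

# Invented learning-curve measurements: (num examples, validation error).
sizes = np.array([100.0, 200.0, 400.0, 800.0, 1600.0])
errors = np.array([0.40, 0.31, 0.25, 0.21, 0.19])

(a, b, c), _ = curve_fit(power_law, sizes, errors, p0=(1.0, 0.5, 0.1))

# Stopping rule: collect more data only while the fitted curve predicts
# a per-example error reduction above the threshold (capped for safety).
threshold, n = 1e-5, sizes[-1]
while marginal_gain(n, a, b, c) > threshold and n < 1e7:
    n *= 2  # hypothetical next collection batch: double the dataset
print(f"stop collecting around n ~ {int(n)} examples")
```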
arXiv Detail & Related papers (2021-07-16T19:59:01Z)
- Operationalizing the Legal Principle of Data Minimization for Personalization [64.0027026050706]
We identify a lack of a homogeneous interpretation of the data minimization principle and explore two operational definitions applicable in the context of personalization.
We find that the performance decrease incurred by data minimization might not be substantial, but it might disparately impact different users.
arXiv Detail & Related papers (2020-05-28T00:43:06Z)