A Certified Unlearning Approach without Access to Source Data
- URL: http://arxiv.org/abs/2506.06486v1
- Date: Fri, 06 Jun 2025 19:22:47 GMT
- Title: A Certified Unlearning Approach without Access to Source Data
- Authors: Umit Yigit Basaran, Sk Miraj Ahmed, Amit Roy-Chowdhury, Basak Guler
- Abstract summary: We propose a certified unlearning framework that enables effective data removal. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data. Results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.
- Score: 4.585544474674649
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growing adoption of data privacy regulations, the ability to erase private or copyrighted information from trained models has become a crucial requirement. Traditional unlearning methods often assume access to the complete training dataset, which is unrealistic in scenarios where the source data is no longer available. To address this challenge, we propose a certified unlearning framework that enables effective data removal without access to the original training data samples. Our approach utilizes a surrogate dataset that approximates the statistical properties of the source data, allowing for controlled noise scaling based on the statistical distance between the two. While our theoretical guarantees assume knowledge of the exact statistical distance, practical implementations typically approximate this distance, resulting in potentially weaker but still meaningful privacy guarantees. This ensures strong guarantees on the model's behavior post-unlearning while maintaining its overall utility. We establish theoretical bounds, introduce practical noise calibration techniques, and validate our method through extensive experiments on both synthetic and real-world datasets. The results demonstrate the effectiveness and reliability of our approach in privacy-sensitive settings.
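The abstract describes an influence-style removal update computed from surrogate data, with Gaussian noise whose scale grows with the surrogate-source statistical distance. A minimal sketch of that idea follows; all names, the Newton-step form, and the linear noise-scaling rule are illustrative assumptions, not the paper's exact mechanism:

```python
import numpy as np

def certified_unlearn_sketch(theta, grad_forget_fn, hessian_inv, stat_distance,
                             base_sigma=0.1, dist_scale=1.0, rng=None):
    """Hypothetical surrogate-based certified unlearning step.

    theta          : flattened model parameters (np.ndarray)
    grad_forget_fn : gradient of the loss on the forget set at theta
    hessian_inv    : inverse-Hessian approximation built from the surrogate data
    stat_distance  : estimated statistical distance between surrogate and source
    """
    rng = np.random.default_rng(rng)
    # Influence-style removal step (Newton update) using surrogate curvature.
    update = hessian_inv @ grad_forget_fn(theta)
    # Scale the certification noise with the surrogate-source distance:
    # the farther the surrogate is from the true source distribution,
    # the more noise is needed to retain the privacy guarantee.
    sigma = base_sigma * (1.0 + dist_scale * stat_distance)
    noise = rng.normal(0.0, sigma, size=theta.shape)
    return theta + update + noise
```

With `base_sigma=0` the sketch reduces to a plain Newton removal step, which makes the role of the distance-dependent noise term easy to isolate in experiments.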
Related papers
- Unlocking Post-hoc Dataset Inference with Synthetic Data [11.886166976507711]
Training datasets are often scraped from the internet without respecting data owners' intellectual property rights. Dataset Inference (DI) offers a potential remedy by identifying whether a suspect dataset was used in training. Existing DI methods require a private held-out set, known to be absent from training, that closely matches the compromised dataset's distribution. In this work, we address this challenge by synthetically generating the required held-out set.
arXiv Detail & Related papers (2025-06-18T08:46:59Z) - Privacy-Preserved Automated Scoring using Federated Learning for Educational Research [1.2556373621040728]
We propose a federated learning (FL) framework for automated scoring of educational assessments. We benchmark our model against two state-of-the-art FL methods and a centralized learning baseline. Results show that our model achieves the highest accuracy (94.5%) among FL approaches.
arXiv Detail & Related papers (2025-03-12T19:06:25Z) - Targeted Learning for Data Fairness [52.59573714151884]
We expand fairness inference by evaluating fairness in the data-generating process itself. We derive estimators for demographic parity, equal opportunity, and conditional mutual information. To validate our approach, we perform several simulations and apply our estimators to real data.
arXiv Detail & Related papers (2025-02-06T18:51:28Z) - Navigating Towards Fairness with Data Selection [27.731128352096555]
We introduce a data selection method designed to efficiently and flexibly mitigate label bias. Our approach utilizes a zero-shot predictor as a proxy model that simulates training on a clean holdout set. Our modality-agnostic method has proven efficient and effective in handling label bias and improving fairness across diverse datasets in experimental evaluations.
arXiv Detail & Related papers (2024-12-15T06:11:05Z) - Pseudo-Probability Unlearning: Towards Efficient and Privacy-Preserving Machine Unlearning [59.29849532966454]
We propose Pseudo-Probability Unlearning (PPU), a novel method that enables models to forget data in a privacy-preserving manner.
Our method achieves over 20% improvements in forgetting error compared to the state-of-the-art.
arXiv Detail & Related papers (2024-11-04T21:27:06Z) - Source-Free Domain-Invariant Performance Prediction [68.39031800809553]
We propose a source-free approach centred on uncertainty-based estimation, using a generative model for calibration in the absence of source data.
Our experiments on benchmark object recognition datasets reveal that existing source-based methods fall short with limited source sample availability.
Our approach significantly outperforms the current state-of-the-art source-free and source-based methods, affirming its effectiveness in domain-invariant performance estimation.
arXiv Detail & Related papers (2024-08-05T03:18:58Z) - Uncertainty for Active Learning on Graphs [70.44714133412592]
Uncertainty Sampling is an Active Learning strategy that aims to improve the data efficiency of machine learning models. We benchmark Uncertainty Sampling beyond predictive uncertainty and highlight a significant performance gap to other Active Learning strategies. We develop ground-truth Bayesian uncertainty estimates in terms of the data-generating process and prove their effectiveness in guiding Uncertainty Sampling toward optimal queries.
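The standard Uncertainty Sampling criterion that this entry benchmarks against can be sketched as entropy-based query selection; this is the generic textbook form, not the paper's Bayesian estimator:

```python
import numpy as np

def entropy_query(prob_matrix, k=1):
    """Pick the k unlabeled examples whose predictive distribution has the
    highest entropy (a common Uncertainty Sampling criterion; sketch only).

    prob_matrix: (n_examples, n_classes) array of predicted class probabilities.
    Returns the indices of the k most uncertain examples.
    """
    eps = 1e-12  # avoid log(0) for confident predictions
    ent = -np.sum(prob_matrix * np.log(prob_matrix + eps), axis=1)
    return np.argsort(-ent)[:k]
```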
arXiv Detail & Related papers (2024-05-02T16:50:47Z) - Ungeneralizable Examples [70.76487163068109]
Current approaches to creating unlearnable data involve incorporating small, specially designed noise.
We extend the concept of unlearnable data to conditional data learnability and introduce UnGeneralizable Examples (UGEs).
UGEs exhibit learnability for authorized users while maintaining unlearnability for potential hackers.
arXiv Detail & Related papers (2024-04-22T09:29:14Z) - Label-Agnostic Forgetting: A Supervision-Free Unlearning in Deep Models [7.742594744641462]
Machine unlearning aims to remove information derived from forgotten data while preserving that of the remaining dataset in a well-trained model.
We propose a supervision-free unlearning approach that operates without the need for labels during the unlearning process.
arXiv Detail & Related papers (2024-03-31T00:29:00Z) - Online Performance Estimation with Unlabeled Data: A Bayesian Application of the Hui-Walter Paradigm [0.0]
We adapt the Hui-Walter paradigm, a method traditionally applied in epidemiology and medicine, to the field of machine learning.
We estimate key performance metrics such as false positive rate, false negative rate, and priors in scenarios where no ground truth is available.
We extend this paradigm for handling online data, opening up new possibilities for dynamic data environments.
arXiv Detail & Related papers (2024-01-17T17:46:10Z) - Self-training via Metric Learning for Source-Free Domain Adaptation of Semantic Segmentation [3.1460691683829825]
Unsupervised source-free domain adaptation methods aim to train a model for the target domain utilizing a pretrained source-domain model and unlabeled target-domain data.
Traditional methods usually use self-training with pseudo-labeling, which is often subjected to thresholding based on prediction confidence.
We propose a novel approach by incorporating a mean-teacher model, wherein the student network is trained using all predictions from the teacher network.
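The mean-teacher setup in this entry keeps a teacher model whose weights track an exponential moving average (EMA) of the student's weights; the student then learns from the teacher's predictions. A minimal sketch of the EMA step (generic form, with an assumed momentum value; the paper's metric-learning components are not shown):

```python
def ema_update(teacher_params, student_params, momentum=0.99):
    """One mean-teacher EMA step: the teacher's weights become a weighted
    average of their previous value and the current student weights."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In practice this update runs after every student optimization step, so the teacher is a smoothed, more stable version of the student that produces the pseudo-labels.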
arXiv Detail & Related papers (2022-12-08T12:20:35Z) - Self-Supervised Noisy Label Learning for Source-Free Unsupervised Domain Adaptation [87.60688582088194]
We propose a novel Self-Supervised Noisy Label Learning method.
Our method can easily achieve state-of-the-art results and surpass other methods by a very large margin.
arXiv Detail & Related papers (2021-02-23T10:51:45Z) - Fair Densities via Boosting the Sufficient Statistics of Exponential Families [72.34223801798422]
We introduce a boosting algorithm to pre-process data for fairness.
Our approach shifts towards better data fitting while still ensuring a minimal fairness guarantee.
Empirical results are presented to demonstrate the quality of the results on real-world data.
arXiv Detail & Related papers (2020-12-01T00:49:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.