Data Provenance via Differential Auditing
- URL: http://arxiv.org/abs/2209.01538v1
- Date: Sun, 4 Sep 2022 06:02:25 GMT
- Title: Data Provenance via Differential Auditing
- Authors: Xin Mu, Ming Pang, Feida Zhu
- Abstract summary: We introduce Data Provenance via Differential Auditing (DPDA), a practical framework for auditing data provenance.
We propose two effective auditing function implementations, an additive one and a multiplicative one.
We report evaluations on real-world data sets demonstrating the effectiveness of our proposed auditing technique.
- Score: 5.7962871424710665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Auditing Data Provenance (ADP), i.e., auditing whether a certain piece of data has
been used to train a machine learning model, is an important problem in data
provenance. The feasibility of the task has been demonstrated by existing
auditing techniques, e.g., shadow auditing methods, under certain conditions
such as the availability of label information and the knowledge of training
protocols for the target model. Unfortunately, both of these conditions are
often unavailable in real applications. In this paper, we introduce Data
Provenance via Differential Auditing (DPDA), a practical framework for auditing
data provenance with a different approach based on statistically significant
differentials, i.e., after carefully designed transformation, perturbed input
data from the target model's training set would result in much more drastic
changes in the output than those from the model's non-training set. This
framework allows auditors to distinguish training data from non-training data
without the need to train shadow models or to rely on labeled output data.
Furthermore, we propose two effective auditing function implementations,
an additive one and a multiplicative one. We report evaluations on real-world
data sets demonstrating the effectiveness of our proposed auditing technique.
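To make the differential idea concrete, below is a minimal black-box sketch loosely in the spirit of the additive variant: an example is scored by how much the model's output moves under small additive perturbations of the input, with larger movement taken as evidence of training-set membership. The Gaussian noise, the L2 output-change statistic, and the fixed threshold are illustrative assumptions, not the paper's exact auditing function.

```python
import numpy as np

def differential_audit(model_predict, x, sigma=0.05, n_perturb=20, threshold=0.1):
    """Score a single example for training-set membership by checking how
    strongly small input perturbations change the model's output.

    model_predict : callable mapping an array of shape (n, d) to class
                    probabilities of shape (n, c) (black-box access only).
    sigma         : standard deviation of the additive Gaussian perturbation
                    (an illustrative choice, not the paper's setting).
    threshold     : decision threshold; in practice it would be calibrated on
                    data known to be outside the training set.
    """
    base = model_predict(x[None, :])[0]                    # unperturbed output
    noise = np.random.normal(0.0, sigma, size=(n_perturb, x.size))
    perturbed = model_predict(x[None, :] + noise)          # outputs under perturbation
    # Average L2 change of the output caused by the perturbations: training-set
    # members are expected to show larger changes than non-members.
    differential = float(np.mean(np.linalg.norm(perturbed - base, axis=1)))
    return differential > threshold, differential
```

With a scikit-learn-style classifier, `model_predict` could be `clf.predict_proba`, and `threshold` would typically be calibrated on data known to lie outside the training set.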
Related papers
- Privacy-Preserving Model and Preprocessing Verification for Machine Learning [9.4033740844828]
This paper presents a framework for privacy-preserving verification of machine learning models, focusing on models trained on sensitive data.
It addresses two key tasks: binary classification, to verify if a target model was trained correctly by applying the appropriate preprocessing steps, and multi-class classification, to identify specific preprocessing errors.
Results indicate that although verification accuracy varies across datasets and noise levels, the framework provides effective detection of preprocessing errors, strong privacy guarantees, and practical applicability for safeguarding sensitive data.
arXiv Detail & Related papers (2025-01-14T16:21:54Z)
- Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training.
We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO.
As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
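For context on why the retraining requirement mentioned in the Data Shapley entry above is expensive, here is a minimal sketch of the classic permutation-sampling (Monte Carlo) Data Shapley estimator that In-Run Data Shapley aims to replace; `train_model`, `utility`, and the baseline score are illustrative placeholders, not that paper's method.

```python
import numpy as np

def monte_carlo_data_shapley(train_model, utility, data, n_perm=100, empty_score=0.0):
    """Classic retraining-based Monte Carlo estimate of Data Shapley values.

    train_model : callable taking a list of training points and returning a
                  fitted model (retrained from scratch on every call -- this is
                  the computational bottleneck the entry above refers to).
    utility     : callable mapping a fitted model to a scalar score, e.g.
                  validation accuracy.
    data        : list of training points.
    empty_score : utility assigned to the empty training set (assumed baseline).
    """
    n = len(data)
    values = np.zeros(n)
    for _ in range(n_perm):
        perm = np.random.permutation(n)
        subset, prev_score = [], empty_score
        for idx in perm:
            subset.append(data[idx])
            score = utility(train_model(subset))   # one full retraining per step
            values[idx] += score - prev_score      # marginal contribution of data[idx]
            prev_score = score
    return values / n_perm
```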
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift between training and test data by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed Class-Aware Feature Alignment (CAFA), which encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Managing dataset shift by adversarial validation for credit scoring [5.560471251954645]
A mismatch between the distribution of the training data and the data the model must actually predict on is likely to cause poor model performance.
We propose a method based on adversarial validation to alleviate the dataset shift problem in credit scoring scenarios.
arXiv Detail & Related papers (2021-12-19T07:07:15Z)
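A generic sketch of the adversarial-validation recipe referenced in the entry above: label training rows 0 and test-time rows 1, train a classifier to tell them apart, and read the cross-validated AUC as a shift indicator. The gradient-boosting model and five folds are arbitrary choices here; the paper's credit-scoring specifics are not reproduced.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

def adversarial_validation_auc(X_train, X_test, n_splits=5):
    """Estimate dataset shift by training a classifier to separate training
    rows (label 0) from test rows (label 1)."""
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    clf = GradientBoostingClassifier()
    # AUC near 0.5 means the two samples are hard to tell apart (little shift);
    # AUC well above 0.5 signals a distribution mismatch.
    aucs = cross_val_score(clf, X, y, cv=n_splits, scoring="roc_auc")
    return float(aucs.mean())
```

The same classifier's predicted probabilities can also be used to pick a validation subset that most resembles the data to be predicted.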
- DIVA: Dataset Derivative of a Learning Task [108.18912044384213]
We present a method to compute the derivative of a learning task with respect to a dataset.
A learning task is a function from a training set to the validation error, which can be represented by a trained deep neural network (DNN).
The "dataset derivative" is a linear operator, computed around the trained model, that informs how perturbing the weight of each training sample affects the validation error.
arXiv Detail & Related papers (2021-11-18T16:33:12Z)
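One way to formalize the "dataset derivative" described in the DIVA entry above, with notation chosen here for illustration (the paper's exact definitions may differ): attach a weight to each training sample and differentiate the validation error with respect to those weights around the trained model.

```latex
% Per-sample weights w enter the training objective (illustrative notation):
\theta^{*}(w) = \arg\min_{\theta} \sum_{i} w_i\, \ell_i(\theta)
% Validation error as a function of those weights:
E(w) = \mathcal{L}_{\mathrm{val}}\!\big(\theta^{*}(w)\big)
% The dataset derivative is the linearization of E around the trained model,
% i.e. around uniform weights w = 1:
D = \left.\nabla_{w} E(w)\right|_{w=\mathbf{1}}
% Its i-th entry estimates how perturbing the weight of training sample i
% changes the validation error.
```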
- Self Training with Ensemble of Teacher Models [8.257085583227695]
Training robust deep learning models requires large amounts of labelled data.
In the absence of such large repositories of labelled data, unlabeled data can be exploited for the same purpose.
Semi-supervised learning aims to utilize such unlabeled data for training classification models.
arXiv Detail & Related papers (2021-07-17T09:44:09Z)
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.