Data Provenance via Differential Auditing
- URL: http://arxiv.org/abs/2209.01538v1
- Date: Sun, 4 Sep 2022 06:02:25 GMT
- Title: Data Provenance via Differential Auditing
- Authors: Xin Mu, Ming Pang, Feida Zhu
- Abstract summary: We introduce Data Provenance via Differential Auditing (DPDA), a practical framework for auditing data provenance.
We propose two effective auditing function implementations, an additive one and a multiplicative one.
We report evaluations on real-world data sets demonstrating the effectiveness of our proposed auditing technique.
- Score: 5.7962871424710665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Auditing Data Provenance (ADP), i.e., auditing if a certain piece of data has
been used to train a machine learning model, is an important problem in data
provenance. The feasibility of the task has been demonstrated by existing
auditing techniques, e.g., shadow auditing methods, under certain conditions
such as the availability of label information and the knowledge of training
protocols for the target model. Unfortunately, both of these conditions are
often unmet in real applications. In this paper, we introduce Data
Provenance via Differential Auditing (DPDA), a practical framework for auditing
data provenance with a different approach based on statistically significant
differentials, i.e., after a carefully designed transformation, perturbed
input data from the target model's training set results in much more drastic
changes in the output than input data from the model's non-training set. This
framework allows auditors to distinguish training data from non-training data
without needing to train any shadow models with the help of labeled output
data. Furthermore, we propose two effective auditing function implementations,
an additive one and a multiplicative one. We report evaluations on real-world
data sets demonstrating the effectiveness of our proposed auditing technique.
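To make the differential intuition concrete, below is a minimal black-box sketch of such an audit: perturb an input, measure how much the model's output moves, and flag inputs whose differential score exceeds an auditor-chosen threshold. The Gaussian perturbation, the `model` callable returning class probabilities, and the threshold are illustrative assumptions, not the paper's actual transformation or calibration.

```python
import numpy as np

def differential_score(model, x, sigma=0.05, n_trials=16, mode="additive"):
    """Illustrative sketch (not the paper's exact auditing functions):
    score how strongly the model's output reacts to small input
    perturbations. Per the DPDA intuition, training-set members should
    react more drastically than non-members."""
    p0 = model(x)  # baseline output distribution, shape (n_classes,)
    diffs = []
    for _ in range(n_trials):
        x_pert = x + np.random.normal(0.0, sigma, size=x.shape)
        p1 = model(x_pert)
        if mode == "additive":
            # additive differential: absolute change in the output vector
            diffs.append(np.abs(p1 - p0).sum())
        else:
            # multiplicative differential: relative change in the output
            diffs.append(np.abs(p1 / (p0 + 1e-12) - 1.0).sum())
    return float(np.mean(diffs))

def audit(model, x, threshold):
    """Flag x as a suspected training sample; the threshold is assumed
    to be calibrated by the auditor on known non-training data."""
    return differential_score(model, x) > threshold
```

Note that, as the abstract emphasizes, the decision uses only the target model's outputs on perturbed queries: no shadow models and no labeled output data are involved.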
Related papers
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
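For a sense of how attribution can work without retraining, the sketch below scores each training example of a logistic model by how well its per-sample gradient aligns with the validation gradient, a first-order proxy for marginal contribution at the current parameters. This simplification is an assumption for illustration; In-Run Data Shapley's actual estimator is more involved and accumulates contributions over the training run.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Per-example gradients of the logistic loss w.r.t. weights w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
    return (p - y)[:, None] * X        # shape (n_samples, n_features)

def first_order_attribution(w, X_train, y_train, X_val, y_val):
    """Illustrative first-order score, not the paper's estimator: credit
    each training point by the alignment of its gradient with the mean
    validation gradient. A positive score means a descent step on that
    example would also reduce the validation loss."""
    g_train = per_sample_grads(w, X_train, y_train)
    g_val = per_sample_grads(w, X_val, y_val).mean(axis=0)
    return g_train @ g_val   # one score per training example
```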
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed Class-Aware Feature Alignment (CAFA), which encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
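As a hedged sketch of the class-aware idea: pull each test-time feature toward a per-class source statistic selected by the model's own pseudo-label, so adaptation stays class-discriminative rather than collapsing classes together. The Euclidean distance, the precomputed `class_means`, and the argmax pseudo-labels are simplifying assumptions; the published CAFA loss is more elaborate.

```python
import torch

def class_aware_alignment_loss(feats, logits, class_means):
    """Simplified stand-in for a class-aware alignment loss (assumed
    form, not CAFA's exact objective). feats: (B, D) test features,
    logits: (B, C) model outputs, class_means: (C, D) per-class feature
    means precomputed on source data."""
    pseudo = logits.argmax(dim=1)   # pseudo-label each test sample
    target = class_means[pseudo]    # (B, D): matched class statistic
    # pull each feature toward the mean of its pseudo-labeled class
    return ((feats - target) ** 2).sum(dim=1).mean()
```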
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Managing dataset shift by adversarial validation for credit scoring [5.560471251954645]
A mismatch between the distribution of the training data and the data that actually needs to be predicted is likely to cause poor model performance.
We propose a method based on adversarial validation to alleviate the dataset shift problem in credit scoring scenarios.
arXiv Detail & Related papers (2021-12-19T07:07:15Z)
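The core of adversarial validation is simple enough to sketch directly: train a classifier to distinguish training rows from test rows and read the cross-validated AUC as a shift detector, with 0.5 meaning the two sets are indistinguishable. This is the generic recipe on assumed tabular features; the paper's credit-scoring pipeline builds further steps on top of it.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_validation_auc(X_train, X_test):
    """Generic adversarial validation (illustrative, not the paper's
    full pipeline): label training rows 0 and test rows 1, then see how
    well a classifier can tell them apart. AUC near 0.5 means no
    detectable shift; AUC near 1.0 signals severe dataset shift."""
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    proba = cross_val_predict(GradientBoostingClassifier(), X, y,
                              cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)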
- DIVA: Dataset Derivative of a Learning Task [108.18912044384213]
We present a method to compute the derivative of a learning task with respect to a dataset.
A learning task is a function from a training set to the validation error, which can be represented by a trained deep neural network (DNN).
The "dataset derivative" is a linear operator, computed around the trained model, that indicates how perturbing the weight of each training sample affects the validation error.
arXiv Detail & Related papers (2021-11-18T16:33:12Z)
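To illustrate the quantity DIVA computes, the sketch below gives each training sample an importance weight in a closed-form weighted ridge regression and estimates d(validation error)/d(weight) by finite differences. The ridge model, the MSE objective, and the finite-difference estimator are assumptions for illustration; DIVA obtains this linear operator in closed form around a trained DNN.

```python
import numpy as np

def val_error(weights, X_tr, y_tr, X_val, y_val, lam=1e-3):
    """Validation MSE of a weighted ridge fit, where training sample i
    enters the normal equations with importance weights[i]."""
    A = X_tr.T @ (weights[:, None] * X_tr) + lam * np.eye(X_tr.shape[1])
    theta = np.linalg.solve(A, X_tr.T @ (weights * y_tr))
    return np.mean((X_val @ theta - y_val) ** 2)

def dataset_derivative(X_tr, y_tr, X_val, y_val, eps=1e-4):
    """Finite-difference stand-in for the dataset derivative: how
    up-weighting each training point moves the validation error.
    (Illustrative; DIVA computes this in closed form.)"""
    w0 = np.ones(len(X_tr))
    base = val_error(w0, X_tr, y_tr, X_val, y_val)
    grads = np.empty(len(X_tr))
    for i in range(len(X_tr)):
        w = w0.copy()
        w[i] += eps
        grads[i] = (val_error(w, X_tr, y_tr, X_val, y_val) - base) / eps
    return grads
```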
- Self Training with Ensemble of Teacher Models [8.257085583227695]
Training robust deep learning models requires large amounts of labelled data.
In the absence of such large repositories of labelled data, unlabeled data can be exploited instead.
Semi-supervised learning aims to utilize such unlabeled data for training classification models.
arXiv Detail & Related papers (2021-07-17T09:44:09Z)
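A minimal sketch of one common recipe, assuming sklearn-style teacher models exposing `predict_proba`: average the ensemble's predictions on unlabeled data and keep only samples the ensemble labels confidently as pseudo-labeled training data for a student. The confidence threshold and plain averaging are assumptions, not necessarily the paper's exact strategy.

```python
import numpy as np

def pseudo_label(teachers, X_unlabeled, conf_threshold=0.9):
    """Illustrative pseudo-labeling step for ensemble self-training.
    Returns the confidently labeled subset and its pseudo-labels; a
    student would then be trained on labeled + pseudo-labeled data."""
    # average softmax-style predictions across the teacher ensemble
    probs = np.mean([t.predict_proba(X_unlabeled) for t in teachers], axis=0)
    conf = probs.max(axis=1)
    keep = conf >= conf_threshold   # keep only confident agreements
    return X_unlabeled[keep], probs[keep].argmax(axis=1)
```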
- Data Impressions: Mining Deep Models to Extract Samples for Data-free Applications [26.48630545028405]
"Data Impressions" act as proxy to the training data and can be used to realize a variety of tasks.
We show the applicability of data impressions in solving several computer vision tasks.
arXiv Detail & Related papers (2021-01-15T11:37:29Z)
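One way to mine such proxies, sketched under assumptions: freeze the teacher and optimize a random input until the teacher's softmax output matches a chosen target distribution. The KL objective, the Adam optimizer, and the hand-supplied `target_probs` are illustrative; the paper chooses output targets in a more principled way.

```python
import torch

def synthesize_impression(teacher, target_probs, shape, steps=200, lr=0.1):
    """Illustrative data-free synthesis (assumed form, not the paper's
    exact procedure): optimize an input so the frozen teacher's output
    matches target_probs, yielding a proxy 'impression' of training data.
    teacher: frozen nn.Module; target_probs: (1, n_classes) distribution."""
    x = torch.randn(1, *shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        log_p = torch.log_softmax(teacher(x), dim=1)
        # KL(target || model output), computed from log-probabilities
        loss = torch.nn.functional.kl_div(log_p, target_probs,
                                          reduction="batchmean")
        loss.backward()
        opt.step()
    return x.detach()
```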
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.