Data Provenance via Differential Auditing
- URL: http://arxiv.org/abs/2209.01538v1
- Date: Sun, 4 Sep 2022 06:02:25 GMT
- Title: Data Provenance via Differential Auditing
- Authors: Xin Mu, Ming Pang, Feida Zhu
- Abstract summary: We introduce Data Provenance via Differential Auditing (DPDA), a practical framework for auditing data provenance.
We propose two effective auditing function implementations, an additive one and a multiplicative one.
We report evaluations on real-world data sets demonstrating the effectiveness of our proposed auditing technique.
- Score: 5.7962871424710665
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Auditing Data Provenance (ADP), i.e., auditing if a certain piece of data has
been used to train a machine learning model, is an important problem in data
provenance. The feasibility of the task has been demonstrated by existing
auditing techniques, e.g., shadow auditing methods, under certain conditions
such as the availability of label information and the knowledge of training
protocols for the target model. Unfortunately, both of these conditions are
often unmet in real applications. In this paper, we introduce Data
Provenance via Differential Auditing (DPDA), a practical framework for auditing
data provenance with a different approach based on statistically significant
differentials, i.e., after a carefully designed transformation, perturbed
input data from the target model's training set results in much more drastic
changes in the output than input data from the model's non-training set. This
framework allows auditors to distinguish training data from non-training data
without needing to train any shadow models with the help of labeled output
data. Furthermore, we propose two effective auditing function implementations,
an additive one and a multiplicative one. We report evaluations on real-world
data sets demonstrating the effectiveness of our proposed auditing technique.
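To make the differential intuition concrete, below is a minimal black-box sketch of such an audit: perturb an input, measure how much the model's output moves, and flag inputs whose differential score exceeds an auditor-chosen threshold. The Gaussian perturbation, the `model` callable returning class probabilities, and the threshold are illustrative assumptions, not the paper's actual transformation or calibration.

```python
import numpy as np

def differential_score(model, x, sigma=0.05, n_trials=16, mode="additive"):
    """Illustrative sketch (not the paper's exact auditing functions):
    score how strongly the model's output reacts to small input
    perturbations. Per the DPDA intuition, training-set members should
    react more drastically than non-members."""
    p0 = model(x)  # baseline output distribution, shape (n_classes,)
    diffs = []
    for _ in range(n_trials):
        x_pert = x + np.random.normal(0.0, sigma, size=x.shape)
        p1 = model(x_pert)
        if mode == "additive":
            # additive differential: absolute change in the output vector
            diffs.append(np.abs(p1 - p0).sum())
        else:
            # multiplicative differential: relative change in the output
            diffs.append(np.abs(p1 / (p0 + 1e-12) - 1.0).sum())
    return float(np.mean(diffs))

def audit(model, x, threshold):
    """Flag x as a suspected training sample; the threshold is assumed
    to be calibrated by the auditor on known non-training data."""
    return differential_score(model, x) > threshold
```

Note that, as the abstract emphasizes, the decision uses only the target model's outputs on perturbed queries: no shadow models and no labeled output data are involved.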
Related papers
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
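For a sense of how attribution can work without retraining, the sketch below scores each training example of a logistic model by how well its per-sample gradient aligns with the validation gradient, a first-order proxy for marginal contribution at the current parameters. This simplification is an assumption for illustration; In-Run Data Shapley's actual estimator is more involved and accumulates contributions over the training run.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Per-example gradients of the logistic loss w.r.t. weights w."""
    p = 1.0 / (1.0 + np.exp(-X @ w))   # predicted probabilities
    return (p - y)[:, None] * X        # shape (n_samples, n_features)

def first_order_attribution(w, X_train, y_train, X_val, y_val):
    """Illustrative first-order score, not the paper's estimator: credit
    each training point by the alignment of its gradient with the mean
    validation gradient. A positive score means a descent step on that
    example would also reduce the validation loss."""
    g_train = per_sample_grads(w, X_train, y_train)
    g_val = per_sample_grads(w, X_val, y_val).mean(axis=0)
    return g_train @ g_val   # one score per training example
```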
- Striving for data-model efficiency: Identifying data externalities on group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z)
- CAFA: Class-Aware Feature Alignment for Test-Time Adaptation [50.26963784271912]
Test-time adaptation (TTA) aims to address distribution shift by adapting a model to unlabeled data at test time.
We propose a simple yet effective feature alignment loss, termed Class-Aware Feature Alignment (CAFA), which encourages a model to learn target representations in a class-discriminative manner.
arXiv Detail & Related papers (2022-06-01T03:02:07Z)
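As a hedged sketch of the class-aware idea: pull each test-time feature toward a per-class source statistic selected by the model's own pseudo-label, so adaptation stays class-discriminative rather than collapsing classes together. The Euclidean distance, the precomputed `class_means`, and the argmax pseudo-labels are simplifying assumptions; the published CAFA loss is more elaborate.

```python
import torch

def class_aware_alignment_loss(feats, logits, class_means):
    """Simplified stand-in for a class-aware alignment loss (assumed
    form, not CAFA's exact objective). feats: (B, D) test features,
    logits: (B, C) model outputs, class_means: (C, D) per-class feature
    means precomputed on source data."""
    pseudo = logits.argmax(dim=1)   # pseudo-label each test sample
    target = class_means[pseudo]    # (B, D): matched class statistic
    # pull each feature toward the mean of its pseudo-labeled class
    return ((feats - target) ** 2).sum(dim=1).mean()
```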
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Managing dataset shift by adversarial validation for credit scoring [5.560471251954645]
A mismatch between the distribution of the training data and the data that actually needs to be predicted is likely to cause poor model performance.
We propose a method based on adversarial validation to alleviate the dataset shift problem in credit scoring scenarios.
arXiv Detail & Related papers (2021-12-19T07:07:15Z)
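The core of adversarial validation is simple enough to sketch directly: train a classifier to distinguish training rows from test rows and read the cross-validated AUC as a shift detector, with 0.5 meaning the two sets are indistinguishable. This is the generic recipe on assumed tabular features; the paper's credit-scoring pipeline builds further steps on top of it.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

def adversarial_validation_auc(X_train, X_test):
    """Generic adversarial validation (illustrative, not the paper's
    full pipeline): label training rows 0 and test rows 1, then see how
    well a classifier can tell them apart. AUC near 0.5 means no
    detectable shift; AUC near 1.0 signals severe dataset shift."""
    X = np.vstack([X_train, X_test])
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    proba = cross_val_predict(GradientBoostingClassifier(), X, y,
                              cv=5, method="predict_proba")[:, 1]
    return roc_auc_score(y, proba)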
- DIVA: Dataset Derivative of a Learning Task [108.18912044384213]
We present a method to compute the derivative of a learning task with respect to a dataset.
A learning task is a function from a training set to the validation error, which can be represented by a trained deep neural network (DNN).
The "dataset derivative" is a linear operator, computed around the trained model, that indicates how perturbing the weight of each training sample affects the validation error.
arXiv Detail & Related papers (2021-11-18T16:33:12Z)
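To illustrate the quantity DIVA computes, the sketch below gives each training sample an importance weight in a closed-form weighted ridge regression and estimates d(validation error)/d(weight) by finite differences. The ridge model, the MSE objective, and the finite-difference estimator are assumptions for illustration; DIVA obtains this linear operator in closed form around a trained DNN.

```python
import numpy as np

def val_error(weights, X_tr, y_tr, X_val, y_val, lam=1e-3):
    """Validation MSE of a weighted ridge fit, where training sample i
    enters the normal equations with importance weights[i]."""
    A = X_tr.T @ (weights[:, None] * X_tr) + lam * np.eye(X_tr.shape[1])
    theta = np.linalg.solve(A, X_tr.T @ (weights * y_tr))
    return np.mean((X_val @ theta - y_val) ** 2)

def dataset_derivative(X_tr, y_tr, X_val, y_val, eps=1e-4):
    """Finite-difference stand-in for the dataset derivative: how
    up-weighting each training point moves the validation error.
    (Illustrative; DIVA computes this in closed form.)"""
    w0 = np.ones(len(X_tr))
    base = val_error(w0, X_tr, y_tr, X_val, y_val)
    grads = np.empty(len(X_tr))
    for i in range(len(X_tr)):
        w = w0.copy()
        w[i] += eps
        grads[i] = (val_error(w, X_tr, y_tr, X_val, y_val) - base) / eps
    return grads
```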
- Self Training with Ensemble of Teacher Models [8.257085583227695]
Training robust deep learning models requires large amounts of labelled data.
In the absence of such large repositories of labelled data, unlabeled data can be exploited instead.
Semi-supervised learning aims to utilize such unlabeled data for training classification models.
arXiv Detail & Related papers (2021-07-17T09:44:09Z)
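A minimal sketch of one common recipe, assuming sklearn-style teacher models exposing `predict_proba`: average the ensemble's predictions on unlabeled data and keep only samples the ensemble labels confidently as pseudo-labeled training data for a student. The confidence threshold and plain averaging are assumptions, not necessarily the paper's exact strategy.

```python
import numpy as np

def pseudo_label(teachers, X_unlabeled, conf_threshold=0.9):
    """Illustrative pseudo-labeling step for ensemble self-training.
    Returns the confidently labeled subset and its pseudo-labels; a
    student would then be trained on labeled + pseudo-labeled data."""
    # average softmax-style predictions across the teacher ensemble
    probs = np.mean([t.predict_proba(X_unlabeled) for t in teachers], axis=0)
    conf = probs.max(axis=1)
    keep = conf >= conf_threshold   # keep only confident agreements
    return X_unlabeled[keep], probs[keep].argmax(axis=1)
```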
- Data Impressions: Mining Deep Models to Extract Samples for Data-free Applications [26.48630545028405]
"Data Impressions" act as proxy to the training data and can be used to realize a variety of tasks.
We show the applicability of data impressions in solving several computer vision tasks.
arXiv Detail & Related papers (2021-01-15T11:37:29Z)
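One way to mine such proxies, sketched under assumptions: freeze the teacher and optimize a random input until the teacher's softmax output matches a chosen target distribution. The KL objective, the Adam optimizer, and the hand-supplied `target_probs` are illustrative; the paper chooses output targets in a more principled way.

```python
import torch

def synthesize_impression(teacher, target_probs, shape, steps=200, lr=0.1):
    """Illustrative data-free synthesis (assumed form, not the paper's
    exact procedure): optimize an input so the frozen teacher's output
    matches target_probs, yielding a proxy 'impression' of training data.
    teacher: frozen nn.Module; target_probs: (1, n_classes) distribution."""
    x = torch.randn(1, *shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        log_p = torch.log_softmax(teacher(x), dim=1)
        # KL(target || model output), computed from log-probabilities
        loss = torch.nn.functional.kl_div(log_p, target_probs,
                                          reduction="batchmean")
        loss.backward()
        opt.step()
    return x.detach()
```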
- How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as the $\rho$-gap.
We show how the $\rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.