Data Appraisal Without Data Sharing
- URL: http://arxiv.org/abs/2012.06430v1
- Date: Fri, 11 Dec 2020 15:45:19 GMT
- Title: Data Appraisal Without Data Sharing
- Authors: Mimee Xu, Laurens van der Maaten, Awni Hannun
- Abstract summary: We develop methods that do not require data sharing by using secure multi-party computation.
Our experiments show that influence functions provide an appealing trade-off between high-quality appraisal and required computation.
- Score: 28.41079503636652
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the most effective approaches to improving the performance of a
machine-learning model is to acquire additional training data. To do so, a
model owner may seek to acquire relevant training data from a data owner.
Before procuring the data, the model owner needs to appraise the data. However,
the data owner generally does not want to share the data until after an
agreement is reached. The resulting Catch-22 prevents efficient data markets
from forming. To address this problem, we develop data appraisal methods that
do not require data sharing by using secure multi-party computation.
Specifically, we study methods that: (1) compute parameter gradient norms, (2)
perform model fine-tuning, and (3) compute influence functions. Our experiments
show that influence functions provide an appealing trade-off between
high-quality appraisal and required computation.
Related papers
- Capturing the Temporal Dependence of Training Data Influence [100.91355498124527]
We formalize the concept of trajectory-specific leave-one-out influence, which quantifies the impact of removing a data point during training.
We propose data value embedding, a novel technique enabling efficient approximation of trajectory-specific LOO.
As data value embedding captures training data ordering, it offers valuable insights into model training dynamics.
arXiv Detail & Related papers (2024-12-12T18:28:55Z) - Data Acquisition for Improving Model Fairness using Reinforcement Learning [3.3916160303055563]
We focus on the task of acquiring additional labeled data points for training the downstream machine learning model to rapidly improve its fairness.
We present DataSift, a data acquisition framework based on the idea of data valuation that relies on partitioning and multi-armed bandits to determine the most valuable data points to acquire.
We empirically evaluate DataSift on several real-world and synthetic datasets and show that the fairness of a machine learning model can be significantly improved even while acquiring a few data points.
arXiv Detail & Related papers (2024-12-04T03:56:54Z) - Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z) - Mendata: A Framework to Purify Manipulated Training Data [12.406255198638064]
We propose Mendata, a framework to purify manipulated training data.
Mendata perturbs the training inputs so that they retain their utility but are distributed similarly to the reference data.
We demonstrate the effectiveness of Mendata by applying it to defeat state-of-the-art data poisoning and data tracing techniques.
arXiv Detail & Related papers (2023-12-03T04:40:08Z) - Fairness-Aware Data Valuation for Supervised Learning [4.874780144224057]
We propose Fairness-Aware Data vauatiOn (FADO) to incorporate fairness concerns into a series of ML-related tasks.
We show how FADO can be applied as the basis for unfairness mitigation pre-processing techniques.
Our methods achieve promising results -- up to a 40 p.p. improvement in fairness at a less than 1 p.p. loss in performance compared to a baseline.
arXiv Detail & Related papers (2023-03-29T18:51:13Z) - Striving for data-model efficiency: Identifying data externalities on
group performance [75.17591306911015]
Building trustworthy, effective, and responsible machine learning systems hinges on understanding how differences in training data and modeling decisions interact to impact predictive performance.
We focus on a particular type of data-model inefficiency, in which adding training data from some sources can actually lower performance evaluated on key sub-groups of the population.
Our results indicate that data-efficiency is a key component of both accurate and trustworthy machine learning.
arXiv Detail & Related papers (2022-11-11T16:48:27Z) - Managing dataset shift by adversarial validation for credit scoring [5.560471251954645]
The inconsistency between the distribution of training data and the data that actually needs to be predicted is likely to cause poor model performance.
We propose a method based on adversarial validation to alleviate the dataset shift problem in credit scoring scenarios.
arXiv Detail & Related papers (2021-12-19T07:07:15Z) - Data Impressions: Mining Deep Models to Extract Samples for Data-free
Applications [26.48630545028405]
"Data Impressions" act as proxy to the training data and can be used to realize a variety of tasks.
We show the applicability of data impressions in solving several computer vision tasks.
arXiv Detail & Related papers (2021-01-15T11:37:29Z) - Toward Understanding the Influence of Individual Clients in Federated
Learning [52.07734799278535]
Federated learning allows clients to jointly train a global model without sending their private data to a central server.
We defined a new notion called em-Influence, quantify this influence over parameters, and proposed an effective efficient model to estimate this metric.
arXiv Detail & Related papers (2020-12-20T14:34:36Z) - How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as $rho$-gap.
We show how the $rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z) - DeGAN : Data-Enriching GAN for Retrieving Representative Samples from a
Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and lack of relevant data, for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.