Have you forgotten? A method to assess if machine learning models have forgotten data
- URL: http://arxiv.org/abs/2004.10129v2
- Date: Sun, 12 Jul 2020 12:50:14 GMT
- Title: Have you forgotten? A method to assess if machine learning models have forgotten data
- Authors: Xiao Liu, Sotirios A Tsaftaris
- Abstract summary: In the era of deep learning, aggregation of data from several sources is a common approach to ensuring data diversity.
In this paper, we want to address the challenging question of whether data have been forgotten by a model.
We establish statistical methods that compare the target's outputs with outputs of models trained with different datasets.
- Score: 20.9131206112401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the era of deep learning, aggregation of data from several sources is a
common approach to ensuring data diversity. Let us consider a scenario where
several providers contribute data to a consortium for the joint development of
a classification model (hereafter the target model), but now one of the
providers decides to leave. This provider requests that their data (hereafter
the query dataset) be removed from the databases but also that the model
`forgets' their data. In this paper, for the first time, we want to address the
challenging question of whether data have been forgotten by a model. We assume
knowledge of the query dataset and the distribution of a model's output. We
establish statistical methods that compare the target's outputs with outputs of
models trained with different datasets. We evaluate our approach on several
benchmark datasets (MNIST, CIFAR-10 and SVHN) and on a cardiac pathology
diagnosis task using data from the Automated Cardiac Diagnosis Challenge
(ACDC). We hope to encourage studies on what information a model retains and
inspire extensions in more complex settings.
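As a hedged illustration of the core idea (comparing the target model's output distribution on the query dataset against outputs of reference models), one could use a two-sample Kolmogorov-Smirnov statistic over per-example confidences. This is a minimal sketch, not the paper's exact procedure, and the confidence values below are made up:

```python
import bisect

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)

    def ecdf(sample, x):
        # Fraction of sample values <= x.
        return bisect.bisect_right(sample, x) / len(sample)

    points = sorted(set(a) | set(b))
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

# Toy illustration: a model that trained on the query data tends to be
# more confident on it than a model that never saw it.
seen   = [0.97, 0.95, 0.99, 0.96, 0.98, 0.94]  # hypothetical confidences
unseen = [0.71, 0.80, 0.65, 0.77, 0.73, 0.69]  # hypothetical confidences
print(ks_statistic(seen, unseen))  # 1.0: the two samples do not overlap
```

A large statistic suggests the target's outputs on the query dataset differ markedly from those of models that never trained on it, i.e. evidence the data was not forgotten; in practice one would calibrate a threshold or p-value against reference models.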
Related papers
- Federated Data Model [16.62770246342126]
In artificial intelligence (AI), especially deep learning, data diversity and volume play a pivotal role in model development.
We developed a method called the Federated Data Model (FDM) to train robust deep learning models across different locations.
Our results show that models trained with this method perform well both on the data they were originally trained on and on data from other sites.
arXiv Detail & Related papers (2024-03-13T18:16:54Z)
- Designing Data: Proactive Data Collection and Iteration for Machine Learning [12.295169687537395]
Lack of diversity in data collection has caused significant failures in machine learning (ML) applications.
New methods to track & manage data collection, iteration, and model training are necessary for evaluating whether datasets reflect real world variability.
arXiv Detail & Related papers (2023-01-24T21:40:29Z)
- Data Distillation: A Survey [32.718297871027865]
Deep learning has led to the curation of a vast number of massive and multifarious datasets.
Despite having close-to-human performance on individual tasks, training parameter-hungry models on large datasets poses multi-faceted problems.
Data distillation approaches aim to synthesize terse data summaries, which can serve as effective drop-in replacements of the original dataset.
arXiv Detail & Related papers (2023-01-11T02:25:10Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
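One simple instance of merging models in parameter space is elementwise averaging of corresponding parameters; the paper's actual fusion method may weight parameters differently. A dependency-free toy sketch, using plain dicts of lists in place of real tensors:

```python
def merge_state_dicts(state_dicts):
    """Average corresponding parameters of several models that share
    an identical architecture (same keys, same shapes)."""
    merged = {}
    for key in state_dicts[0]:
        params = [sd[key] for sd in state_dicts]
        # Elementwise mean across models.
        merged[key] = [sum(vals) / len(vals) for vals in zip(*params)]
    return merged

# Two hypothetical single-layer models with made-up weights.
model_a = {"w": [1.0, 2.0], "b": [0.0]}
model_b = {"w": [3.0, 4.0], "b": [2.0]}
print(merge_state_dicts([model_a, model_b]))
# {'w': [2.0, 3.0], 'b': [1.0]}
```

The appeal of such dataless merging is that it needs no training data from any contributor: only the trained weights are exchanged and combined.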
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Learning from aggregated data with a maximum entropy model [73.63512438583375]
We show how a new model, similar to a logistic regression, may be learned from aggregated data only by approximating the unobserved feature distribution with a maximum entropy hypothesis.
We present empirical evidence on several public datasets that the model learned this way can achieve performances comparable to those of a logistic model trained with the full unaggregated data.
arXiv Detail & Related papers (2022-10-05T09:17:27Z)
- Application of Federated Learning in Building a Robust COVID-19 Chest X-ray Classification Model [0.0]
Federated Learning (FL) helps AI models to generalize better without moving all the data to a central server.
We trained a deep learning model to solve a binary classification problem of predicting the presence or absence of COVID-19.
arXiv Detail & Related papers (2022-04-22T05:21:50Z)
- A Unified Deep Model of Learning from both Data and Queries for Cardinality Estimation [28.570086492742035]
We propose a new unified deep autoregressive model, UAE, that learns the joint data distribution from both the data and query workload.
UAE achieves single-digit multiplicative error at tail, better accuracies over state-of-the-art methods, and is both space and time efficient.
arXiv Detail & Related papers (2021-07-26T16:09:58Z)
- On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study [65.17429512679695]
In adversarial data collection (ADC), a human workforce interacts with a model in real time, attempting to produce examples that elicit incorrect predictions.
Despite ADC's intuitive appeal, it remains unclear when training on adversarial datasets produces more robust models.
arXiv Detail & Related papers (2021-06-02T00:48:33Z)
- Decentralized Federated Learning Preserves Model and Data Privacy [77.454688257702]
We propose a fully decentralized approach that allows knowledge to be shared between trained models.
Students are trained on the output of their teachers via synthetically generated input data.
The results show that an untrained student model, trained on the teacher's output, reaches F1-scores comparable to the teacher's.
arXiv Detail & Related papers (2021-02-01T14:38:54Z)
- DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier [58.979104709647295]
We bridge the gap between the abundance of available data and lack of relevant data, for the future learning tasks of a trained network.
We use the available data, that may be an imbalanced subset of the original training dataset, or a related domain dataset, to retrieve representative samples.
We demonstrate that data from a related domain can be leveraged to achieve state-of-the-art performance.
arXiv Detail & Related papers (2019-12-27T02:05:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.