Faking feature importance: A cautionary tale on the use of
differentially-private synthetic data
- URL: http://arxiv.org/abs/2203.01363v1
- Date: Wed, 2 Mar 2022 19:11:43 GMT
- Title: Faking feature importance: A cautionary tale on the use of
differentially-private synthetic data
- Authors: Oscar Giles, Kasra Hosseini, Grigorios Mingas, Oliver Strickson,
Louise Bowler, Camila Rangel Smith, Harrison Wilde, Jen Ning Lim, Bilal
Mateen, Kasun Amarasinghe, Rayid Ghani, Alison Heppenstall, Nik Lomax, Nick
Malleson, Martin O'Reilly, Sebastian Vollmer
- Abstract summary: This paper presents an empirical analysis of the agreement between the feature importance obtained from raw and from synthetic data.
We apply various utility measures to quantify the agreement in feature importance as this varies with the level of privacy.
This work has important implications for developing synthetic versions of highly sensitive data sets in fields such as finance and healthcare.
- Score: 3.631918877491949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Synthetic datasets are often presented as a silver-bullet solution to the
problem of privacy-preserving data publishing. However, for many applications,
synthetic data has been shown to have limited utility when used to train
predictive models. One promising potential application of these data is in the
exploratory phase of the machine learning workflow, which involves
understanding, engineering and selecting features. This phase often involves
considerable time, and depends on the availability of data. There would be
substantial value in synthetic data that permitted these steps to be carried
out while, for example, data access was being negotiated, or with fewer
information governance restrictions. This paper presents an empirical analysis
of the agreement between the feature importance obtained from raw and from
synthetic data, on a range of artificially generated and real-world datasets
(where feature importance represents how useful each feature is when predicting
the outcome). We employ two differentially-private methods to produce
synthetic data, and apply various utility measures to quantify the agreement in
feature importance as this varies with the level of privacy. Our results
indicate that synthetic data can sometimes preserve several representations of
the ranking of feature importance in simple settings but their performance is
not consistent and depends upon a number of factors. Particular caution should
be exercised in more nuanced real-world settings, where synthetic data can lead
to differences in ranked feature importance that could alter key modelling
decisions. This work has important implications for developing synthetic
versions of highly sensitive data sets in fields such as finance and
healthcare.
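As an illustration of what "agreement in feature importance" can mean, the sketch below compares two importance vectors using Spearman rank correlation and top-k overlap. These two measures are assumptions chosen for illustration; the abstract does not name the utility measures the authors use. The importance values are invented, and the helper names (`ranks`, `spearman`, `top_k_overlap`) are hypothetical.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, largest value gets rank 1; tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for j in range(pos, end + 1):
            result[order[j]] = avg_rank
        pos = end + 1
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the two rank vectors.

    Assumes neither vector is constant (otherwise the variance is zero).
    """
    ra, rb = ranks(a), ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def top_k_overlap(a: Sequence[float], b: Sequence[float], k: int) -> float:
    """Fraction of the k most important features shared by both rankings."""
    top_a = set(sorted(range(len(a)), key=lambda i: -a[i])[:k])
    top_b = set(sorted(range(len(b)), key=lambda i: -b[i])[:k])
    return len(top_a & top_b) / k


# Invented importances for five features (illustrative only, not from the paper).
importance_raw = [0.40, 0.30, 0.15, 0.10, 0.05]
importance_synth = [0.35, 0.10, 0.30, 0.20, 0.05]

print(round(spearman(importance_raw, importance_synth), 3))  # 0.7
print(top_k_overlap(importance_raw, importance_synth, k=2))  # 0.5
```

In this toy case the overall ranking is fairly well correlated, yet only one of the two top features is shared. That is the kind of disagreement that could change a feature-selection decision.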
Related papers
- Best Practices and Lessons Learned on Synthetic Data [83.63271573197026]
The success of AI models relies on the availability of large, diverse, and high-quality datasets.
Synthetic data has emerged as a promising solution by generating artificial data that mimics real-world patterns.
arXiv Detail & Related papers (2024-04-11T06:34:17Z)
- Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs.
We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
- Synthetic Data in Healthcare [10.555189948915492]
We present the cases for physical and statistical simulations for creating data and the proposed applications in healthcare and medicine.
We discuss that while synthetics can promote privacy, equity, safety and continual and causal learning, they also run the risk of introducing flaws, blind spots and propagating or exaggerating biases.
arXiv Detail & Related papers (2023-04-06T17:23:39Z)
- Synthetic-to-Real Domain Adaptation for Action Recognition: A Dataset and Baseline Performances [76.34037366117234]
We introduce a new dataset called Robot Control Gestures (RoCoG-v2).
The dataset is composed of both real and synthetic videos from seven gesture classes.
We present results using state-of-the-art action recognition and domain adaptation algorithms.
arXiv Detail & Related papers (2023-03-17T23:23:55Z)
- Synthetic Data: Methods, Use Cases, and Risks [11.413309528464632]
A possible alternative gaining momentum in both the research community and industry is to share synthetic data instead.
We provide a gentle introduction to synthetic data and discuss its use cases, the privacy challenges that are still unaddressed, and its inherent limitations as an effective privacy-enhancing technology.
arXiv Detail & Related papers (2023-03-01T16:35:33Z)
- Synthetic Data for Object Classification in Industrial Applications [53.180678723280145]
In object classification, capturing a large number of images per object and in different conditions is not always possible.
This work explores the creation of artificial images using a game engine to cope with limited data in the training dataset.
arXiv Detail & Related papers (2022-12-09T11:43:04Z)
- TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z)
- Enabling Synthetic Data adoption in regulated domains [1.9512796489908306]
The switch from a Model-Centric to a Data-Centric mindset is putting emphasis on data and its quality rather than algorithms.
In particular, the sensitive nature of the information in highly regulated scenarios needs to be accounted for.
A clever way to bypass such a conundrum relies on Synthetic Data: data obtained from a generative process, learning the real data properties.
arXiv Detail & Related papers (2022-04-13T10:53:54Z)
- Bias Mitigated Learning from Differentially Private Synthetic Data: A Cautionary Tale [13.881022208028751]
Bias can affect all analyses as the synthetic data distribution is an inconsistent estimate of the real-data distribution.
We propose several bias mitigation strategies using privatized likelihood ratios.
We show that bias mitigation provides simple and effective privacy-compliant augmentation for general applications of synthetic data.
arXiv Detail & Related papers (2021-08-24T19:56:44Z)
- Measuring Utility and Privacy of Synthetic Genomic Data [3.635321290763711]
We provide the first evaluation of the utility and the privacy protection of five state-of-the-art models for generating synthetic genomic data.
Overall, there is no single approach for generating synthetic genomic data that performs well across the board.
arXiv Detail & Related papers (2021-02-05T17:41:01Z)
- A Philosophy of Data [91.3755431537592]
We work from the fundamental properties necessary for statistical computation to a definition of statistical data.
We argue that the need for useful data to be commensurable rules out an understanding of properties as fundamentally unique or equal.
With our increasing reliance on data and data technologies, these two characteristics of data affect our collective conception of reality.
arXiv Detail & Related papers (2020-04-15T14:47:24Z)