Assessing the risk of re-identification arising from an attack on
anonymised data
- URL: http://arxiv.org/abs/2203.16921v1
- Date: Thu, 31 Mar 2022 09:47:05 GMT
- Title: Assessing the risk of re-identification arising from an attack on
anonymised data
- Authors: Anna Antoniou, Giacomo Dossena, Julia MacMillan, Steven Hamblin, David
Clifton, Paula Petrone
- Abstract summary: We calculate the risk of re-identification arising from a malicious attack on an anonymised dataset.
We present an analytical means of estimating the probability of re-identification of a single patient in a k-anonymised dataset.
We generalize this solution to obtain the probability of multiple patients being re-identified.
- Score: 0.24466725954625884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Objective: The use of routinely-acquired medical data for research purposes
requires the protection of patient confidentiality via data anonymisation. The
objective of this work is to calculate the risk of re-identification arising
from a malicious attack on an anonymised dataset, as described below. Methods:
We first present an analytical means of estimating the probability of
re-identification of a single patient in a k-anonymised dataset of Electronic
Health Record (EHR) data. Second, we generalize this solution to obtain the
probability of multiple patients being re-identified. We provide synthetic
validation via Monte Carlo simulations to illustrate the accuracy of the
estimates obtained. Results: The proposed analytical framework for risk
estimation provides re-identification probabilities that are in agreement with
those provided by simulation in a number of scenarios. Our work is limited by
conservative assumptions which inflate the re-identification probability.
Discussion: Our estimates show that the re-identification probability increases
with the proportion of the dataset maliciously obtained and that it has an
inverse relationship with the equivalence class size. Our recursive approach
extends the applicability domain to the general case of a multi-patient
re-identification attack in an arbitrary k-anonymisation scheme. Conclusion: We
prescribe a systematic way to parametrize the k-anonymisation process based on
a pre-determined re-identification probability. We observed that the benefits
of a reduced re-identification risk that come with increasing k-size may not be
worth the reduction in data granularity when one is considering benchmarking
the re-identification probability on the size of the portion of the dataset
maliciously obtained by the adversary.
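The qualitative behaviour described in the abstract can be illustrated with a toy model. The sketch below is not the paper's derivation; it assumes a deliberately simple, hypothetical attack in which every record is leaked independently with probability `p`, the target sits in an equivalence class of size `k`, and the adversary guesses uniformly among the leaked records of that class. Under those assumptions the single-patient risk is (1 - (1 - p)^k) / k, which grows with `p` and shrinks with `k`. The function names (`analytic_risk`, `simulated_risk`, `min_k_for_risk`) are illustrative only.

```python
import random


def analytic_risk(k: int, p: float) -> float:
    """Closed-form risk for the ASSUMED toy model (not the paper's formula).

    The target is leaked with probability p; given that, the adversary picks
    uniformly among the 1 + J leaked class members, J ~ Binomial(k - 1, p).
    p * E[1 / (1 + J)] simplifies to (1 - (1 - p)**k) / k.
    """
    return (1.0 - (1.0 - p) ** k) / k


def simulated_risk(k: int, p: float, trials: int = 100_000, seed: int = 0) -> float:
    """Monte Carlo estimate of the same toy-model probability."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        if rng.random() >= p:
            continue  # target's record was not leaked: no re-identification
        # Count how many of the other k - 1 class members were also leaked.
        others = sum(rng.random() < p for _ in range(k - 1))
        # Adversary guesses uniformly among the leaked records in the class.
        if rng.random() < 1.0 / (1 + others):
            hits += 1
    return hits / trials


def min_k_for_risk(p: float, threshold: float) -> int:
    """Smallest k whose toy-model risk is at or below a chosen threshold."""
    k = 1
    while analytic_risk(k, p) > threshold:
        k += 1  # terminates because the risk is bounded above by 1 / k
    return k


if __name__ == "__main__":
    for k in (2, 5, 10):
        print(k, analytic_risk(k, 0.3), simulated_risk(k, 0.3))
    print("smallest k for risk <= 0.1 at p = 0.3:", min_k_for_risk(0.3, 0.1))
```

Choosing `k` from a target risk, as `min_k_for_risk` does, mirrors the idea of parametrizing the k-anonymisation process from a pre-determined re-identification probability, though only within this simplified model.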
Related papers
- Data-driven decision-making under uncertainty with entropic risk measure [5.407319151576265]
The entropic risk measure is widely used in high-stakes decision making to account for tail risks associated with an uncertain loss.
To debias the empirical entropic risk estimator, we propose a strongly consistent bootstrapping procedure.
We show that cross validation methods can result in significantly higher out-of-sample risk for the insurer if the bias in validation performance is not corrected for.
arXiv Detail & Related papers (2024-09-30T04:02:52Z) - Risk and cross validation in ridge regression with correlated samples [72.59731158970894]
We provide sharp asymptotics for the in- and out-of-sample risks of ridge regression when the data points have arbitrary correlations.
We further extend our analysis to the case where the test point has non-trivial correlations with the training set, a setting often encountered in time series forecasting.
We validate our theory across a variety of high dimensional data.
arXiv Detail & Related papers (2024-08-08T17:27:29Z) - Geometry-Aware Instrumental Variable Regression [56.16884466478886]
We propose a transport-based IV estimator that takes into account the geometry of the data manifold through data-derivative information.
We provide a simple plug-and-play implementation of our method that performs on par with related estimators in standard settings.
arXiv Detail & Related papers (2024-05-19T17:49:33Z) - Ex-Ante Assessment of Discrimination in Dataset [20.574371560492494]
Data owners face increasing liability for how the use of their data could harm underprivileged communities.
We propose FORESEE, a FORESt of decision trEEs algorithm, which generates a score that captures how likely an individual's response is to vary with sensitive attributes.
arXiv Detail & Related papers (2022-08-16T19:28:22Z) - Mitigating multiple descents: A model-agnostic framework for risk
monotonization [84.6382406922369]
We develop a general framework for risk monotonization based on cross-validation.
We propose two data-driven methodologies, namely zero- and one-step, that are akin to bagging and boosting.
arXiv Detail & Related papers (2022-05-25T17:41:40Z) - Clinical Outcome Prediction from Admission Notes using Self-Supervised
Knowledge Integration [55.88616573143478]
Outcome prediction from clinical text can prevent doctors from overlooking possible risks.
Diagnoses at discharge, procedures performed, in-hospital mortality and length-of-stay prediction are four common outcome prediction targets.
We propose clinical outcome pre-training to integrate knowledge about patient outcomes from multiple public sources.
arXiv Detail & Related papers (2021-02-08T10:26:44Z) - Individual dynamic prediction of clinical endpoint from large
dimensional longitudinal biomarker history: a landmark approach [0.0]
We propose a solution for the dynamic prediction of a health event that may exploit repeated measures of a possibly large number of markers.
Our methodology, implemented in R, enables the prediction of an event using the entire longitudinal patient history, even when the number of repeated markers is large.
arXiv Detail & Related papers (2021-02-02T12:36:18Z) - Asymptotic Analysis of an Ensemble of Randomly Projected Linear
Discriminants [94.46276668068327]
In [1], an ensemble of randomly projected linear discriminants is used to classify datasets.
We develop a consistent estimator of the misclassification probability as an alternative to the computationally-costly cross-validation estimator.
We also demonstrate the use of our estimator for tuning the projection dimension on both real and synthetic data.
arXiv Detail & Related papers (2020-04-17T12:47:04Z) - Systematic Evaluation of Privacy Risks of Machine Learning Models [41.017707772150835]
We show that prior work on membership inference attacks may severely underestimate the privacy risks.
We first propose to benchmark membership inference privacy risks by improving existing non-neural network based inference attacks.
We then introduce a new approach for fine-grained privacy analysis by formulating and deriving a new metric called the privacy risk score.
arXiv Detail & Related papers (2020-03-24T00:53:53Z) - Orthogonal Statistical Learning [49.55515683387805]
We provide non-asymptotic excess risk guarantees for statistical learning in a setting where the population risk depends on an unknown nuisance parameter.
We show that if the population risk satisfies a condition called Neyman orthogonality, the impact of the nuisance estimation error on the excess risk bound achieved by the meta-algorithm is of second order.
arXiv Detail & Related papers (2019-01-25T02:21:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.