Comparison of machine learning models applied on anonymized data with
different techniques
- URL: http://arxiv.org/abs/2305.07415v1
- Date: Fri, 12 May 2023 12:34:07 GMT
- Title: Comparison of machine learning models applied on anonymized data with
different techniques
- Authors: Judith Sáinz-Pardo Díaz and Álvaro López García
- Abstract summary: We study four classical machine learning methods currently used for classification purposes in order to analyze the results as a function of the anonymization techniques applied and the parameters selected for each of them.
The performance of these models is studied when varying the value of k for k-anonymity, and additional tools such as $\ell$-diversity, t-closeness and $\delta$-disclosure privacy are also deployed on the well-known Adult dataset.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Anonymization techniques based on obfuscating the quasi-identifiers by means
of value generalization hierarchies are widely used to achieve preset levels of
privacy. To prevent different types of attacks against database privacy it is
necessary to apply several anonymization techniques beyond the classical
k-anonymity or $\ell$-diversity. However, the application of these methods is
directly connected to a reduction of the data's utility in prediction and
decision-making tasks. In this work we study four classical machine learning methods
currently used for classification purposes in order to analyze the results as a
function of the anonymization techniques applied and the parameters selected
for each of them. The performance of these models is studied when varying the
value of k for k-anonymity, and additional tools such as $\ell$-diversity,
t-closeness and $\delta$-disclosure privacy are also deployed on the well-known
Adult dataset.
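For reference, the privacy notions named in the abstract admit short formal statements. The following are the standard forms from the anonymization literature (a hedged summary, not quoted from the paper): let $E$ denote an equivalence class of records sharing the same generalized quasi-identifier values, and $S$ the sensitive attribute.
```latex
% Standard definitions from the anonymization literature (not quoted from the
% paper); E is an equivalence class, S the sensitive attribute.
\begin{itemize}
  \item $k$-anonymity: every equivalence class satisfies $|E| \ge k$.
  \item (Distinct) $\ell$-diversity: each $E$ contains at least $\ell$
        distinct values of $S$.
  \item $t$-closeness: $d\bigl(P(S \mid E),\, P(S)\bigr) \le t$ for every $E$,
        where $d$ is typically the Earth Mover's Distance.
  \item $\delta$-disclosure privacy:
        $\left|\log \frac{P(s \mid E)}{P(s)}\right| < \delta$
        for every $E$ and every value $s$ of $S$.
\end{itemize}
```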
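As a rough illustration of the experimental setup described above, the following Python sketch coarsens quasi-identifiers of the Adult dataset along a simple value-generalization hierarchy and compares classifier accuracy on the original versus the generalized features. This is a minimal sketch, not the authors' pipeline: the column names, bin edges, and model choices are illustrative assumptions, and enforcing an actual value of k would require an anonymization tool such as ARX or a Mondrian partitioner.
```python
# Minimal sketch (not the authors' code): generalize quasi-identifiers of the
# Adult dataset via coarse binning, then compare classifier accuracy on the
# original vs. the generalized features. Columns and bins are assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

COLS = ["age", "workclass", "fnlwgt", "education", "education-num",
        "marital-status", "occupation", "relationship", "race", "sex",
        "capital-gain", "capital-loss", "hours-per-week",
        "native-country", "income"]
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "adult/adult.data")
df = pd.read_csv(URL, names=COLS, skipinitialspace=True)

def generalize(d: pd.DataFrame) -> pd.DataFrame:
    """One level of a value-generalization hierarchy on two quasi-identifiers."""
    g = d.copy()
    g["age"] = pd.cut(g["age"], bins=[0, 30, 45, 60, 120],
                      labels=["<=30", "31-45", "46-60", ">60"]).astype(str)
    g["native-country"] = g["native-country"].where(
        g["native-country"] == "United-States", "Other")  # suppress detail
    return g

y = (df["income"] == ">50K").astype(int)
for label, data in [("original", df), ("generalized", generalize(df))]:
    X = pd.get_dummies(data.drop(columns="income"))
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    for model in (LogisticRegression(max_iter=1000), RandomForestClassifier()):
        acc = accuracy_score(y_te, model.fit(X_tr, y_tr).predict(X_te))
        print(f"{label:12s} {type(model).__name__:22s} acc={acc:.3f}")
```
The paper's four classifiers and its specific k, $\ell$, t and $\delta$ settings are not reproduced here; the sketch only shows the shape of the utility comparison.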
Related papers
- Data Lineage Inference: Uncovering Privacy Vulnerabilities of Dataset Pruning [31.888075470799908]
We show that even if data in a redundant set is solely used before model training, its pruning-phase membership status can still be detected through attacks.
We introduce a new task called Data-Centric Membership Inference and propose the first ever data-centric privacy inference paradigm named Data Lineage Inference.
We find that different pruning methods involve varying levels of privacy leakage, and even the same pruning method can present different privacy risks at different pruning fractions.
arXiv Detail & Related papers (2024-11-24T11:46:59Z)
- Masked Differential Privacy [64.32494202656801]
We propose an effective approach called masked differential privacy, which allows for controlling the sensitive regions where differential privacy (DP) is applied.
Our method operates selectively on data, allowing for the definition of non-sensitive spatio-temporal regions without DP application, or for combining differential privacy with other privacy techniques within data samples.
arXiv Detail & Related papers (2024-10-22T15:22:53Z)
- Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenge posed by the re-identification capabilities of Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
- Asymptotic utility of spectral anonymization [0.0]
We study the utility and privacy of the spectral anonymization (SA) algorithm.
We introduce two novel SA variants: $\mathcal{J}$-spectral anonymization and matrix $\mathcal{O}$-spectral anonymization.
We show how well, under some practical assumptions, these SA algorithms preserve the first and second moments of the original data.
arXiv Detail & Related papers (2024-05-28T07:53:20Z)
- A Novel Cross-Perturbation for Single Domain Generalization [54.612933105967606]
Single domain generalization aims to enhance the ability of the model to generalize to unknown domains when trained on a single source domain.
The limited diversity in the training data hampers the learning of domain-invariant features, resulting in compromised generalization performance.
We propose CPerb, a simple yet effective cross-perturbation method to enhance the diversity of the training data.
arXiv Detail & Related papers (2023-08-02T03:16:12Z)
- Privacy- and Utility-Preserving NLP with Anonymized Data: A case study of Pseudonymization [22.84767881115746]
Our work provides crucial insights into the gaps between original and anonymized data.
We make our code, pseudonymized datasets, and downstream models publicly available.
arXiv Detail & Related papers (2023-06-08T21:06:19Z)
- Self-Paced Learning for Open-Set Domain Adaptation [50.620824701934]
Traditional domain adaptation methods presume that the classes in the source and target domains are identical.
Open-set domain adaptation (OSDA) addresses this limitation by allowing previously unseen classes in the target domain.
We propose a novel framework based on self-paced learning to distinguish common and unknown class samples.
arXiv Detail & Related papers (2023-03-10T14:11:09Z)
- Cluster-level pseudo-labelling for source-free cross-domain facial expression recognition [94.56304526014875]
We propose the first Source-Free Unsupervised Domain Adaptation (SFUDA) method for Facial Expression Recognition (FER).
Our method exploits self-supervised pretraining to learn good feature representations from the target data.
We validate the effectiveness of our method in four adaptation setups, proving that it consistently outperforms existing SFUDA methods when applied to FER.
arXiv Detail & Related papers (2022-10-11T08:24:50Z)
- On the utility and protection of optimization with differential privacy and classic regularization techniques [9.413131350284083]
We study the effectiveness of the differentially private stochastic gradient descent (DP-SGD) algorithm against standard optimization practices with regularization techniques.
We discuss differential privacy's flaws and limits and empirically demonstrate the often superior privacy-preserving properties of dropout and $\ell_2$-regularization (a minimal DP-SGD sketch appears after this list).
arXiv Detail & Related papers (2022-09-07T14:10:21Z)
- Selecting the suitable resampling strategy for imbalanced data classification regarding dataset properties [62.997667081978825]
In many application domains such as medicine, information retrieval, cybersecurity, social media, etc., datasets used for inducing classification models often have an unequal distribution of the instances of each class.
This situation, known as imbalanced data classification, causes low predictive performance for the minority class examples.
Oversampling and undersampling techniques are well-known strategies to deal with this problem by balancing the number of examples of each class (see the resampling sketch after this list).
arXiv Detail & Related papers (2021-12-15T18:56:39Z)
- $k$-Anonymity in Practice: How Generalisation and Suppression Affect Machine Learning Classifiers [2.4282642968872037]
We investigate the effects of different $k$-anonymisation algorithms on the results of machine learning models.
Our systematic evaluation shows that with an increasingly strong $k$-anonymity constraint, the classification performance generally degrades.
Mondrian can be considered the method with the most appealing properties for subsequent classification.
arXiv Detail & Related papers (2021-02-09T11:28:20Z)
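As a companion to the DP-SGD entry above (forward-referenced there), here is a minimal sketch of one differentially private gradient step for logistic regression: per-example gradient clipping followed by Gaussian noise. The plain-NumPy formulation and all hyperparameter values are illustrative assumptions, not the cited paper's setup.
```python
# Minimal DP-SGD sketch: per-example gradient clipping + Gaussian noise for a
# logistic-regression loss. Hyperparameters are illustrative assumptions.
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip=1.0, noise_mult=1.1, rng=None):
    """One DP-SGD update: clip each per-example gradient to L2 norm `clip`,
    sum, add N(0, (noise_mult * clip)^2) noise, and average over the batch."""
    rng = np.random.default_rng(0) if rng is None else rng
    preds = 1.0 / (1.0 + np.exp(-X @ w))              # sigmoid predictions
    per_example = (preds - y)[:, None] * X            # per-example gradients
    norms = np.linalg.norm(per_example, axis=1, keepdims=True)
    clipped = per_example / np.maximum(1.0, norms / clip)
    noise = rng.normal(0.0, noise_mult * clip, size=w.shape)
    return w - lr * (clipped.sum(axis=0) + noise) / len(X)

# Toy usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 5))
y = (X[:, 0] > 0).astype(float)
w = np.zeros(5)
for _ in range(200):
    w = dp_sgd_step(w, X, y, rng=rng)
print("learned weights:", np.round(w, 2))
```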
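And for the imbalanced-classification entry (also forward-referenced above), a brief hedged sketch of the two resampling families it mentions; the use of the imbalanced-learn library and the synthetic 95/5 class split are assumptions.
```python
# Minimal sketch of over- and undersampling for imbalanced classification,
# using imbalanced-learn on synthetic data (library choice is an assumption).
from collections import Counter

from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.datasets import make_classification

# 95% / 5% class split to mimic an imbalanced dataset.
X, y = make_classification(n_samples=1000, weights=[0.95], random_state=0)
print("original:    ", Counter(y))

# Oversampling: duplicate minority-class examples up to parity.
X_o, y_o = RandomOverSampler(random_state=0).fit_resample(X, y)
print("oversampled: ", Counter(y_o))

# Undersampling: drop majority-class examples down to parity.
X_u, y_u = RandomUnderSampler(random_state=0).fit_resample(X, y)
print("undersampled:", Counter(y_u))
```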
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.