PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels
- URL: http://arxiv.org/abs/2304.00047v1
- Date: Fri, 31 Mar 2023 18:03:53 GMT
- Title: PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels
- Authors: Homa Esfahanizadeh, Adam Yala, Rafael G. L. D'Oliveira, Andrea J. D. Jaba, Victor Quach, Ken R. Duffy, Tommi S. Jaakkola, Vinod Vaikuntanathan, Manya Ghobadi, Regina Barzilay, Muriel Médard
- Abstract summary: We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user.
We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
- Score: 59.66777287810985
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Allowing organizations to share their data for training machine learning (ML) models without unintended information leakage is an open problem in practice. A promising technique is to train models on encoded data. Our approach, called Privately Encoded Open Datasets with
Public Labels (PEOPL), uses a certain class of randomly constructed transforms
to encode sensitive data. Organizations publish their randomly encoded data and
associated raw labels for ML training, where training is done without knowledge
of the encoding realization. We investigate several important aspects of this
problem: We introduce information-theoretic scores for privacy and utility,
which quantify the average performance of an unfaithful user (e.g., adversary)
and a faithful user (e.g., model developer) who have access to the published encoded data. We then theoretically characterize primitives for building
families of encoding schemes that motivate the use of random deep neural
networks. Empirically, we evaluate the robustness of our randomized encoding scheme and a linear scheme against a suite of computational attacks, and we also show that our scheme achieves prediction accuracy competitive with raw-sample baselines. Moreover, we demonstrate that multiple institutions, using
independent random encoders, can collaborate to train improved ML models.
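The workflow the abstract describes can be pictured concretely: a data owner fixes a randomly constructed neural network as a private encoder, publishes only the encoded features together with the raw labels, and a model developer trains a classifier on that release without ever seeing the encoder. The following is a minimal sketch of that pipeline; the MLP architecture, dimensions, and training loop are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # the data owner keeps the true random realization private

# Randomly constructed encoder known only to the data owner (never published).
# Architecture and sizes are illustrative; PEOPL studies families of such transforms.
encoder = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 256),
)
for p in encoder.parameters():
    p.requires_grad_(False)  # the encoder is fixed, never trained

x_private = torch.randn(1000, 784)        # stand-in for raw sensitive samples
y_public = torch.randint(0, 10, (1000,))  # labels are published as-is

# The organization releases only (encoded data, labels).
with torch.no_grad():
    z_published = encoder(x_private)

# A model developer trains on the encoded release without knowing the encoder.
classifier = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(5):
    opt.zero_grad()
    loss = loss_fn(classifier(z_published), y_public)
    loss.backward()
    opt.step()
```

Privacy in this picture rests on the encoder's random realization staying secret, which is exactly what the paper's information-theoretic privacy score is meant to quantify for an unfaithful user.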
Related papers
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks, and agent systems.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
- Robust Representation Learning for Privacy-Preserving Machine Learning: A Multi-Objective Autoencoder Approach [0.9831489366502302]
We propose a robust representation learning framework for privacy-preserving machine learning (ppML).
Our method centers on training autoencoders in a multi-objective manner and then concatenating the latent and learned features from the encoding part as the encoded form of our data.
With our proposed framework, we can share our data and use third-party tools without the threat of revealing its original form.
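As a rough illustration of the "concatenate the latent and learned features from the encoding part" step, the sketch below encodes a sample with a small encoder and concatenates its bottleneck output with an intermediate feature layer. The layer sizes and the choice of intermediate features are assumptions, and the multi-objective training itself is omitted.

```python
import torch
import torch.nn as nn

# Hypothetical encoder of a trained autoencoder; sizes are illustrative only.
encoder = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 32))

x = torch.randn(4, 100)            # stand-in for raw samples
hidden = encoder[:2](x)            # intermediate learned features, shape (4, 64)
latent = encoder(x)                # bottleneck (latent) code, shape (4, 32)

# The concatenation is what would be shared in place of the raw data.
encoded_form = torch.cat([latent, hidden], dim=1)   # shape (4, 96)
```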
arXiv Detail & Related papers (2023-09-08T16:41:25Z)
- Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
- Data Encoding For Healthcare Data Democratisation and Information Leakage Prevention [23.673071967945358]
This paper argues that irreversible data encoding can provide an effective solution to achieve data democratization.
It exploits random projections and random quantum encoding to realize this framework for dense and longitudinal or time-series data.
Experimental evaluation highlights that models trained on encoded time-series data effectively uphold the information bottleneck principle.
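For the dense-data case, the random-projection ingredient can be sketched as follows; the output dimension, the Gaussian projection, and the helper name are assumptions for illustration, and the random quantum encoding used for the other data modality is not shown.

```python
import numpy as np

rng = np.random.default_rng()  # the projection seed stays with the data owner

def random_projection_encode(X, out_dim):
    """Encode samples with a private random projection.

    Choosing out_dim < X.shape[1] makes the map non-invertible, which is the
    irreversibility the paper relies on; the matrix W is never shared.
    """
    W = rng.standard_normal((X.shape[1], out_dim)) / np.sqrt(out_dim)
    return X @ W

X = rng.random((500, 128))                    # stand-in for dense clinical features
X_encoded = random_projection_encode(X, 64)   # only this (plus labels) is released
```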
arXiv Detail & Related papers (2023-05-05T17:50:50Z)
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can instead come from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
- Multi-class Classifier based Failure Prediction with Artificial and Anonymous Training for Data Privacy [0.0]
A neural-network-based multi-class classifier is developed for failure prediction.
The proposed mechanism completely decouples the dataset used for the training process from the actual data, which is kept private.
Results show high accuracy in failure prediction under different parameter configurations.
arXiv Detail & Related papers (2022-09-06T07:53:33Z)
- Discrete Key-Value Bottleneck [95.61236311369821]
Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant.
One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning.
Given a new task, however, updating the weights of these encoders is challenging because a large number of weights must be fine-tuned, and as a result they forget information about previous tasks.
We propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes.
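A minimal sketch of the key-value idea, assuming a single codebook and toy dimensions: an input feature is snapped to its nearest fixed key, and the paired learnable value is what flows to the downstream head, so task-specific tuning touches only the selected values.

```python
import torch
import torch.nn as nn

class DiscreteKeyValueBottleneck(nn.Module):
    """Illustrative single-codebook sketch, not the paper's exact architecture."""
    def __init__(self, num_pairs=256, key_dim=64, value_dim=64):
        super().__init__()
        self.register_buffer("keys", torch.randn(num_pairs, key_dim))   # keys stay fixed
        self.values = nn.Parameter(torch.zeros(num_pairs, value_dim))   # values are learnable

    def forward(self, z):
        dists = torch.cdist(z, self.keys)   # (batch, num_pairs) distances to every key
        idx = dists.argmin(dim=1)           # nearest-key quantization
        return self.values[idx]             # gradients reach only the selected values

bottleneck = DiscreteKeyValueBottleneck()
features = torch.randn(8, 64)               # stand-in for a frozen pre-trained encoder's output
out = bottleneck(features)                   # (8, 64) retrieved values, passed to a task head
```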
arXiv Detail & Related papers (2022-07-22T17:52:30Z)
- Uncertainty-Autoencoder-Based Privacy and Utility Preserving Data Type Conscious Transformation [3.7315964084413173]
We propose an adversarial learning framework that deals with the privacy-utility tradeoff problem under two conditions.
Under data-type ignorant conditions, the privacy mechanism provides a one-hot encoding of categorical features, representing exactly one class.
Under data-type aware conditions, the categorical variables are represented by a collection of scores, one for each class.
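The two release formats the summary contrasts can be made concrete with a toy categorical feature; the class names and numbers below are hypothetical.

```python
import numpy as np

classes = ["low", "medium", "high"]        # hypothetical categorical feature

# Data-type ignorant: the mechanism releases a one-hot vector,
# committing to exactly one class.
one_hot = np.array([0, 1, 0])              # encodes "medium"

# Data-type aware: the mechanism releases a score per class,
# so the released representation does not single out one class.
scores = np.array([0.15, 0.60, 0.25])      # one score for each of the three classes
```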
arXiv Detail & Related papers (2022-05-04T08:40:15Z)
- Privacy-Preserving Federated Learning via System Immersion and Random Matrix Encryption [4.258856853258348]
Federated learning (FL) has emerged as a privacy solution for collaborative distributed learning where clients train AI models directly on their devices instead of sharing their data with a centralized (potentially adversarial) server.
We propose a Privacy-Preserving Federated Learning (PPFL) framework built on the synergy of matrix encryption and system immersion tools from control theory.
We show that our algorithm provides the same level of accuracy and convergence rate as the standard FL with a negligible cost while revealing no information about clients' data.
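The full scheme immerses the learning dynamics into a higher-dimensional encrypted system using tools from control theory; the sketch below illustrates only the random-matrix-encryption ingredient on a flattened model update, with the matrix size and key handling chosen purely for exposition.

```python
import numpy as np

rng = np.random.default_rng()

def random_invertible_matrix(n):
    """Sample a random matrix to act as the (private) encryption key."""
    while True:
        M = rng.standard_normal((n, n))
        if abs(np.linalg.det(M)) > 1e-6:   # retry in the unlikely singular case
            return M

update = rng.random(16)                    # stand-in for a client's flattened model update
key = random_invertible_matrix(16)         # held by the client, never by the server
encrypted_update = key @ update            # what leaves the client

# Whoever holds the key can invert the map exactly; the server cannot.
recovered = np.linalg.solve(key, encrypted_update)
assert np.allclose(recovered, update)
```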
arXiv Detail & Related papers (2022-04-05T21:28:59Z)
- NeuraCrypt: Hiding Private Health Data via Random Neural Networks for Public Training [64.54200987493573]
We propose NeuraCrypt, a private encoding scheme based on random deep neural networks.
NeuraCrypt encodes raw patient data using a randomly constructed neural network known only to the data-owner.
We show that NeuraCrypt achieves competitive accuracy to non-private baselines on a variety of x-ray tasks.
arXiv Detail & Related papers (2021-06-04T13:42:21Z)