PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels
- URL: http://arxiv.org/abs/2304.00047v1
- Date: Fri, 31 Mar 2023 18:03:53 GMT
- Title: PEOPL: Characterizing Privately Encoded Open Datasets with Public Labels
- Authors: Homa Esfahanizadeh, Adam Yala, Rafael G. L. D'Oliveira, Andrea J. D. Jaba, Victor Quach, Ken R. Duffy, Tommi S. Jaakkola, Vinod Vaikuntanathan, Manya Ghobadi, Regina Barzilay, Muriel Médard
- Abstract summary: We introduce information-theoretic scores for privacy and utility, which quantify the average performance of an unfaithful user.
We then theoretically characterize primitives in building families of encoding schemes that motivate the use of random deep neural networks.
- Score: 59.66777287810985
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Allowing organizations to share their data for training machine learning (ML) models without unintended information leakage is an open problem in practice. A promising technique is to train models on encoded data. Our approach, called Privately Encoded Open Datasets with
Public Labels (PEOPL), uses a certain class of randomly constructed transforms
to encode sensitive data. Organizations publish their randomly encoded data and
associated raw labels for ML training, where training is done without knowledge
of the encoding realization. We investigate several important aspects of this
problem: We introduce information-theoretic scores for privacy and utility,
which quantify the average performance of an unfaithful user (e.g., adversary)
and a faithful user (e.g., model developer) who have access to the published encoded data. We then theoretically characterize primitives for building
families of encoding schemes that motivate the use of random deep neural
networks. Empirically, we evaluate the robustness of our randomized encoding scheme and a linear scheme against a suite of computational attacks, and we also show that our scheme achieves prediction accuracy competitive with raw-sample baselines. Moreover, we demonstrate that multiple institutions, using
independent random encoders, can collaborate to train improved ML models.
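The workflow the abstract describes can be pictured concretely: a data owner fixes a randomly constructed neural network as a private encoder, publishes only the encoded features together with the raw labels, and a model developer trains a classifier on that release without ever seeing the encoder. The following is a minimal sketch of that pipeline; the MLP architecture, dimensions, and training loop are illustrative assumptions, not the paper's exact construction.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # the data owner keeps the true random realization private

# Randomly constructed encoder known only to the data owner (never published).
# Architecture and sizes are illustrative; PEOPL studies families of such transforms.
encoder = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, 256),
)
for p in encoder.parameters():
    p.requires_grad_(False)  # the encoder is fixed, never trained

x_private = torch.randn(1000, 784)        # stand-in for raw sensitive samples
y_public = torch.randint(0, 10, (1000,))  # labels are published as-is

# The organization releases only (encoded data, labels).
with torch.no_grad():
    z_published = encoder(x_private)

# A model developer trains on the encoded release without knowing the encoder.
classifier = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(5):
    opt.zero_grad()
    loss = loss_fn(classifier(z_published), y_public)
    loss.backward()
    opt.step()
```

Privacy in this picture rests on the encoder's random realization staying secret, which is exactly what the paper's information-theoretic privacy score is meant to quantify for an unfaithful user.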
Related papers
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks, and agent systems.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
- Robust Representation Learning for Privacy-Preserving Machine Learning: A Multi-Objective Autoencoder Approach [0.9831489366502302]
We propose a robust representation learning framework for privacy-preserving machine learning (ppML).
Our method centers on training autoencoders in a multi-objective manner and then concatenating the latent and learned features from the encoding part as the encoded form of our data.
With our proposed framework, we can share our data and use third-party tools without the threat of revealing its original form.
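As a rough illustration of the "concatenate the latent and learned features from the encoding part" step, the sketch below encodes a sample with a small encoder and concatenates its bottleneck output with an intermediate feature layer. The layer sizes and the choice of intermediate features are assumptions, and the multi-objective training itself is omitted.

```python
import torch
import torch.nn as nn

# Hypothetical encoder of a trained autoencoder; sizes are illustrative only.
encoder = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 32))

x = torch.randn(4, 100)            # stand-in for raw samples
hidden = encoder[:2](x)            # intermediate learned features, shape (4, 64)
latent = encoder(x)                # bottleneck (latent) code, shape (4, 32)

# The concatenation is what would be shared in place of the raw data.
encoded_form = torch.cat([latent, hidden], dim=1)   # shape (4, 96)
```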
arXiv Detail & Related papers (2023-09-08T16:41:25Z)
- Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
- Data Encoding For Healthcare Data Democratisation and Information Leakage Prevention [23.673071967945358]
This paper argues that irreversible data encoding can provide an effective solution to achieve data democratization.
It exploits random projections and random quantum encoding to realize this framework for dense and longitudinal or time-series data.
Experimental evaluation highlights that models trained on encoded time-series data effectively uphold the information bottleneck principle.
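For the dense-data case, the random-projection ingredient can be sketched as follows; the output dimension, the Gaussian projection, and the helper name are assumptions for illustration, and the random quantum encoding used for the other data modality is not shown.

```python
import numpy as np

rng = np.random.default_rng()  # the projection seed stays with the data owner

def random_projection_encode(X, out_dim):
    """Encode samples with a private random projection.

    Choosing out_dim < X.shape[1] makes the map non-invertible, which is the
    irreversibility the paper relies on; the matrix W is never shared.
    """
    W = rng.standard_normal((X.shape[1], out_dim)) / np.sqrt(out_dim)
    return X @ W

X = rng.random((500, 128))                    # stand-in for dense clinical features
X_encoded = random_projection_encode(X, 64)   # only this (plus labels) is released
```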
arXiv Detail & Related papers (2023-05-05T17:50:50Z)
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can instead come from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
- Multi-class Classifier based Failure Prediction with Artificial and Anonymous Training for Data Privacy [0.0]
A neural-network-based multi-class classifier is developed for failure prediction.
The proposed mechanism completely decouples the dataset used for the training process from the actual data, which is kept private.
Results show high accuracy in failure prediction under different parameter configurations.
arXiv Detail & Related papers (2022-09-06T07:53:33Z)
- Discrete Key-Value Bottleneck [95.61236311369821]
Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant.
One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning.
Given a new task, however, updating the weights of these encoders is challenging because a large number of weights must be fine-tuned, and as a result they forget information about previous tasks.
We propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes.
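A minimal sketch of the key-value idea, assuming a single codebook and toy dimensions: an input feature is snapped to its nearest fixed key, and the paired learnable value is what flows to the downstream head, so task-specific tuning touches only the selected values.

```python
import torch
import torch.nn as nn

class DiscreteKeyValueBottleneck(nn.Module):
    """Illustrative single-codebook sketch, not the paper's exact architecture."""
    def __init__(self, num_pairs=256, key_dim=64, value_dim=64):
        super().__init__()
        self.register_buffer("keys", torch.randn(num_pairs, key_dim))   # keys stay fixed
        self.values = nn.Parameter(torch.zeros(num_pairs, value_dim))   # values are learnable

    def forward(self, z):
        dists = torch.cdist(z, self.keys)   # (batch, num_pairs) distances to every key
        idx = dists.argmin(dim=1)           # nearest-key quantization
        return self.values[idx]             # gradients reach only the selected values

bottleneck = DiscreteKeyValueBottleneck()
features = torch.randn(8, 64)               # stand-in for a frozen pre-trained encoder's output
out = bottleneck(features)                   # (8, 64) retrieved values, passed to a task head
```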
arXiv Detail & Related papers (2022-07-22T17:52:30Z)
- Uncertainty-Autoencoder-Based Privacy and Utility Preserving Data Type Conscious Transformation [3.7315964084413173]
We propose an adversarial learning framework that deals with the privacy-utility tradeoff problem under two conditions.
Under data-type ignorant conditions, the privacy mechanism provides a one-hot encoding of categorical features, representing exactly one class.
Under data-type aware conditions, the categorical variables are represented by a collection of scores, one for each class.
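The two release formats the summary contrasts can be made concrete with a toy categorical feature; the class names and numbers below are hypothetical.

```python
import numpy as np

classes = ["low", "medium", "high"]        # hypothetical categorical feature

# Data-type ignorant: the mechanism releases a one-hot vector,
# committing to exactly one class.
one_hot = np.array([0, 1, 0])              # encodes "medium"

# Data-type aware: the mechanism releases a score per class,
# so the released representation does not single out one class.
scores = np.array([0.15, 0.60, 0.25])      # one score for each of the three classes
```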
arXiv Detail & Related papers (2022-05-04T08:40:15Z)
- Privacy-Preserving Federated Learning via System Immersion and Random Matrix Encryption [4.258856853258348]
Federated learning (FL) has emerged as a privacy solution for collaborative distributed learning where clients train AI models directly on their devices instead of sharing their data with a centralized (potentially adversarial) server.
We propose a Privacy-Preserving Federated Learning (PPFL) framework built on the synergy of matrix encryption and system immersion tools from control theory.
We show that our algorithm provides the same level of accuracy and convergence rate as the standard FL with a negligible cost while revealing no information about clients' data.
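The full scheme immerses the learning dynamics into a higher-dimensional encrypted system using tools from control theory; the sketch below illustrates only the random-matrix-encryption ingredient on a flattened model update, with the matrix size and key handling chosen purely for exposition.

```python
import numpy as np

rng = np.random.default_rng()

def random_invertible_matrix(n):
    """Sample a random matrix to act as the (private) encryption key."""
    while True:
        M = rng.standard_normal((n, n))
        if abs(np.linalg.det(M)) > 1e-6:   # retry in the unlikely singular case
            return M

update = rng.random(16)                    # stand-in for a client's flattened model update
key = random_invertible_matrix(16)         # held by the client, never by the server
encrypted_update = key @ update            # what leaves the client

# Whoever holds the key can invert the map exactly; the server cannot.
recovered = np.linalg.solve(key, encrypted_update)
assert np.allclose(recovered, update)
```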
arXiv Detail & Related papers (2022-04-05T21:28:59Z)
- NeuraCrypt: Hiding Private Health Data via Random Neural Networks for Public Training [64.54200987493573]
We propose NeuraCrypt, a private encoding scheme based on random deep neural networks.
NeuraCrypt encodes raw patient data using a randomly constructed neural network known only to the data-owner.
We show that NeuraCrypt achieves competitive accuracy to non-private baselines on a variety of x-ray tasks.
arXiv Detail & Related papers (2021-06-04T13:42:21Z)