Balance is key: Private median splits yield high-utility random trees
- URL: http://arxiv.org/abs/2006.08795v2
- Date: Sat, 20 Feb 2021 01:18:15 GMT
- Title: Balance is key: Private median splits yield high-utility random trees
- Authors: Shorya Consul, Sinead A. Williamson
- Abstract summary: We propose DiPriMe forests, a novel tree-based ensemble method for differentially private regression and classification.
We show theoretically and empirically that the resulting algorithm exhibits high utility, while ensuring differential privacy.
- Score: 4.90655427233754
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Random forests are a popular method for classification and regression due to
their versatility. However, this flexibility can come at the cost of user
privacy, since training random forests requires multiple data queries, often on
small, identifiable subsets of the training data. Privatizing these queries
typically comes at a high utility cost, in large part because we are
privatizing queries on small subsets of the data, which are easily corrupted by
added noise. In this paper, we propose DiPriMe forests, a novel tree-based
ensemble method for differentially private regression and classification, which
is appropriate for real or categorical covariates. We generate splits using a
differentially private version of the median, which encourages balanced leaf
nodes. By avoiding low-occupancy leaf nodes, we avoid low signal-to-noise
ratios when privatizing the leaf-node sufficient statistics. We show
theoretically and empirically that the resulting algorithm exhibits high
utility, while ensuring differential privacy.
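The two mechanisms the abstract describes — choosing splits with a differentially private median and privatizing leaf sufficient statistics with additive noise — can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `private_median` uses the exponential mechanism over sorted values, `private_leaf_mean` uses the Laplace mechanism with the budget split evenly between count and sum, and all function names and parameter choices here are hypothetical.

```python
import math
import random

def laplace_noise(scale):
    # Sample from a Laplace(0, scale) distribution via inverse CDF.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_median(values, epsilon):
    """Exponential-mechanism median: sample a data point whose rank is
    close to n/2. Rank utility has sensitivity 1 (adding or removing one
    record shifts each rank by at most 1), so weights use exp(eps*u/2)."""
    xs = sorted(values)
    n = len(xs)
    utilities = [-abs(i - n / 2.0) for i in range(n)]
    weights = [math.exp(epsilon * u / 2.0) for u in utilities]
    r = random.uniform(0.0, sum(weights))
    acc = 0.0
    for x, w in zip(xs, weights):
        acc += w
        if acc >= r:
            return x
    return xs[-1]

def private_leaf_mean(values, epsilon, value_range):
    """Laplace-privatized leaf sufficient statistics (count and sum).
    Values are clipped to value_range so the sum has bounded sensitivity;
    half the budget goes to the count, half to the sum."""
    lo, hi = value_range
    clipped = [min(max(v, lo), hi) for v in values]
    sum_sensitivity = max(abs(lo), abs(hi))
    noisy_count = len(clipped) + laplace_noise(2.0 / epsilon)
    noisy_sum = sum(clipped) + laplace_noise(2.0 * sum_sensitivity / epsilon)
    return noisy_sum / max(noisy_count, 1.0)
```

Because the private median encourages balanced children, each leaf keeps enough points that the Laplace noise added to its count and sum stays small relative to the true statistics — the balance property the paper's title refers to.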
Related papers
- Optimal Unbiased Randomizers for Regression with Label Differential
Privacy [61.63619647307816]
We propose a new family of label randomizers for training regression models under the constraint of label differential privacy (DP)
We demonstrate that these randomizers achieve state-of-the-art privacy-utility trade-offs on several datasets.
arXiv Detail & Related papers (2023-12-09T19:58:34Z) - Little is Enough: Improving Privacy by Sharing Labels in Federated Semi-Supervised Learning [10.972006295280636]
In many critical applications, sensitive data is inherently distributed and cannot be centralized due to privacy concerns.
Most of these approaches either share local model parameters, soft predictions on a public dataset, or a combination of both.
This, however, still discloses private information and restricts local models to those that lend themselves to training via gradient-based methods.
We propose to share only hard labels on a public unlabeled dataset, and use a consensus over the shared labels as a pseudo-labeling to be used by clients.
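The consensus step described in the summary — clients share only hard labels on a public unlabeled set, and a vote over those labels yields pseudo-labels — reduces to a majority vote. A minimal sketch, assuming each client contributes one label per public example (the function name and data layout are illustrative, not from the paper):

```python
from collections import Counter

def consensus_pseudo_labels(client_label_lists):
    """Majority vote over hard labels. client_label_lists[c][i] is the
    label client c assigns to the i-th example of the shared public
    unlabeled dataset; the winning label becomes its pseudo-label."""
    n_examples = len(client_label_lists[0])
    pseudo = []
    for i in range(n_examples):
        votes = Counter(labels[i] for labels in client_label_lists)
        pseudo.append(votes.most_common(1)[0][0])
    return pseudo
```

Sharing only hard labels, rather than parameters or soft predictions, both limits the information disclosed and frees clients to use models that are not trained by gradient descent.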
arXiv Detail & Related papers (2023-10-09T13:16:10Z) - Differentially-Private Decision Trees and Provable Robustness to Data
Poisoning [8.649768969060647]
Decision trees are interpretable models that are well-suited to non-linear learning problems.
Current state-of-the-art algorithms for this purpose sacrifice much utility for a small privacy benefit.
We propose PrivaTree based on private histograms that chooses good splits while consuming a small privacy budget.
arXiv Detail & Related papers (2023-05-24T17:56:18Z) - Smooth Anonymity for Sparse Graphs [69.1048938123063]
Differential privacy has emerged as the gold standard of privacy; however, it can be difficult to apply when sharing sparse datasets.
In this work, we consider a variation of $k$-anonymity, which we call smooth-$k$-anonymity, and design simple large-scale algorithms that efficiently provide smooth-$k$-anonymity.
arXiv Detail & Related papers (2022-07-13T17:09:25Z) - Mixed Differential Privacy in Computer Vision [133.68363478737058]
AdaMix is an adaptive differentially private algorithm for training deep neural network classifiers using both private and public image data.
A few-shot or even zero-shot learning baseline that ignores private data can outperform fine-tuning on a large private dataset.
arXiv Detail & Related papers (2022-03-22T06:15:43Z) - Privacy-Preserving Training of Tree Ensembles over Continuous Data [9.887824375079553]
Most existing protocols for privacy-preserving training of decision trees over distributed data assume that the features are categorical.
Sorting is an expensive operation in MPC, hence finding secure protocols that avoid such an expensive step is a relevant problem in privacy-preserving machine learning.
We propose three more efficient alternatives for secure training of decision tree based models on data with continuous features.
arXiv Detail & Related papers (2021-06-05T01:28:59Z) - Learning with User-Level Privacy [61.62978104304273]
We analyze algorithms to solve a range of learning tasks under user-level differential privacy constraints.
Rather than guaranteeing only the privacy of individual samples, user-level DP protects a user's entire contribution.
We derive an algorithm that privately answers a sequence of $K$ adaptively chosen queries with privacy cost proportional to $\tau$, and apply it to solve the learning tasks we consider.
arXiv Detail & Related papers (2021-02-23T18:25:13Z) - Hiding Among the Clones: A Simple and Nearly Optimal Analysis of Privacy
Amplification by Shuffling [49.43288037509783]
We show that random shuffling amplifies differential privacy guarantees of locally randomized data.
Our result is based on a new approach that is simpler than previous work and extends to approximate differential privacy with nearly the same guarantees.
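The shuffle model this entry analyzes is simple to sketch: each user applies a local randomizer (here binary randomized response, a standard choice used for illustration — the paper's analysis covers general local randomizers), and a trusted shuffler permutes the reports before release. Function names are assumptions:

```python
import math
import random

def randomized_response(bit, epsilon):
    """epsilon-DP local randomizer for one bit: report the true bit
    with probability e^eps / (1 + e^eps), otherwise flip it."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return bit if random.random() < p_keep else 1 - bit

def shuffled_reports(bits, epsilon):
    """Randomize every user's bit locally, then shuffle the reports.
    The shuffle breaks the link between users and reports, which is
    what amplifies the central DP guarantee beyond the local epsilon."""
    reports = [randomized_response(b, epsilon) for b in bits]
    random.shuffle(reports)
    return reports
```

The amplification result says the shuffled output satisfies a much stronger central differential privacy guarantee than the per-user local epsilon alone would suggest.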
arXiv Detail & Related papers (2020-12-23T17:07:26Z) - Graph-Homomorphic Perturbations for Private Decentralized Learning [64.26238893241322]
Local exchange of estimates can allow adversaries to infer private data.
Existing schemes add perturbations chosen independently at every agent, resulting in a significant performance loss.
We propose an alternative scheme, which constructs perturbations according to a particular nullspace condition, allowing them to be invisible.
arXiv Detail & Related papers (2020-10-23T10:35:35Z) - Privacy Amplification via Random Check-Ins [38.72327434015975]
Differentially Private Gradient Descent (DP-SGD) forms a fundamental building block in many applications for learning over sensitive data.
In this paper, we focus on conducting iterative methods like DP-SGD in the setting of federated learning (FL) wherein the data is distributed among many devices (clients)
Our main contribution is the random check-in distributed protocol, which crucially relies only on randomized participation decisions made locally and independently by each client.
arXiv Detail & Related papers (2020-07-13T18:14:09Z) - Fréchet random forests for metric space valued regression with non-Euclidean predictors [0.0]
We introduce Fréchet trees and Fréchet random forests, which allow handling data for which input and output variables take values in general metric spaces.
A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees.
arXiv Detail & Related papers (2019-06-04T22:07:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.