Privacy-Preserving Training of Tree Ensembles over Continuous Data
- URL: http://arxiv.org/abs/2106.02769v1
- Date: Sat, 5 Jun 2021 01:28:59 GMT
- Title: Privacy-Preserving Training of Tree Ensembles over Continuous Data
- Authors: Samuel Adams, Chaitali Choudhary, Martine De Cock, Rafael Dowsley,
David Melanson, Anderson C. A. Nascimento, Davis Railsback, Jianwei Shen
- Abstract summary: Most existing protocols for privacy-preserving training of decision trees over distributed data assume that the features are categorical.
Sorting is an expensive operation in MPC, hence finding secure protocols that avoid such an expensive step is a relevant problem in privacy-preserving machine learning.
We propose three more efficient alternatives for secure training of decision tree based models on data with continuous features.
- Score: 9.887824375079553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing Secure Multi-Party Computation (MPC) protocols for
privacy-preserving training of decision trees over distributed data assume that
the features are categorical. In real-life applications, features are often
numerical. The standard "in the clear" algorithm to grow decision trees on
data with continuous values requires sorting of training examples for each
feature in the quest for an optimal cut-point in the range of feature values in
each node. Sorting is an expensive operation in MPC, hence finding secure
protocols that avoid such an expensive step is a relevant problem in
privacy-preserving machine learning. In this paper we propose three more
efficient alternatives for secure training of decision tree based models on
data with continuous features, namely: (1) secure discretization of the data,
followed by secure training of a decision tree over the discretized data; (2)
secure discretization of the data, followed by secure training of a random
forest over the discretized data; and (3) secure training of extremely
randomized trees ("extra-trees") on the original data. Approaches (2) and (3)
both involve randomizing feature choices. In addition, in approach (3)
cut-points are chosen randomly as well, thereby alleviating the need to sort or
to discretize the data up front. We implemented all proposed solutions in the
semi-honest setting with additive secret sharing based MPC. In addition to
mathematically proving that all proposed approaches are correct and secure, we
experimentally evaluated and compared them in terms of classification accuracy
and runtime. We privately train tree ensembles over data sets with 1000s of
instances or features in a few minutes, with accuracies that are on par with
those obtained in the clear. This makes our solution orders of magnitude more
efficient than the existing approaches, which are based on oblivious sorting.
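The key idea behind approach (3) is that an extra-trees split needs only the minimum and maximum of a feature's values, never a sort. The sketch below is a plaintext (non-MPC) illustration of that split rule, assuming binary 0/1 labels; the function name and structure are illustrative, not taken from the paper's implementation, whose actual protocol evaluates these steps under additive secret sharing.

```python
import random

def random_cutpoint_split(X, y, n_candidates=1, seed=0):
    """Extra-trees-style split: pick a random feature and a random
    cut-point in its range, so no sorting of training examples is needed.
    Plaintext sketch only; the paper runs the equivalent logic in MPC."""
    rng = random.Random(seed)
    n_features = len(X[0])

    def gini(labels):
        # Gini impurity for binary 0/1 labels.
        if not labels:
            return 0.0
        p = sum(labels) / len(labels)
        return 2 * p * (1 - p)

    best = None
    for _ in range(n_candidates):
        f = rng.randrange(n_features)
        col = [row[f] for row in X]
        lo, hi = min(col), max(col)   # only min/max, not a full sort
        t = rng.uniform(lo, hi)       # random cut-point in the feature range
        left = [lab for row, lab in zip(X, y) if row[f] <= t]
        right = [lab for row, lab in zip(X, y) if row[f] > t]
        # Weighted Gini impurity of the induced partition.
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, f, t)
    return best  # (impurity, feature index, threshold)
```

Because the cut-point is drawn uniformly from the feature's range, the only data-dependent computations are a min, a max, and a comparison per example, all of which are far cheaper than oblivious sorting in MPC.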
Related papers
- Differentially-Private Decision Trees and Provable Robustness to Data
Poisoning [8.649768969060647]
Decision trees are interpretable models that are well-suited to non-linear learning problems.
Current state-of-the-art algorithms for this purpose sacrifice much utility for a small privacy benefit.
We propose PrivaTree, based on private histograms, which chooses good splits while consuming a small privacy budget.
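A "private histogram" in this setting generally means bin counts perturbed with calibrated noise before any split decision reads them. The sketch below shows the standard Laplace-mechanism version as a generic illustration; it is not the PrivaTree implementation, and the function name and binning scheme are assumptions for the example.

```python
import random, math

def dp_histogram(values, bins, epsilon, seed=0):
    """Differentially private histogram via the Laplace mechanism.
    Generic illustration of private histograms for split selection,
    not the PrivaTree algorithm itself."""
    rng = random.Random(seed)
    counts = [0] * (len(bins) - 1)
    for v in values:
        for i in range(len(bins) - 1):
            # Half-open bins [b_i, b_{i+1}), with the top edge included.
            if bins[i] <= v < bins[i + 1] or (i == len(bins) - 2 and v == bins[-1]):
                counts[i] += 1
                break

    def laplace(scale):
        # Inverse-CDF sampling of a Laplace(0, scale) variate.
        u = rng.random() - 0.5
        return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

    # Each example affects exactly one bin, so the sensitivity is 1
    # and Laplace noise with scale 1/epsilon suffices.
    return [c + laplace(1.0 / epsilon) for c in counts]
```

Split quality is then evaluated on the noisy counts, so the privacy cost of choosing a split is bounded by the budget spent on the histogram rather than on the raw data.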
arXiv Detail & Related papers (2023-05-24T17:56:18Z) - In all LikelihoodS: How to Reliably Select Pseudo-Labeled Data for
Self-Training in Semi-Supervised Learning [0.0]
Self-training is a simple yet effective method within semi-supervised learning.
In this paper, we aim at rendering pseudo-label selection (PLS) more robust towards the involved modeling assumptions.
Results suggest that in particular robustness w.r.t. model choice can lead to substantial accuracy gains.
arXiv Detail & Related papers (2023-03-02T10:00:37Z) - Policy learning "without" overlap: Pessimism and generalized empirical
Bernstein's inequality [107.84979976896912]
Offline policy learning aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics are lower bounded in the offline dataset.
We propose a new algorithm that optimizes lower confidence bounds (LCBs) -- instead of point estimates -- of the policy values.
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Discrete Tree Flows via Tree-Structured Permutations [5.929956715430168]
Discrete flow-based models cannot be straightforwardly optimized with conventional deep learning methods because gradients of discrete functions are undefined or zero.
Our approach seeks to reduce computational burden and remove the need for pseudo-gradients by developing a discrete flow based on decision trees.
arXiv Detail & Related papers (2022-07-04T23:11:04Z) - Open-Set Semi-Supervised Learning for 3D Point Cloud Understanding [62.17020485045456]
It is commonly assumed in semi-supervised learning (SSL) that the unlabeled data are drawn from the same distribution as that of the labeled ones.
We propose to selectively utilize unlabeled data through sample weighting, so that only conducive unlabeled data would be prioritized.
arXiv Detail & Related papers (2022-05-02T16:09:17Z) - Learning to Hash Naturally Sorts [84.90210592082829]
We introduce Naturally-Sorted Hashing (NSH) to train a deep hashing model with sorted results end-to-end.
NSH sorts the Hamming distances of samples' hash codes and accordingly gathers their latent representations for self-supervised training.
We describe a novel Sorted Noise-Contrastive Estimation (SortedNCE) loss that selectively picks positive and negative samples for contrastive learning.
arXiv Detail & Related papers (2022-01-31T16:19:02Z) - Compactness Score: A Fast Filter Method for Unsupervised Feature
Selection [66.84571085643928]
We propose a fast unsupervised feature selection method, named Compactness Score (CSUFS), to select desired features.
Our proposed algorithm is shown to be more accurate and efficient than existing algorithms.
arXiv Detail & Related papers (2022-01-31T13:01:37Z) - Optimal randomized classification trees [0.0]
Classification and Regression Trees (CARTs) are off-the-shelf techniques in modern Statistics and Machine Learning.
CARTs are built by means of a greedy procedure, sequentially deciding the splitting predictor variable(s) and the associated threshold.
This greedy approach trains trees very fast, but, by its nature, their classification accuracy may not be competitive against other state-of-the-art procedures.
arXiv Detail & Related papers (2021-10-19T11:41:12Z) - Scalable and Provably Accurate Algorithms for Differentially Private
Distributed Decision Tree Learning [34.79337646727395]
This paper introduces the first provably accurate algorithms for differentially private, top-down decision tree learning in the distributed setting.
We propose DP-TopDown, a general privacy preserving decision tree learning algorithm, and present two distributed implementations.
arXiv Detail & Related papers (2020-12-19T06:09:36Z) - New Oracle-Efficient Algorithms for Private Synthetic Data Release [52.33506193761153]
We present three new algorithms for constructing differentially private synthetic data.
The algorithms satisfy differential privacy even in the worst case.
Compared to the state-of-the-art method, the High-Dimensional Matrix Mechanism [McKennaMHM18], our algorithms provide better accuracy on large workloads.
arXiv Detail & Related papers (2020-07-10T15:46:05Z) - Learning while Respecting Privacy and Robustness to Distributional
Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.