Private Synthetic Data for Multitask Learning and Marginal Queries
- URL: http://arxiv.org/abs/2209.07400v1
- Date: Thu, 15 Sep 2022 16:00:44 GMT
- Title: Private Synthetic Data for Multitask Learning and Marginal Queries
- Authors: Giuseppe Vietri, Cedric Archambeau, Sergul Aydore, William Brown,
Michael Kearns, Aaron Roth, Ankit Siva, Shuai Tang, Zhiwei Steven Wu
- Abstract summary: A key innovation in our algorithm is the ability to directly handle numerical features.
Eliminating the need for binning allows us to produce synthetic data preserving large numbers of statistical queries.
Our method consistently runs 2-5x faster than the best comparable techniques.
- Score: 30.123686707904543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We provide a differentially private algorithm for producing synthetic data
simultaneously useful for multiple tasks: marginal queries and multitask
machine learning (ML). A key innovation in our algorithm is the ability to
directly handle numerical features, in contrast to a number of related prior
approaches which require numerical features to be first converted into {high
cardinality} categorical features via {a binning strategy}. Higher binning
granularity is required for better accuracy, but this negatively impacts
scalability. Eliminating the need for binning allows us to produce synthetic
data preserving large numbers of statistical queries such as marginals on
numerical features, and class conditional linear threshold queries. Preserving
the latter means that the fraction of points of each class label above a
particular half-space is roughly the same in both the real and synthetic data.
This is the property that is needed to train a linear classifier in a multitask
setting. Our algorithm also allows us to produce high quality synthetic data
for mixed marginal queries, that combine both categorical and numerical
features. Our method consistently runs 2-5x faster than the best comparable
techniques, and provides significant accuracy improvements in both marginal
queries and linear prediction tasks for mixed-type datasets.
Related papers
- Online Nonparametric Supervised Learning for Massive Data [0.0]
We develop a fast algorithm adapted to the real-time calculation of the nonparametric classifier in massive as well as streaming data frameworks.
The proposed methods are evaluated and compared to some commonly used machine learning algorithms for real-time fetal well-being monitoring.
arXiv Detail & Related papers (2024-05-29T20:04:23Z) - Spectral Clustering of Categorical and Mixed-type Data via Extra Graph
Nodes [0.0]
This paper explores a more natural way to incorporate both numerical and categorical information into the spectral clustering algorithm.
We propose adding extra nodes corresponding to the different categories the data may belong to and show that it leads to an interpretable clustering objective function.
We demonstrate that this simple framework leads to a linear-time spectral clustering algorithm for categorical-only data.
arXiv Detail & Related papers (2024-03-08T20:49:49Z) - A Weighted K-Center Algorithm for Data Subset Selection [70.49696246526199]
Subset selection is a fundamental problem that can play a key role in identifying smaller portions of the training data.
We develop a novel factor 3-approximation algorithm to compute subsets based on the weighted sum of both k-center and uncertainty sampling objective functions.
arXiv Detail & Related papers (2023-12-17T04:41:07Z) - Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods with much fewer computational resources.
arXiv Detail & Related papers (2023-07-19T04:07:33Z) - Nonlinear Feature Aggregation: Two Algorithms driven by Theory [45.3190496371625]
Real-world machine learning applications are characterized by a huge number of features, leading to computational and memory issues.
We propose a dimensionality reduction algorithm (NonLinCFA) which aggregates non-linear transformations of features with a generic aggregation function.
We also test the algorithms on synthetic and real-world datasets, performing regression and classification tasks, showing competitive performances.
arXiv Detail & Related papers (2023-06-19T19:57:33Z) - Practical Approaches for Fair Learning with Multitype and Multivariate
Sensitive Attributes [70.6326967720747]
It is important to guarantee that machine learning algorithms deployed in the real world do not result in unfairness or unintended social consequences.
We introduce FairCOCCO, a fairness measure built on cross-covariance operators on reproducing kernel Hilbert Spaces.
We empirically demonstrate consistent improvements against state-of-the-art techniques in balancing predictive power and fairness on real-world datasets.
arXiv Detail & Related papers (2022-11-11T11:28:46Z) - Asymmetric Scalable Cross-modal Hashing [51.309905690367835]
Cross-modal hashing is a successful method to solve large-scale multimedia retrieval issue.
We propose a novel Asymmetric Scalable Cross-Modal Hashing (ASCMH) to address these issues.
Our ASCMH outperforms the state-of-the-art cross-modal hashing methods in terms of accuracy and efficiency.
arXiv Detail & Related papers (2022-07-26T04:38:47Z) - Efficient and Near-Optimal Smoothed Online Learning for Generalized
Linear Functions [28.30744223973527]
We give a computationally efficient algorithm that is the first to enjoy the statistically optimal log(T/sigma) regret for realizable K-wise linear classification.
We develop a novel characterization of the geometry of the disagreement region induced by generalized linear classifiers.
arXiv Detail & Related papers (2022-05-25T21:31:36Z) - AutoSimulate: (Quickly) Learning Synthetic Data Generation [70.82315853981838]
We propose an efficient alternative for optimal synthetic data generation based on a novel differentiable approximation of the objective.
We demonstrate that the proposed method finds the optimal data distribution faster (up to $50times$), with significantly reduced training data generation (up to $30times$) and better accuracy ($+8.7%$) on real-world test datasets than previous methods.
arXiv Detail & Related papers (2020-08-16T11:36:11Z) - Supervised Quantile Normalization for Low-rank Matrix Approximation [50.445371939523305]
We learn the parameters of quantile normalization operators that can operate row-wise on the values of $X$ and/or of its factorization $UV$ to improve the quality of the low-rank representation of $X$ itself.
We demonstrate the applicability of these techniques on synthetic and genomics datasets.
arXiv Detail & Related papers (2020-02-08T21:06:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.