PADME-SoSci: A Platform for Analytics and Distributed Machine Learning
for the Social Sciences
- URL: http://arxiv.org/abs/2303.18200v2
- Date: Mon, 3 Apr 2023 07:27:28 GMT
- Title: PADME-SoSci: A Platform for Analytics and Distributed Machine Learning
for the Social Sciences
- Authors: Zeyd Boukhers and Arnim Bleier and Yeliz Ucer Yediel and Mio
Hienstorfer-Heitmann and Mehrshad Jaberansary and Adamantios Koumpis and Oya
Beyan
- Abstract summary: PADME is a distributed analytics tool that federates model implementation and training.
It enables the analysis of data across locations while still allowing the model to be trained as if all data were in a single location.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data privacy and ownership are significant in social data science, raising
legal and ethical concerns. Sharing and analyzing data is difficult when
different parties own different parts of it. An approach to this challenge is
to apply de-identification or anonymization techniques to the data before
collecting it for analysis. However, this can reduce data utility and increase
the risk of re-identification. To address these limitations, we present PADME,
a distributed analytics tool that federates model implementation and training.
PADME uses a federated approach where the model is implemented and deployed by
all parties and visits each data location incrementally for training. This
enables the analysis of data across locations while still allowing the model to
be trained as if all data were in a single location. Training the model on data
in its original location preserves data ownership. Furthermore, the results are
not provided until the analysis is completed on all data locations to ensure
privacy and avoid bias in the results.
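The incremental "model visits each data location" training described in the abstract can be sketched as follows. This is a minimal illustration under assumed names (`local_update`, `train_incrementally` are made up here), not the actual PADME implementation:

```python
# Minimal sketch of incremental training across data sites: the model
# travels from site to site and is updated on each site's local data,
# so raw data never leaves its owner. Names are illustrative only.

def local_update(model, data, lr=0.01):
    """One pass of gradient descent for a 1-D linear model y = w*x."""
    w = model["w"]
    for x, y in data:
        grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
        w -= lr * grad
    return {"w": w}

def train_incrementally(model, sites, rounds=6):
    """The model visits each site in turn; only parameters move."""
    for _ in range(rounds):
        for site_data in sites:
            model = local_update(model, site_data)
    return model

sites = [
    [(1.0, 2.0), (2.0, 4.0)],   # site A's local data (y = 2x)
    [(3.0, 6.0), (4.0, 8.0)],   # site B's local data
]
final = train_incrementally({"w": 0.0}, sites)
```

Because each site only ever sees the current parameters, the result approaches what training on the pooled data would give, which is the property the abstract emphasizes.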
Related papers
- Federated Learning with Projected Trajectory Regularization [65.6266768678291]
Federated learning enables joint training of machine learning models from distributed clients without sharing their local data.
One key challenge in federated learning is to handle non-identically distributed data across the clients.
We propose a novel federated learning framework with projected trajectory regularization (FedPTR) for tackling the data issue.
arXiv Detail & Related papers (2023-12-22T02:12:08Z)
- Benchmarking FedAvg and FedCurv for Image Classification Tasks [1.376408511310322]
This paper focuses on the problem of statistical heterogeneity of the data in the same federated network.
Several Federated Learning algorithms, such as FedAvg, FedProx and Federated Curvature (FedCurv) have already been proposed.
As a side product of this work, we release the non-IID versions of the datasets we used to facilitate further comparisons within the FL community.
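The FedAvg baseline named in this entry reduces to a weighted parameter average; a minimal sketch (function name illustrative, not a library API):

```python
# Minimal sketch of FedAvg-style aggregation: the server averages the
# clients' model parameters weighted by each client's dataset size;
# no raw data is shared with the server.

def fedavg(client_weights, client_sizes):
    """Weighted average of per-client parameter vectors."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    avg = [0.0] * dim
    for w, n in zip(client_weights, client_sizes):
        for i in range(dim):
            avg[i] += (n / total) * w[i]
    return avg

# Two clients with different data volumes: the larger client dominates.
global_w = fedavg([[1.0, 0.0], [3.0, 2.0]], client_sizes=[100, 300])
```

Variants such as FedProx and FedCurv keep this aggregation step but change the local objective to counter the statistical heterogeneity the entry describes.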
arXiv Detail & Related papers (2023-03-31T10:13:01Z)
- Membership Inference Attacks against Synthetic Data through Overfitting Detection [84.02632160692995]
We argue for a realistic MIA setting that assumes the attacker has some knowledge of the underlying data distribution.
We propose DOMIAS, a density-based MIA model that aims to infer membership by targeting local overfitting of the generative model.
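The density-based idea behind this entry can be sketched as a density ratio: a point is flagged where the synthetic-data density exceeds the reference (population) density, hinting at local overfitting of the generator. Gaussian fits stand in for real density estimators here; this is an illustration, not the paper's DOMIAS implementation:

```python
# Density-ratio membership score: p_synth(x) / p_ref(x).
# A ratio well above 1 suggests the generator memorized points near x.
from statistics import NormalDist, mean, stdev

def fit_gaussian(sample):
    return NormalDist(mean(sample), stdev(sample))

def membership_score(x, synthetic_sample, reference_sample):
    """Ratio of fitted densities at x; > 1 hints at membership."""
    p_synth = fit_gaussian(synthetic_sample).pdf(x)
    p_ref = fit_gaussian(reference_sample).pdf(x)
    return p_synth / p_ref

# Synthetic data clusters tightly around 5.0; the reference is broader.
synth = [4.9, 5.0, 5.1, 5.0, 4.95, 5.05]
ref = [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
score_member = membership_score(5.0, synth, ref)      # near the cluster
score_nonmember = membership_score(9.0, synth, ref)   # far from it
```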
arXiv Detail & Related papers (2023-02-24T11:27:39Z)
- Distributed sequential federated learning [0.0]
We develop a data-driven method for efficiently and effectively aggregating valuable information by analyzing local data.
We use numerical studies of simulated data and an application to COVID-19 data collected from 32 hospitals in Mexico.
arXiv Detail & Related papers (2023-01-31T21:20:45Z)
- Utility Assessment of Synthetic Data Generation Methods [0.0]
We investigate whether different methods of generating fully synthetic data vary in their utility a priori.
We find some methods to perform better than others across the board.
We do get promising findings for classification tasks when using synthetic data for training machine learning models.
arXiv Detail & Related papers (2022-11-23T11:09:52Z)
- Differentially Private Multi-Party Data Release for Linear Regression [40.66319371232736]
Differentially Private (DP) data release is a promising technique to disseminate data without compromising the privacy of data subjects.
In this paper we focus on the multi-party setting, where different stakeholders own disjoint sets of attributes belonging to the same group of data subjects.
We propose our novel method and prove it converges to the optimal (non-private) solutions with increasing dataset size.
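One common pattern for private multi-party regression, sketched here for intuition, is for each party to release noise-perturbed sufficient statistics that an aggregator then combines. The noise scale below is a placeholder, not a calibrated DP parameter, and the code is not the paper's method:

```python
# Each party perturbs its regression sums with Gaussian noise before
# release; the aggregator fits y ~ w * x from the combined statistics.
import random

def noisy_stats(data, noise_scale):
    """Release (sum of x*x, sum of x*y) with additive Gaussian noise."""
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    return (sxx + random.gauss(0, noise_scale),
            sxy + random.gauss(0, noise_scale))

def aggregate_slope(released):
    """Combine the parties' released statistics and solve for the slope."""
    total_sxx = sum(s[0] for s in released)
    total_sxy = sum(s[1] for s in released)
    return total_sxy / total_sxx

random.seed(0)
parties = [
    [(float(x), 2.0 * x) for x in range(1, 50)],    # party A's data
    [(float(x), 2.0 * x) for x in range(50, 100)],  # party B's data
]
released = [noisy_stats(d, noise_scale=1.0) for d in parties]
w = aggregate_slope(released)
```

As the entry's convergence claim suggests, the relative effect of the noise shrinks as the datasets grow, so the estimate approaches the non-private solution.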
arXiv Detail & Related papers (2022-06-16T08:32:17Z) - DRFLM: Distributionally Robust Federated Learning with Inter-client
Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z) - Data-SUITE: Data-centric identification of in-distribution incongruous
examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- Towards Fair Federated Learning with Zero-Shot Data Augmentation [123.37082242750866]
Federated learning has emerged as an important distributed learning paradigm, where a server aggregates a global model from many client-trained models while having no access to the client data.
We propose a novel federated learning system that employs zero-shot data augmentation on under-represented data to mitigate statistical heterogeneity and encourage more uniform accuracy performance across clients in federated networks.
We study two variants of this scheme, Fed-ZDAC (federated learning with zero-shot data augmentation at the clients) and Fed-ZDAS (federated learning with zero-shot data augmentation at the server).
arXiv Detail & Related papers (2021-04-27T18:23:54Z)
- Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics [118.75207687144817]
We introduce Data Maps, a model-based tool to characterize and diagnose datasets.
We leverage a largely ignored source of information: the behavior of the model on individual instances during training.
Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.
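The per-instance training dynamics this entry leverages can be summarized in a few lines: record the model's probability for each instance's true label across epochs, then compute mean confidence and variability. A minimal sketch in the spirit of Data Maps, not the paper's code:

```python
# Summarize each training instance by its mean confidence and
# variability (std dev) across epochs. Easy-to-learn items are high and
# stable; ambiguous items vary; hard-to-learn items stay low.
from statistics import mean, pstdev

def data_map(prob_history):
    """prob_history[i] = per-epoch probabilities of instance i's true label."""
    return {
        i: {"confidence": mean(ps), "variability": pstdev(ps)}
        for i, ps in enumerate(prob_history)
    }

history = [
    [0.9, 0.92, 0.95, 0.97],  # easy-to-learn: high, stable confidence
    [0.2, 0.8, 0.3, 0.7],     # ambiguous: high variability
    [0.1, 0.12, 0.15, 0.1],   # hard-to-learn: consistently low
]
stats = data_map(history)
```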
arXiv Detail & Related papers (2020-09-22T20:19:41Z)
- Private data sharing between decentralized users through the privGAN architecture [1.3923892290096642]
We propose a method for data owners to share synthetic or fake versions of their data without sharing the actual data.
We demonstrate that this approach, when applied to subsets of various sizes, leads to better utility for the owners than the utility from their real datasets.
arXiv Detail & Related papers (2020-09-14T22:06:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.