Data Collaboration Analysis applied to Compound Datasets and the
Introduction of Projection data to Non-IID settings
- URL: http://arxiv.org/abs/2308.00280v1
- Date: Tue, 1 Aug 2023 04:37:08 GMT
- Title: Data Collaboration Analysis applied to Compound Datasets and the
Introduction of Projection data to Non-IID settings
- Authors: Akihiro Mizoguchi, Anna Bogdanova, Akira Imakura, and Tetsuya Sakurai
- Abstract summary: Federated learning has been applied to compound datasets to increase their prediction accuracy while safeguarding potentially proprietary information.
We apply an alternative distributed machine learning method, data collaboration analysis (DC), to chemical compound data from open sources, and propose an improved variant, DCPd, that uses projection data.
DCPd exhibited a negligible decline in classification accuracy in experiments with different degrees of label bias.
- Score: 6.037276428689637
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Given the time and expense associated with bringing a drug to market,
numerous studies have been conducted to predict the properties of compounds
based on their structure using machine learning. Federated learning has been
applied to compound datasets to increase their prediction accuracy while
safeguarding potentially proprietary information. However, federated learning
suffers from low accuracy in settings where data are not independently and
identically distributed (non-IID), i.e., when the data partitioning has a large
label bias, and is therefore considered unsuitable for compound datasets, which
tend to have large label bias. To address this limitation, we applied an
alternative distributed machine learning method, called data collaboration
analysis (DC), to chemical compound data from open sources. We also proposed
data collaboration analysis using projection data (DCPd), an improved method
that utilizes auxiliary PubChem data as projection data to raise the quality of
the individual user-side data transformations that produce the intermediate
representations. The classification accuracy, i.e., the area under the receiver
operating characteristic curve (ROC-AUC) and the area under the precision-recall
curve (PR-AUC), of federated averaging (FedAvg), DC, and DCPd was compared on
five compound datasets. We found that machine learning performance in non-IID
settings ranked, from best to worst, DCPd, DC, and FedAvg, although the three
methods performed almost identically in independent and identically distributed
(IID) settings. Moreover, compared with the other methods, DCPd exhibited only
a negligible decline in classification accuracy across experiments with
different degrees of label bias. Thus, DCPd can address the low performance in
non-IID settings, which is one of the challenges of federated learning.
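To make the method above concrete, here is a minimal sketch of the core data collaboration (DC) integration step: each party reduces its data with a private mapping, applies the same mapping to a shared anchor dataset, and an analyst aligns the parties' anchor images into a common collaboration space. The choice of SVD for the reduction, least squares for the alignment, and all names and data are illustrative assumptions; the authors' DCPd variant additionally builds the projection data from auxiliary PubChem compounds, which is not reproduced here.

```python
# Minimal sketch of the data collaboration (DC) integration step.
# Assumptions (not from the paper): SVD as each party's dimensionality
# reduction, least squares for the integration maps, synthetic data.
import numpy as np

rng = np.random.default_rng(0)
n_features, dim = 100, 10

# Shared anchor data, visible to every party (DCPd would instead use
# projection data derived from auxiliary PubChem compounds).
X_anchor = rng.standard_normal((50, n_features))

def party_transform(X_local, X_anchor, dim):
    """Each party applies its own (private) dimensionality reduction."""
    # The SVD of the local data defines the party-specific projection.
    _, _, Vt = np.linalg.svd(X_local, full_matrices=False)
    P = Vt[:dim].T                       # n_features x dim
    return X_local @ P, X_anchor @ P     # intermediate representations

# Three parties with differently distributed local data (non-IID).
locals_ = [rng.standard_normal((n, n_features)) + shift
           for n, shift in [(40, 0.0), (60, 0.5), (30, -0.5)]]
reps = [party_transform(X, X_anchor, dim) for X in locals_]

# The analyst sees only the intermediate representations. A common
# target space is taken from an SVD of the concatenated anchor images.
anchor_reps = np.hstack([a for _, a in reps])
U, _, _ = np.linalg.svd(anchor_reps, full_matrices=False)
Z = U[:, :dim]                           # common collaboration space

# Map each party's anchor image onto Z by least squares, then apply
# the same map to its local intermediate representation.
collab = []
for X_tilde, A_tilde in reps:
    G, *_ = np.linalg.lstsq(A_tilde, Z, rcond=None)
    collab.append(X_tilde @ G)
X_collab = np.vstack(collab)             # unified representation
print(X_collab.shape)                    # (130, 10)
```

In the full method, a model would then be trained on the unified representation together with the parties' labels, so raw features never leave a party.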
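The experiments vary the degree of label bias across parties and score models by ROC-AUC and PR-AUC. Below is a hedged sketch of one common way to simulate such label-biased (non-IID) partitions, Dirichlet label skew, together with the two metrics via scikit-learn; the partition protocol is an assumption, not necessarily the paper's.

```python
# Sketch: simulate label-biased (non-IID) client partitions and score a
# classifier with ROC-AUC and PR-AUC. The Dirichlet protocol is a common
# convention, not necessarily the one used in the paper.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def dirichlet_label_skew(labels, n_clients, alpha, rng):
    """Smaller alpha -> stronger label bias per client."""
    parts = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for part, chunk in zip(parts, np.split(idx, cuts)):
            part.extend(chunk.tolist())
    return [np.array(p) for p in parts]

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)           # binary activity labels
clients = dirichlet_label_skew(y, n_clients=4, alpha=0.1, rng=rng)
for i, part in enumerate(clients):          # show the induced label bias
    print("client", i, "label counts:", np.bincount(y[part], minlength=2))

# PR-AUC is approximated by average precision here, as is standard.
scores = rng.random(1000)                   # stand-in model outputs
print("ROC-AUC:", roc_auc_score(y, scores))
print("PR-AUC :", average_precision_score(y, scores))
```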
Related papers
- Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout [62.73150122809138]
Federated Learning (FL) is a promising distributed machine learning approach that enables collaborative training of a global model using multiple edge devices.
We propose the FedDHAD FL framework, which comes with two novel methods: Dynamic Heterogeneous model aggregation (FedDH) and Adaptive Dropout (FedAD).
The combination of these two methods makes FedDHAD significantly outperform state-of-the-art solutions in terms of accuracy (up to 6.7% higher), efficiency (up to 2.02 times faster), and cost (up to 15.0% smaller).
arXiv Detail & Related papers (2025-07-14T16:19:00Z)
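FedAvg, the baseline compared in the main paper and extended by works such as FedDHAD, aggregates client updates by a data-size-weighted average. A minimal sketch with flattened parameter vectors (all names and shapes are illustrative):

```python
# Minimal FedAvg aggregation step: weight each client's parameters by
# its local sample count. Flattened parameter vectors keep the sketch
# framework-agnostic.
import numpy as np

def fedavg(client_weights, client_sizes):
    sizes = np.asarray(client_sizes, dtype=float)
    W = np.stack(client_weights)                 # (n_clients, n_params)
    return (sizes / sizes.sum()) @ W             # weighted average

rng = np.random.default_rng(0)
clients = [rng.standard_normal(5) for _ in range(3)]
global_w = fedavg(clients, client_sizes=[40, 60, 30])
print(global_w)
```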
- Robust Molecular Property Prediction via Densifying Scarce Labeled Data [51.55434084913129]
In drug discovery, compounds most critical for advancing research often lie beyond the training set.
We propose a novel meta-learning-based approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data.
We demonstrate significant performance gains on challenging real-world datasets.
arXiv Detail & Related papers (2025-06-13T15:27:40Z)
- FedDW: Distilling Weights through Consistency Optimization in Heterogeneous Federated Learning [14.477559543490242]
Federated Learning (FL) is an innovative distributed machine learning paradigm that enables neural network training across devices without centralizing data.
Previous research shows that in IID environments, the parameter structure of the model is expected to adhere to certain specific consistency principles.
This paper identifies the consistency between the two and leverages it to regulate training, underpinning our proposed FedDW framework.
Experimental results show FedDW outperforms 10 state-of-the-art FL methods, improving accuracy by an average of 3% in highly heterogeneous settings.
arXiv Detail & Related papers (2024-12-05T12:32:40Z)
- Non-IID data in Federated Learning: A Systematic Review with Taxonomy, Metrics, Methods, Frameworks and Future Directions [2.9434966603161072]
This systematic review aims to fill a gap by providing a detailed taxonomy for non-IID data, partition protocols, and metrics.
We describe popular solutions to address non-IID data and standardized frameworks employed in Federated Learning with heterogeneous data.
arXiv Detail & Related papers (2024-11-19T09:53:28Z)
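Among the metrics such a review catalogues, one of the simplest quantifies label distribution skew. A hedged sketch of one illustrative choice (average total-variation distance between each client's label distribution and the global one; not a metric attributed to this particular review):

```python
# Sketch of one simple non-IID-ness metric: the average total-variation
# distance between each client's label distribution and the global one
# (0 = IID, values near 1 = extreme label skew). Illustrative only; the
# review surveys many such metrics.
import numpy as np

def label_skew_tv(client_labels, n_classes):
    all_labels = np.concatenate(client_labels)
    global_p = np.bincount(all_labels, minlength=n_classes) / len(all_labels)
    dists = []
    for y in client_labels:
        p = np.bincount(y, minlength=n_classes) / len(y)
        dists.append(0.5 * np.abs(p - global_p).sum())
    return float(np.mean(dists))

iid    = [np.array([0, 1] * 50), np.array([0, 1] * 50)]
skewed = [np.zeros(100, dtype=int), np.ones(100, dtype=int)]
print(label_skew_tv(iid, 2))      # ~0.0
print(label_skew_tv(skewed, 2))   # 0.5 -> strong skew
```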
- Dataset Distillation-based Hybrid Federated Learning on Non-IID Data [19.01147151081893]
We propose a hybrid federated learning framework called HFLDD, which integrates dataset distillation to generate independent and identically distributed (IID) data.
We partition the clients into heterogeneous clusters, where the data labels among different clients within a cluster are unbalanced.
This training process resembles traditional federated learning on IID data and hence effectively alleviates the impact of non-IID data on model training.
arXiv Detail & Related papers (2024-09-26T03:52:41Z)
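Dataset distillation, the building block HFLDD relies on, is often formulated as gradient matching: synthetic samples are optimized so that a model's loss gradient on them matches its gradient on the real data. A minimal PyTorch sketch of that generic formulation (HFLDD's exact procedure may differ; all sizes and names are illustrative):

```python
# Sketch of dataset distillation by gradient matching (one common
# formulation; HFLDD's exact procedure may differ). Synthetic samples
# are optimized so that a model's loss gradient on them matches the
# gradient on the real data.
import torch
import torch.nn as nn

def distill(real_x, real_y, n_classes, n_per_class=1, steps=100, lr=0.1):
    d = real_x.shape[1]
    syn_x = torch.randn(n_classes * n_per_class, d, requires_grad=True)
    syn_y = torch.arange(n_classes).repeat_interleave(n_per_class)
    opt = torch.optim.SGD([syn_x], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        model = nn.Linear(d, n_classes)      # fresh random model each step
        g_real = torch.autograd.grad(
            loss_fn(model(real_x), real_y), list(model.parameters()))
        g_syn = torch.autograd.grad(
            loss_fn(model(syn_x), syn_y), list(model.parameters()),
            create_graph=True)
        match = sum(((a - b.detach()) ** 2).sum()
                    for a, b in zip(g_syn, g_real))
        opt.zero_grad()
        match.backward()
        opt.step()
    return syn_x.detach(), syn_y

real_x = torch.randn(200, 8)
real_y = torch.randint(0, 2, (200,))
syn_x, syn_y = distill(real_x, real_y, n_classes=2)
print(syn_x.shape, syn_y.tolist())           # torch.Size([2, 8]) [0, 1]
```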
- DAGnosis: Localized Identification of Data Inconsistencies using Structures [73.39285449012255]
Identification and appropriate handling of inconsistencies in data at deployment time is crucial to reliably use machine learning models.
We use directed acyclic graphs (DAGs) to encode the training set's feature probability distributions and independencies as a structure.
Our method, called DAGnosis, leverages these structural interactions to bring valuable and insightful data-centric conclusions.
arXiv Detail & Related papers (2024-02-26T11:29:16Z)
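A toy illustration of the DAG-based idea, under strong simplifying assumptions (linear relations, a known DAG; this is not the authors' implementation): fit each feature from its DAG parents on training data, then flag deployment samples whose residual at some node is anomalously large, localizing the inconsistency to that feature.

```python
# Toy illustration of DAG-based inconsistency localization (inspired by
# the DAGnosis idea; not the authors' implementation). Each feature is
# predicted from its DAG parents on training data; a deployment sample
# is flagged at the features whose residuals are anomalously large.
import numpy as np

def fit_dag_checks(X_train, parents):
    checks = {}
    for node, pa in parents.items():
        if not pa:
            continue
        A = X_train[:, pa]
        coef, *_ = np.linalg.lstsq(A, X_train[:, node], rcond=None)
        resid = X_train[:, node] - A @ coef
        checks[node] = (coef, resid.std() + 1e-12)
    return checks

def localize(x, checks, parents, k=4.0):
    flags = []
    for node, (coef, sigma) in checks.items():
        r = x[node] - x[parents[node]] @ coef
        if abs(r) > k * sigma:
            flags.append(node)               # inconsistency at this feature
    return flags

rng = np.random.default_rng(0)
x0 = rng.standard_normal(500)
x1 = 2.0 * x0 + 0.1 * rng.standard_normal(500)
X = np.column_stack([x0, x1])
parents = {0: [], 1: [0]}                    # DAG: x0 -> x1
checks = fit_dag_checks(X, parents)
sample = np.array([1.0, -5.0])               # violates x1 ~ 2*x0
print(localize(sample, checks, parents))     # [1]
```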
- Importance-Aware Adaptive Dataset Distillation [53.79746115426363]
Development of deep learning models is enabled by the availability of large-scale datasets.
Dataset distillation aims to synthesize a compact dataset that retains the essential information from the large original dataset.
We propose an importance-aware adaptive dataset distillation (IADD) method that can improve distillation performance.
arXiv Detail & Related papers (2024-01-29T03:29:39Z)
- CADIS: Handling Cluster-skewed Non-IID Data in Federated Learning with Clustered Aggregation and Knowledge DIStilled Regularization [3.3711670942444014]
Federated learning enables edge devices to train a global model collaboratively without exposing their data.
We tackle a new type of non-IID data, called cluster-skewed non-IID data, discovered in real-world datasets.
We propose an aggregation scheme that guarantees equality between clusters.
arXiv Detail & Related papers (2023-02-21T02:53:37Z)
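One plausible reading of "equality between clusters" in aggregation, as a hedged sketch (not CADIS itself): average client updates within each cluster first, then give every cluster equal weight so that large clusters cannot dominate the global model.

```python
# Sketch of cluster-equal aggregation in the spirit of CADIS (not the
# authors' scheme): size-weighted averaging within each cluster, then
# an equal-weight average across clusters.
import numpy as np

def cluster_equal_aggregate(client_weights, client_sizes, cluster_ids):
    W = np.stack(client_weights)
    sizes = np.asarray(client_sizes, dtype=float)
    ids = np.asarray(cluster_ids)
    cluster_means = []
    for c in np.unique(ids):
        mask = ids == c
        w = sizes[mask] / sizes[mask].sum()       # size-weighted in-cluster
        cluster_means.append(w @ W[mask])
    return np.mean(cluster_means, axis=0)          # equal across clusters

rng = np.random.default_rng(0)
clients = [rng.standard_normal(4) for _ in range(5)]
agg = cluster_equal_aggregate(clients, [10, 90, 50, 50, 200],
                              cluster_ids=[0, 0, 1, 1, 2])
print(agg)
```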
- Rethinking Data Heterogeneity in Federated Learning: Introducing a New Notion and Standard Benchmarks [65.34113135080105]
We show that data heterogeneity in current setups is not necessarily a problem; in fact, it can be beneficial for the FL participants.
Our observations are intuitive.
Our code is available at https://github.com/MMorafah/FL-SC-NIID.
arXiv Detail & Related papers (2022-09-30T17:15:19Z)
- Data-SUITE: Data-centric identification of in-distribution incongruous examples [81.21462458089142]
Data-SUITE is a data-centric framework to identify incongruous regions of in-distribution (ID) data.
We empirically validate Data-SUITE's performance and coverage guarantees.
arXiv Detail & Related papers (2022-02-17T18:58:31Z)
- An Experimental Study of Data Heterogeneity in Federated Learning Methods for Medical Imaging [8.984706828657814]
Federated learning enables multiple institutions to collaboratively train machine learning models on their local data in a privacy-preserving way.
We investigate the deleterious impact of a taxonomy of data heterogeneity regimes on federated learning methods, including quantity skew, label distribution skew, and imaging acquisition skew.
We present several mitigation strategies to overcome the performance drops caused by data heterogeneity, including weighted averaging for data quantity skew, and weighted loss and batch normalization averaging for label distribution skew.
arXiv Detail & Related papers (2021-07-18T05:47:48Z)
- ORDisCo: Effective and Efficient Usage of Incremental Unlabeled Data for Semi-supervised Continual Learning [52.831894583501395]
Continual learning usually assumes the incoming data are fully labeled, which might not hold in real applications.
We propose deep Online Replay with Discriminator Consistency (ORDisCo) to interdependently learn a classifier with a conditional generative adversarial network (GAN).
We show ORDisCo achieves significant performance improvement on various semi-supervised learning benchmark datasets for SSCL.
arXiv Detail & Related papers (2021-01-02T09:04:14Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not take into consideration possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
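As background for this last entry, a minimal sketch of spectral subspace learning (plain PCA via an eigendecomposition), plus one simple way to account for known per-sample measurement-noise covariances: subtracting their average from the data covariance, since Cov(x + eps) = Cov(x) + E[Sigma_eps]. The paper's uncertainty handling may be entirely different; this is only illustrative.

```python
# Sketch of spectral subspace learning (plain PCA), with a simple
# uncertainty-aware tweak: subtracting the average per-sample noise
# covariance removes the expected noise contribution from the data
# covariance. Illustrative only; the paper's formulation may differ.
import numpy as np

def pca_subspace(X, dim, noise_covs=None):
    Xc = X - X.mean(axis=0)
    C = Xc.T @ Xc / len(X)
    if noise_covs is not None:                 # account for uncertainty
        C = C - np.mean(noise_covs, axis=0)    # remove expected noise part
    evals, evecs = np.linalg.eigh(C)
    return evecs[:, np.argsort(evals)[::-1][:dim]]   # top-dim subspace

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
noise = np.stack([0.01 * np.eye(6) for _ in range(200)])
P = pca_subspace(X, dim=2, noise_covs=noise)
print(P.shape)                                  # (6, 2)
```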
This list is automatically generated from the titles and abstracts of the papers on this site.