Another Use of SMOTE for Interpretable Data Collaboration Analysis
- URL: http://arxiv.org/abs/2208.12458v1
- Date: Fri, 26 Aug 2022 06:39:13 GMT
- Title: Another Use of SMOTE for Interpretable Data Collaboration Analysis
- Authors: Akira Imakura, Masateru Kihira, Yukihiko Okada, Tetsuya Sakurai
- Abstract summary: Data collaboration (DC) analysis has been developed for privacy-preserving integrated analysis across multiple institutions.
This study proposes an anchor data construction technique to improve the recognition performance without increasing the risk of data leakage.
- Score: 8.143750358586072
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, data collaboration (DC) analysis has been developed for
privacy-preserving integrated analysis across multiple institutions. DC
analysis centralizes individually constructed dimensionality-reduced
intermediate representations and realizes integrated analysis via collaboration
representations without sharing the original data. To construct the
collaboration representations, each institution generates and shares a
shareable anchor dataset and centralizes its intermediate representation.
Although a random anchor dataset functions well for DC analysis in general,
using an anchor dataset whose distribution is close to that of the raw dataset
is expected to improve the recognition performance, particularly for the
interpretable DC analysis. Based on an extension of the synthetic minority
over-sampling technique (SMOTE), this study proposes an anchor data
construction technique to improve the recognition performance without
increasing the risk of data leakage. Numerical results demonstrate the
efficiency of the proposed SMOTE-based method over the existing anchor data
constructions for artificial and real-world datasets. Specifically, the
proposed method achieves 9 percentage point and 38 percentage point performance
improvements regarding accuracy and essential feature selection, respectively,
over existing methods for an income dataset. The proposed method provides
another use of SMOTE not for imbalanced data classifications but for a key
technology of privacy-preserving integrated analysis.
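As a rough illustration of the idea in the abstract, the sketch below generates synthetic anchor points by plain SMOTE-style interpolation between each sample and one of its nearest neighbors, so that the anchor distribution stays close to that of the raw data. This is a minimal sketch only: the function name `smote_like_anchor` and the parameters `n_anchor`, `k`, and `seed` are hypothetical, and the paper's own extension of SMOTE (including how it limits leakage risk) is not reproduced here.

```python
import numpy as np


def smote_like_anchor(X, n_anchor, k=5, seed=0):
    """Illustrative SMOTE-style interpolation; NOT the paper's exact method.

    X        : array of shape (n_samples, n_features), n_samples >= 2
    n_anchor : number of synthetic anchor points to generate
    k        : number of nearest neighbors considered per sample
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    k = min(k, n - 1)

    # Pairwise squared Euclidean distances; a point is never its own neighbor.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    np.fill_diagonal(d2, np.inf)
    neighbors = np.argsort(d2, axis=1)[:, :k]

    # Pick a base sample and one of its k nearest neighbors for each anchor.
    base = rng.integers(0, n, size=n_anchor)
    picked = neighbors[base, rng.integers(0, k, size=n_anchor)]

    # Interpolate at a random position on the segment between the two points.
    lam = rng.random((n_anchor, 1))
    return X[base] + lam * (X[picked] - X[base])


if __name__ == "__main__":
    raw = np.random.default_rng(1).normal(size=(20, 3))
    anchor = smote_like_anchor(raw, n_anchor=50, k=3)
    print(anchor.shape)
```

Because every anchor point is a convex combination of two raw samples, it always lies within the per-feature range of the raw data. A practical privacy-preserving variant would also need to keep anchors from landing too close to any raw sample, which this sketch does not attempt.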
Related papers
- Sparse outlier-robust PCA for multi-source data [2.3226893628361687]
We introduce a novel PCA methodology that simultaneously selects important features as well as local source-specific patterns.
We develop a regularization problem with a penalty that accommodates global-local structured sparsity patterns.
We provide an efficient implementation of our proposal via the Alternating Direction Method of Multipliers.
arXiv Detail & Related papers (2024-07-23T08:55:03Z) - Cross-feature Contrastive Loss for Decentralized Deep Learning on
Heterogeneous Data [8.946847190099206]
We present a novel approach for decentralized learning on heterogeneous data.
Cross-features for a pair of neighboring agents are the features obtained from the data of an agent with respect to the model parameters of the other agent.
Our experiments show that the proposed method achieves superior performance (0.2-4% improvement in test accuracy) compared to other existing techniques for decentralized learning on heterogeneous data.
arXiv Detail & Related papers (2023-10-24T14:48:23Z) - Source-Free Collaborative Domain Adaptation via Multi-Perspective
Feature Enrichment for Functional MRI Analysis [55.03872260158717]
Resting-state functional MRI (rs-fMRI) is increasingly employed in multi-site research to aid neurological disorder analysis.
Many methods have been proposed to reduce fMRI heterogeneity between source and target domains.
But acquiring source data is challenging due to privacy concerns and/or data storage burdens in multi-site studies.
We design a source-free collaborative domain adaptation framework for fMRI analysis, where only a pretrained source model and unlabeled target data are accessible.
arXiv Detail & Related papers (2023-08-24T01:30:18Z) - Towards High-Performance Exploratory Data Analysis (EDA) Via Stable
Equilibrium Point [5.825190876052149]
We introduce a stable equilibrium point (SEP) - based framework for improving the efficiency and solution quality of EDA.
A unique property of the proposed method is that the SEPs directly encode the clustering properties of data sets.
arXiv Detail & Related papers (2023-06-07T13:31:57Z) - Non-readily identifiable data collaboration analysis for multiple
datasets including personal information [7.315551060433141]
Data confidentiality and cross-institutional communication are critical for medical datasets.
In this study, the identifiability of the data collaboration analysis is investigated.
The proposed method exhibits non-ready identifiability while maintaining high recognition performance.
arXiv Detail & Related papers (2022-08-31T03:19:17Z) - Domain Adaptation Principal Component Analysis: base linear method for
learning with out-of-distribution data [55.41644538483948]
Domain adaptation is a popular paradigm in modern machine learning.
We present a method called Domain Adaptation Principal Component Analysis (DAPCA).
DAPCA finds a linear reduced data representation useful for solving the domain adaptation task.
arXiv Detail & Related papers (2022-08-28T21:10:56Z) - DRFLM: Distributionally Robust Federated Learning with Inter-client
Noise via Local Mixup [58.894901088797376]
Federated learning has emerged as a promising approach for training a global model using data from multiple organizations without leaking their raw data.
We propose a general framework to solve the above two challenges simultaneously.
We provide comprehensive theoretical analysis including robustness analysis, convergence analysis, and generalization ability.
arXiv Detail & Related papers (2022-04-16T08:08:29Z) - CAFE: Learning to Condense Dataset by Aligning Features [72.99394941348757]
We propose a novel scheme to Condense dataset by Aligning FEatures (CAFE).
At the heart of our approach is an effective strategy to align features from the real and synthetic data across various scales.
We validate the proposed CAFE across various datasets, and demonstrate that it generally outperforms the state of the art.
arXiv Detail & Related papers (2022-03-03T05:58:49Z) - Privacy-preserving Logistic Regression with Secret Sharing [0.0]
We propose secret sharing-based privacy-preserving logistic regression protocols using the Newton-Raphson method.
Our implementation results show that our improved method can handle large datasets used in securely training a logistic regression from multiple sources.
arXiv Detail & Related papers (2021-05-14T14:53:50Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z) - Causal Feature Selection for Algorithmic Fairness [61.767399505764736]
We consider fairness in the integration component of data management.
We propose an approach to identify a sub-collection of features that ensure the fairness of the dataset.
arXiv Detail & Related papers (2020-06-10T20:20:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.