Scalable and Sparsity-Aware Privacy-Preserving K-means Clustering with
Application to Fraud Detection
- URL: http://arxiv.org/abs/2208.06093v1
- Date: Fri, 12 Aug 2022 02:58:26 GMT
- Title: Scalable and Sparsity-Aware Privacy-Preserving K-means Clustering with
Application to Fraud Detection
- Authors: Yingting Liu, Chaochao Chen, Jamie Cui, Li Wang, Lei Wang
- Abstract summary: We propose a new framework for efficient sparsity-aware K-means with three characteristics.
First, our framework is divided into a data-independent offline phase and a much faster online phase.
Second, we apply vectorization techniques in both the online and offline phases.
Third, we adopt sparse matrix multiplication in the data-sparsity scenario to further improve efficiency.
- Score: 12.076075765740502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: K-means is one of the most widely used clustering models in practice. Due to
the problem of data isolation and the requirement for high model performance,
how to jointly build practical and secure K-means for multiple parties has
become an important topic for many industrial applications. Existing work on
this is mainly of two types. The first type is efficient, but its information
leakage raises potential privacy risks. The second type is provably secure but
inefficient, and cannot handle the large-scale sparse-data scenario. In this
paper, we propose a new framework for efficient
sparsity-aware K-means with three characteristics. First, our framework is
divided into a data-independent offline phase and a much faster online phase,
and the offline phase allows pre-computing almost all cryptographic
operations. Second, we apply vectorization techniques in both the online and
offline phases. Third, we adopt sparse matrix multiplication in the
data-sparsity scenario to further improve efficiency. We conduct
comprehensive experiments on three synthetic datasets and deploy our model in a
real-world fraud detection task. Our experimental results show that, compared
with the state-of-the-art solution, our model achieves competitive performance
in terms of both running time and communication size, especially on sparse
datasets.
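Since the framework's cost centers on matrix products over the data, the following plaintext sketch shows where sparsity-aware multiplication and vectorization enter a K-means iteration. It is an illustration under assumptions (plain NumPy/SciPy, no cryptography), not the secure protocol: in the actual framework these products would run on secret-shared inputs, with the cryptographic material generated in the data-independent offline phase.

```python
# Plaintext sketch only: in the secure protocol these products run on
# secret-shared data, and their cryptographic material (e.g., multiplication
# triples) can be generated in the data-independent offline phase.
import numpy as np
import scipy.sparse as sp

def assign_clusters(X, C):
    """X: (n, d) sparse samples; C: (k, d) dense centroids.
    Uses ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2, so the dominant cost is one
    sparse-dense product that scales with the non-zeros, not with n*d."""
    cross = X @ C.T                        # (n, k) sparse-dense product
    c_norms = (C ** 2).sum(axis=1)         # (k,)
    # ||x||^2 is the same for every cluster, so it cannot change the argmin.
    return np.argmin(c_norms - 2 * cross, axis=1)

def update_centroids(X, labels, k):
    """Vectorized update: one sparse product with a (k, n) indicator matrix."""
    n = X.shape[0]
    G = sp.csr_matrix((np.ones(n), (labels, np.arange(n))), shape=(k, n))
    counts = np.maximum(np.asarray(G.sum(axis=1)).ravel(), 1.0)
    return (G @ X).toarray() / counts[:, None]

rng = np.random.default_rng(0)
X = sp.random(1000, 500, density=0.01, format="csr", random_state=0)
C = rng.standard_normal((8, 500))
for _ in range(10):
    C = update_centroids(X, assign_clusters(X, C), k=8)
```

In the assignment step, ||x||^2 is dropped because it is constant across clusters; the remaining cost is dominated by the sparse-dense product X @ C.T, which is exactly where data sparsity pays off.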
Related papers
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that separates the learning of local inductive bias from that of long-range dependencies.
By adopting this decoupled learning scheme and fully exploiting the complementarity across features, our method achieves both high efficiency and accuracy.
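For context, linear attention replaces quadratic softmax attention with a kernelized form; the sketch below shows the generic identity such mechanisms build on, not the paper's decoupled dual-interactive design, and the feature map is one commonly assumed choice (elu + 1).

```python
# Generic linear attention, not the CARE mechanism itself: replacing
# softmax(Q K^T) V with phi(Q) (phi(K)^T V) makes the cost linear in sequence
# length, because phi(K)^T V is computed once instead of an (n, n) score map.
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Q, K: (n, d); V: (n, d_v). phi(x) = elu(x) + 1 keeps features positive."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                   # (d, d_v), formed in a single pass
    Z = Qp @ Kp.sum(axis=0) + eps   # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]
```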
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
Revisiting Cascaded Ensembles for Efficient Inference [32.914852531806]
A common approach to make machine learning inference more efficient is to use example-specific adaptive schemes.
In this work we study a simple scheme for adaptive inference.
We build a cascade of ensembles (CoE), beginning with resource-efficient models and growing to larger, more expressive models.
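A minimal sketch of such an early-exit cascade, under assumed interfaces (each model exposes a scikit-learn-style predict_proba); the confidence rule and the threshold are illustrative, not the paper's:

```python
import numpy as np

def cascade_predict(x, stages, threshold=0.9):
    """stages: list of ensembles ordered cheap-to-expensive; every model
    returns class probabilities via predict_proba(x) -> (n_classes,)."""
    for ensemble in stages[:-1]:
        probs = np.mean([m.predict_proba(x) for m in ensemble], axis=0)
        if probs.max() >= threshold:    # confident enough: exit early
            return int(probs.argmax())
    # fall through to the largest, most expressive ensemble
    probs = np.mean([m.predict_proba(x) for m in stages[-1]], axis=0)
    return int(probs.argmax())
```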
arXiv Detail & Related papers (2024-07-02T15:14:12Z)
REP: Resource-Efficient Prompting for On-device Continual Learning [23.92661395403251]
On-device continual learning (CL) requires the co-optimization of model accuracy and resource efficiency to be practical.
It is commonly believed that CNN-based CL excels in resource efficiency, whereas ViT-based CL is superior in model performance.
We introduce REP, which improves resource efficiency specifically targeting prompt-based rehearsal-free methods.
arXiv Detail & Related papers (2024-06-07T09:17:33Z)
Empowering HWNs with Efficient Data Labeling: A Clustered Federated Semi-Supervised Learning Approach [2.046985601687158]
Clustered Federated Multitask Learning (CFL) has gained considerable attention as an effective strategy for overcoming statistical challenges.
We introduce a novel framework, Clustered Federated Semi-Supervised Learning (CFSL), designed for more realistic HWN scenarios.
Our results demonstrate that CFSL significantly improves upon key metrics such as testing accuracy, labeling accuracy, and labeling latency under varying proportions of labeled and unlabeled data.
arXiv Detail & Related papers (2024-01-19T11:47:49Z)
Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods while using far fewer computational resources.
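As a toy illustration of the distribution-matching idea, with an identity feature map and first-moment matching only (the method itself uses learned embeddings and richer statistics):

```python
# Optimize a tiny synthetic set S so its feature statistics match the real
# data's; here "features" are the raw inputs and only the mean is matched.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=2.0, size=(5000, 16))   # stand-in for a real dataset
S = rng.normal(size=(10, 16))                 # condensed synthetic set
target = real.mean(axis=0)
for _ in range(200):
    grad = 2.0 * (S.mean(axis=0) - target) / len(S)  # d/dS_i ||mean(S) - target||^2
    S -= 0.5 * grad                                  # same gradient for every row
print(np.abs(S.mean(axis=0) - target).max())         # approaches 0
```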
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
SVNet: Where SO(3) Equivariance Meets Binarization on Point Cloud Representation [65.4396959244269]
The paper tackles the challenge by designing a general framework to construct 3D learning architectures.
The proposed approach can be applied to general backbones like PointNet and DGCNN.
Experiments on ModelNet40, ShapeNet, and the real-world dataset ScanObjectNN demonstrate that the method achieves a good trade-off between efficiency, rotation robustness, and accuracy.
arXiv Detail & Related papers (2022-09-13T12:12:19Z)
Open-Set Semi-Supervised Learning for 3D Point Cloud Understanding [62.17020485045456]
It is commonly assumed in semi-supervised learning (SSL) that the unlabeled data are drawn from the same distribution as that of the labeled ones.
We propose to selectively utilize unlabeled data through sample weighting, so that only conducive unlabeled data is prioritized.
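A hedged sketch of the sample-weighting idea; the sigmoid weighting below is a placeholder, not the paper's scheme:

```python
import numpy as np

def weighted_unsup_loss(per_example_loss, ood_scores, temperature=0.1):
    """per_example_loss, ood_scores: (n,) arrays; a higher score means the
    example is more likely out-of-distribution and should contribute less."""
    z = np.clip((ood_scores - np.median(ood_scores)) / temperature, -50, 50)
    w = 1.0 / (1.0 + np.exp(z))          # down-weight likely-OOD examples
    return float(np.sum(w * per_example_loss) / (np.sum(w) + 1e-8))
```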
arXiv Detail & Related papers (2022-05-02T16:09:17Z)
DANCE: DAta-Network Co-optimization for Efficient Segmentation Model Training and Inference [85.02494022662505]
DANCE is an automated simultaneous data-network co-optimization for efficient segmentation model training and inference.
It integrates automated data slimming, which adaptively downsamples or drops input images and, guided by each image's spatial complexity, controls its contribution to the training loss.
Experiments and ablation studies demonstrate that DANCE achieves an "all-win" towards efficient segmentation.
arXiv Detail & Related papers (2021-07-16T04:58:58Z)
Bandit Data-Driven Optimization [62.01362535014316]
There are four major pain points that a machine learning pipeline must overcome in order to be useful in practical settings.
We introduce bandit data-driven optimization, the first iterative prediction-prescription framework to address these pain points.
We propose PROOF, a novel algorithm for this framework, and formally prove that it is no-regret.
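To make the no-regret claim concrete, here is the classic exponential-weights (Hedge) learner, a standard no-regret algorithm over a finite decision set; it is a generic illustration, not the PROOF algorithm:

```python
import numpy as np

def hedge(loss_rounds, eta=0.1, seed=0):
    """loss_rounds: (T, k) losses in [0, 1]; returns one decision per round.
    Expected regret against the best fixed decision is O(sqrt(T log k))."""
    rng = np.random.default_rng(seed)
    w = np.ones(loss_rounds.shape[1])
    picks = []
    for losses in loss_rounds:
        p = w / w.sum()
        picks.append(int(rng.choice(len(w), p=p)))
        w *= np.exp(-eta * losses)       # multiplicative-weights update
    return picks
```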
arXiv Detail & Related papers (2020-08-26T17:50:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.