Scalable and Sparsity-Aware Privacy-Preserving K-means Clustering with
Application to Fraud Detection
- URL: http://arxiv.org/abs/2208.06093v1
- Date: Fri, 12 Aug 2022 02:58:26 GMT
- Title: Scalable and Sparsity-Aware Privacy-Preserving K-means Clustering with
Application to Fraud Detection
- Authors: Yingting Liu, Chaochao Chen, Jamie Cui, Li Wang, Lei Wang
- Abstract summary: We propose a new framework for efficient sparsity-aware K-means with three characteristics.
First, our framework is divided into a data-independent offline phase and a much faster online phase.
Second, we apply vectorization techniques in both the online and offline phases.
Third, we adopt sparse matrix multiplication in the data-sparsity scenario to further improve efficiency.
- Score: 12.076075765740502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: K-means is one of the most widely used clustering models in practice. Due to
the problem of data isolation and the requirement for high model performance,
how to jointly build practical and secure K-means for multiple parties has
become an important topic for many industrial applications. Existing work on
this is mainly of two types. The first type is efficient, but its information
leakage raises potential privacy risks. The second type is provably secure but
inefficient, and cannot handle the large-scale sparse-data scenario. In this
paper, we propose a new framework for efficient
sparsity-aware K-means with three characteristics. First, our framework is
divided into a data-independent offline phase and a much faster online phase,
and the offline phase allows pre-computing almost all cryptographic
operations. Second, we apply vectorization techniques in both the online and
offline phases. Third, we adopt sparse matrix multiplication in the
data-sparsity scenario to further improve efficiency. We conduct
comprehensive experiments on three synthetic datasets and deploy our model in a
real-world fraud detection task. Our experimental results show that, compared
with the state-of-the-art solution, our model achieves competitive performance
in terms of both running time and communication size, especially on sparse
datasets.
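Since the framework's cost centers on matrix products over the data, the following plaintext sketch shows where sparsity-aware multiplication and vectorization enter a K-means iteration. It is an illustration under assumptions (plain NumPy/SciPy, no cryptography), not the secure protocol: in the actual framework these products would run on secret-shared inputs, with the cryptographic material generated in the data-independent offline phase.

```python
# Plaintext sketch only: in the secure protocol these products run on
# secret-shared data, and their cryptographic material (e.g., multiplication
# triples) can be generated in the data-independent offline phase.
import numpy as np
import scipy.sparse as sp

def assign_clusters(X, C):
    """X: (n, d) sparse samples; C: (k, d) dense centroids.
    Uses ||x - c||^2 = ||x||^2 - 2*x.c + ||c||^2, so the dominant cost is one
    sparse-dense product that scales with the non-zeros, not with n*d."""
    cross = X @ C.T                        # (n, k) sparse-dense product
    c_norms = (C ** 2).sum(axis=1)         # (k,)
    # ||x||^2 is the same for every cluster, so it cannot change the argmin.
    return np.argmin(c_norms - 2 * cross, axis=1)

def update_centroids(X, labels, k):
    """Vectorized update: one sparse product with a (k, n) indicator matrix."""
    n = X.shape[0]
    G = sp.csr_matrix((np.ones(n), (labels, np.arange(n))), shape=(k, n))
    counts = np.maximum(np.asarray(G.sum(axis=1)).ravel(), 1.0)
    return (G @ X).toarray() / counts[:, None]

rng = np.random.default_rng(0)
X = sp.random(1000, 500, density=0.01, format="csr", random_state=0)
C = rng.standard_normal((8, 500))
for _ in range(10):
    C = update_centroids(X, assign_clusters(X, C), k=8)
```

In the assignment step, ||x||^2 is dropped because it is constant across clusters; the remaining cost is dominated by the sparse-dense product X @ C.T, which is exactly where data sparsity pays off.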
Related papers
- CARE Transformer: Mobile-Friendly Linear Visual Transformer via Decoupled Dual Interaction [77.8576094863446]
We propose a new deCoupled duAl-interactive lineaR attEntion (CARE) mechanism.
We first propose an asymmetrical feature decoupling strategy that separates the learning of local inductive bias from that of long-range dependencies.
By adopting this decoupled learning scheme and fully exploiting the complementarity across features, our method achieves both high efficiency and accuracy.
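For context, linear attention replaces quadratic softmax attention with a kernelized form; the sketch below shows the generic identity such mechanisms build on, not the paper's decoupled dual-interactive design, and the feature map is one commonly assumed choice (elu + 1).

```python
# Generic linear attention, not the CARE mechanism itself: replacing
# softmax(Q K^T) V with phi(Q) (phi(K)^T V) makes the cost linear in sequence
# length, because phi(K)^T V is computed once instead of an (n, n) score map.
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Q, K: (n, d); V: (n, d_v). phi(x) = elu(x) + 1 keeps features positive."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qp, Kp = phi(Q), phi(K)
    KV = Kp.T @ V                   # (d, d_v), formed in a single pass
    Z = Qp @ Kp.sum(axis=0) + eps   # (n,) per-query normalizer
    return (Qp @ KV) / Z[:, None]
```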
arXiv Detail & Related papers (2024-11-25T07:56:13Z)
Revisiting Cascaded Ensembles for Efficient Inference [32.914852531806]
A common approach to make machine learning inference more efficient is to use example-specific adaptive schemes.
In this work we study a simple scheme for adaptive inference.
We build a cascade of ensembles (CoE), beginning with resource-efficient models and growing to larger, more expressive models.
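A minimal sketch of such an early-exit cascade, under assumed interfaces (each model exposes a scikit-learn-style predict_proba); the confidence rule and the threshold are illustrative, not the paper's:

```python
import numpy as np

def cascade_predict(x, stages, threshold=0.9):
    """stages: list of ensembles ordered cheap-to-expensive; every model
    returns class probabilities via predict_proba(x) -> (n_classes,)."""
    for ensemble in stages[:-1]:
        probs = np.mean([m.predict_proba(x) for m in ensemble], axis=0)
        if probs.max() >= threshold:    # confident enough: exit early
            return int(probs.argmax())
    # fall through to the largest, most expressive ensemble
    probs = np.mean([m.predict_proba(x) for m in stages[-1]], axis=0)
    return int(probs.argmax())
```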
arXiv Detail & Related papers (2024-07-02T15:14:12Z)
REP: Resource-Efficient Prompting for On-device Continual Learning [23.92661395403251]
On-device continual learning (CL) requires the co-optimization of model accuracy and resource efficiency to be practical.
It is commonly believed that CNN-based CL excels in resource efficiency, whereas ViT-based CL is superior in model performance.
We introduce REP, which improves resource efficiency specifically targeting prompt-based rehearsal-free methods.
arXiv Detail & Related papers (2024-06-07T09:17:33Z)
Empowering HWNs with Efficient Data Labeling: A Clustered Federated Semi-Supervised Learning Approach [2.046985601687158]
Clustered Federated Multitask Learning (CFL) has gained considerable attention as an effective strategy for overcoming statistical challenges.
We introduce a novel framework, Clustered Federated Semi-Supervised Learning (CFSL), designed for more realistic HWN scenarios.
Our results demonstrate that CFSL significantly improves upon key metrics such as testing accuracy, labeling accuracy, and labeling latency under varying proportions of labeled and unlabeled data.
arXiv Detail & Related papers (2024-01-19T11:47:49Z)
Improved Distribution Matching for Dataset Condensation [91.55972945798531]
We propose a novel dataset condensation method based on distribution matching.
Our simple yet effective method outperforms most previous optimization-oriented methods while using far fewer computational resources.
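As a toy illustration of the distribution-matching idea, with an identity feature map and first-moment matching only (the method itself uses learned embeddings and richer statistics):

```python
# Optimize a tiny synthetic set S so its feature statistics match the real
# data's; here "features" are the raw inputs and only the mean is matched.
import numpy as np

rng = np.random.default_rng(0)
real = rng.normal(loc=2.0, size=(5000, 16))   # stand-in for a real dataset
S = rng.normal(size=(10, 16))                 # condensed synthetic set
target = real.mean(axis=0)
for _ in range(200):
    grad = 2.0 * (S.mean(axis=0) - target) / len(S)  # d/dS_i ||mean(S) - target||^2
    S -= 0.5 * grad                                  # same gradient for every row
print(np.abs(S.mean(axis=0) - target).max())         # approaches 0
```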
arXiv Detail & Related papers (2023-07-19T04:07:33Z)
SVNet: Where SO(3) Equivariance Meets Binarization on Point Cloud Representation [65.4396959244269]
The paper tackles the challenge by designing a general framework to construct 3D learning architectures.
The proposed approach can be applied to general backbones like PointNet and DGCNN.
Experiments on ModelNet40, ShapeNet, and the real-world dataset ScanObjectNN demonstrate that the method achieves a good trade-off between efficiency, rotation robustness, and accuracy.
arXiv Detail & Related papers (2022-09-13T12:12:19Z)
Open-Set Semi-Supervised Learning for 3D Point Cloud Understanding [62.17020485045456]
It is commonly assumed in semi-supervised learning (SSL) that the unlabeled data are drawn from the same distribution as that of the labeled ones.
We propose to selectively utilize unlabeled data through sample weighting, so that only conducive unlabeled data is prioritized.
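A hedged sketch of the sample-weighting idea; the sigmoid weighting below is a placeholder, not the paper's scheme:

```python
import numpy as np

def weighted_unsup_loss(per_example_loss, ood_scores, temperature=0.1):
    """per_example_loss, ood_scores: (n,) arrays; a higher score means the
    example is more likely out-of-distribution and should contribute less."""
    z = np.clip((ood_scores - np.median(ood_scores)) / temperature, -50, 50)
    w = 1.0 / (1.0 + np.exp(z))          # down-weight likely-OOD examples
    return float(np.sum(w * per_example_loss) / (np.sum(w) + 1e-8))
```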
arXiv Detail & Related papers (2022-05-02T16:09:17Z)
DANCE: DAta-Network Co-optimization for Efficient Segmentation Model Training and Inference [85.02494022662505]
DANCE is an automated simultaneous data-network co-optimization for efficient segmentation model training and inference.
It integrates automated data slimming, which adaptively downsamples or drops input images and, guided by each image's spatial complexity, controls its contribution to the training loss.
Experiments and ablation studies demonstrate that DANCE achieves an "all-win" towards efficient segmentation.
arXiv Detail & Related papers (2021-07-16T04:58:58Z)
Bandit Data-Driven Optimization [62.01362535014316]
There are four major pain points that a machine learning pipeline must overcome in order to be useful in practical settings.
We introduce bandit data-driven optimization, the first iterative prediction-prescription framework to address these pain points.
We propose PROOF, a novel algorithm for this framework, and formally prove that it is no-regret.
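To make the no-regret claim concrete, here is the classic exponential-weights (Hedge) learner, a standard no-regret algorithm over a finite decision set; it is a generic illustration, not the PROOF algorithm:

```python
import numpy as np

def hedge(loss_rounds, eta=0.1, seed=0):
    """loss_rounds: (T, k) losses in [0, 1]; returns one decision per round.
    Expected regret against the best fixed decision is O(sqrt(T log k))."""
    rng = np.random.default_rng(seed)
    w = np.ones(loss_rounds.shape[1])
    picks = []
    for losses in loss_rounds:
        p = w / w.sum()
        picks.append(int(rng.choice(len(w), p=p)))
        w *= np.exp(-eta * losses)       # multiplicative-weights update
    return picks
```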
arXiv Detail & Related papers (2020-08-26T17:50:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.