SKALD: Scalable K-Anonymisation for Large Datasets
- URL: http://arxiv.org/abs/2505.03529v2
- Date: Tue, 01 Jul 2025 10:09:57 GMT
- Title: SKALD: Scalable K-Anonymisation for Large Datasets
- Authors: Kailash Reddy, Novoneel Chakraborty, Amogh Dharmavaram, Anshoo Tandon,
- Abstract summary: SKALD is a novel algorithm for performing k-anonymisation on large datasets with limited RAM. Our algorithm offers multi-fold performance improvement over standard k-anonymisation methods.
- Score: 4.1034194672472575
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Data privacy and anonymisation are critical concerns in today's data-driven society, particularly when handling personal and sensitive user data. Regulatory frameworks worldwide recommend privacy-preserving protocols such as k-anonymisation to de-identify releases of tabular data. Available hardware resources provide an upper bound on the maximum size of dataset that can be processed at a time. Large datasets with sizes exceeding this upper bound must be broken up into smaller data chunks for processing. In these cases, standard k-anonymisation tools such as ARX can only operate on a per-chunk basis. This paper proposes SKALD, a novel algorithm for performing k-anonymisation on large datasets with limited RAM. Our SKALD algorithm offers multi-fold performance improvement over standard k-anonymisation methods by extracting and combining sufficient statistics from each chunk during processing to ensure successful k-anonymisation while providing better utility.
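The abstract does not spell out which sufficient statistics SKALD extracts from each chunk, so the following is only a minimal sketch of the chunk-wise idea: each RAM-sized chunk is reduced to counts of generalised quasi-identifier combinations, and the merged counts are then tested for k-anonymity. The file name, column names, and generalisation rule below are hypothetical placeholders, not the paper's actual statistics or anonymisation step.

```python
# Minimal sketch of chunk-wise k-anonymity checking in the spirit of SKALD.
# Assumptions (not from the paper): the "sufficient statistics" are per-chunk
# counts of generalised quasi-identifier combinations, merged across chunks.
# The file name, columns, and generalisation rule are hypothetical.
import csv
from collections import Counter

K = 5  # required minimum equivalence-class size

def generalise(row):
    """Map a record to a generalised quasi-identifier tuple (toy rule)."""
    age_bucket = f"{(int(row['age']) // 10) * 10}s"  # e.g. 37 -> "30s"
    zip_prefix = row["zipcode"][:3] + "**"           # e.g. 56001 -> "560**"
    return (age_bucket, zip_prefix)

def chunk_statistics(rows):
    """Sufficient statistic for one chunk: equivalence-class counts."""
    return Counter(generalise(r) for r in rows)

def k_anonymous(path, chunk_size=100_000):
    """Stream the file in RAM-bounded chunks, merge counts, test k-anonymity."""
    merged = Counter()
    chunk = []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            chunk.append(row)
            if len(chunk) == chunk_size:
                merged.update(chunk_statistics(chunk))
                chunk = []
    if chunk:
        merged.update(chunk_statistics(chunk))
    # k-anonymity holds if every equivalence class has at least K records.
    return all(count >= K for count in merged.values())

if __name__ == "__main__":
    print(k_anonymous("large_dataset.csv"))
```

Because equivalence-class counts add across chunks, only the merged counter has to stay in RAM at once, which is the kind of per-chunk statistic the abstract alludes to.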
Related papers
- Improving Noise Efficiency in Privacy-preserving Dataset Distillation [59.57846442477106]
We introduce a novel framework that decouples sampling from optimization for better convergence and improves signal quality. On CIFAR-10, our method achieves a 10.0% improvement with 50 images per class and an 8.3% increase with just one-fifth the distilled set size of previous state-of-the-art methods.
arXiv Detail & Related papers (2025-08-03T13:15:52Z) - Scalable contribution bounding to achieve privacy [62.6490768306231]
In modern datasets, enforcing user-level privacy requires capping each user's total contribution. Existing algorithms for this task are computationally intensive and do not scale to the massive datasets prevalent today. Our approach models the complex ownership structure as a hypergraph, where users are vertices and records are hyperedges. A record is added to the final dataset only if all its owners unanimously agree, thereby ensuring that no user's predefined contribution limit is violated.
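A hedged sketch of the unanimous-agreement rule described above: each record (hyperedge) is kept only if every owning user (vertex) still has room under a fixed contribution cap. The greedy one-pass scan and the data layout are illustrative assumptions, not the paper's scalable algorithm.

```python
# Toy illustration of unanimous-agreement contribution bounding.
# The greedy single pass is an assumption for illustration only.
from collections import defaultdict

def bound_contributions(records, owners_of, cap):
    """records: iterable of record ids; owners_of: record id -> set of user ids.
    Keep a record only if all of its owners are still below the cap."""
    used = defaultdict(int)  # user id -> contributions accepted so far
    kept = []
    for rec in records:
        owners = owners_of[rec]
        if all(used[u] < cap for u in owners):  # unanimous agreement
            kept.append(rec)
            for u in owners:
                used[u] += 1
    return kept

# Example: "r2" is dropped because "alice" already reached her cap of 1.
owners = {"r1": {"alice"}, "r2": {"alice", "bob"}, "r3": {"bob"}}
print(bound_contributions(["r1", "r2", "r3"], owners, cap=1))  # ['r1', 'r3']
```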
arXiv Detail & Related papers (2025-07-31T11:14:17Z) - Leveraging Vertical Public-Private Split for Improved Synthetic Data Generation [9.819636361032256]
Differentially Private Synthetic Data Generation is a key enabler of private and secure data sharing. Recent literature has explored scenarios where a small amount of public data is used to help enhance the quality of synthetic data. We propose a novel framework that adapts horizontal public-assisted methods into the vertical setting.
arXiv Detail & Related papers (2025-04-15T08:59:03Z) - Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs [20.774525687291167]
We propose a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale finetuning. CTCL pretrains a lightweight 140M-parameter conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP-finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram.
arXiv Detail & Related papers (2025-03-16T04:00:32Z) - A CLIP-Powered Framework for Robust and Generalizable Data Selection [51.46695086779598]
Real-world datasets often contain redundant and noisy data, which negatively impacts training efficiency and model performance.
Data selection has shown promise in identifying the most representative samples from the entire dataset.
We propose a novel CLIP-powered data selection framework that leverages multimodal information for more robust and generalizable sample selection.
arXiv Detail & Related papers (2024-10-15T03:00:58Z) - An Open Source Python Library for Anonymizing Sensitive Data [0.0]
This paper presents the implementation of a Python library for the anonymization of sensitive tabular data.
The framework provides users with a wide range of anonymization methods that can be applied on the given dataset.
The library has been implemented following best practices for integration and continuous development.
arXiv Detail & Related papers (2024-08-20T12:01:57Z) - Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenge of re-identification attacks enabled by Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z) - A Trajectory K-Anonymity Model Based on Point Density and Partition [0.0]
This paper develops a trajectory K-anonymity model based on Point Density and Partition (KPDP).
It successfully resists re-identification attacks and reduces the data utility loss of the k-anonymized dataset.
arXiv Detail & Related papers (2023-07-31T17:10:56Z) - Large-scale Fully-Unsupervised Re-Identification [78.47108158030213]
We propose two strategies to learn from large-scale unlabeled data.
The first strategy performs local neighborhood sampling to reduce the dataset size without violating neighborhood relationships.
The second strategy leverages a novel re-ranking technique, which has a lower worst-case time complexity and reduces the memory complexity from O(n^2) to O(kn) with k ≪ n.
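For illustration, a small sketch of the O(kn)-memory idea mentioned in the entry above: keep only each sample's k nearest neighbours, computed block by block, so the full n x n distance matrix is never materialised. The brute-force chunked search and feature shapes are placeholders, not the paper's re-ranking method.

```python
# Sketch: store k-nearest-neighbour lists (O(kn) memory) instead of the
# full pairwise distance matrix (O(n^2)). Shapes and data are illustrative.
import numpy as np

def topk_neighbours(features, k=30, block=1024):
    """Return (n, k) neighbour indices and distances, one block at a time."""
    n = features.shape[0]
    idx = np.empty((n, k), dtype=np.int64)
    dist = np.empty((n, k), dtype=features.dtype)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # distances for this block only: shape (block, n), never (n, n)
        d = np.linalg.norm(
            features[start:stop, None, :] - features[None, :, :], axis=-1
        )
        order = np.argsort(d, axis=1)[:, :k]  # includes the sample itself
        idx[start:stop] = order
        dist[start:stop] = np.take_along_axis(d, order, axis=1)
    return idx, dist

feats = np.random.rand(500, 64).astype(np.float32)
nbr_idx, nbr_dist = topk_neighbours(feats, k=10)
print(nbr_idx.shape)  # (500, 10)
```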
arXiv Detail & Related papers (2023-07-26T16:19:19Z) - LargeST: A Benchmark Dataset for Large-Scale Traffic Forecasting [65.71129509623587]
Road traffic forecasting plays a critical role in smart city initiatives and has experienced significant advancements thanks to the power of deep learning.
However, the promising results achieved on current public datasets may not be applicable to practical scenarios.
We introduce the LargeST benchmark dataset, which includes a total of 8,600 sensors in California with a 5-year time coverage.
arXiv Detail & Related papers (2023-06-14T05:48:36Z) - SEAM: Searching Transferable Mixed-Precision Quantization Policy through Large Margin Regularization [50.04951511146338]
Mixed-precision quantization (MPQ) suffers from the time-consuming process of searching the optimal bit-width allocation for each layer.
This paper proposes a novel method for efficiently searching for effective MPQ policies using a small proxy dataset.
arXiv Detail & Related papers (2023-02-14T05:47:45Z) - Scotch: An Efficient Secure Computation Framework for Secure Aggregation [0.0]
Federated learning enables multiple data owners to jointly train a machine learning model without revealing their private datasets.
A malicious aggregation server might use the model parameters to derive sensitive information about the training dataset used.
We propose Scotch, a decentralized m-party secure-computation framework for federated aggregation.
arXiv Detail & Related papers (2023-02-14T05:47:45Z) - Meta Clustering Learning for Large-scale Unsupervised Person Re-identification [124.54749810371986]
We propose a "small data for big task" paradigm dubbed Meta Clustering Learning (MCL).
MCL only pseudo-labels a subset of the entire unlabeled data via clustering to save computing for the first-phase training.
Our method significantly reduces computational cost while achieving comparable or even better performance than prior works.
arXiv Detail & Related papers (2021-11-19T04:10:18Z)