An Open Source Python Library for Anonymizing Sensitive Data
- URL: http://arxiv.org/abs/2408.10766v1
- Date: Tue, 20 Aug 2024 12:01:57 GMT
- Title: An Open Source Python Library for Anonymizing Sensitive Data
- Authors: Judith Sáinz-Pardo Díaz, Álvaro López García
- Abstract summary: This paper presents the implementation of a Python library for the anonymization of sensitive tabular data.
The framework provides users with a wide range of anonymization methods that can be applied to the given dataset.
The library has been implemented following best practices for continuous integration and development.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Open science is a fundamental pillar to promote scientific progress and collaboration, based on the principles of open data, open source and open access. However, the requirements for publishing and sharing open data are in many cases difficult to meet in compliance with strict data protection regulations. Consequently, researchers need to rely on proven methods that allow them to anonymize their data without sharing it with third parties. To this end, this paper presents the implementation of a Python library for the anonymization of sensitive tabular data. This framework provides users with a wide range of anonymization methods that can be applied to the given dataset, configured through the set of identifiers, quasi-identifiers, generalization hierarchies and allowed level of suppression, along with the sensitive attribute and the required level of anonymity. The library has been implemented following best practices for continuous integration and development, as well as the use of workflows to test code coverage based on unit and functional tests.
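For readers unfamiliar with the underlying technique, the sketch below shows k-anonymity in miniature: generalize the quasi-identifiers, then suppress any record whose equivalence class is smaller than k. It is a from-scratch illustration using pandas, not the API of the library presented in the paper; the column names, hierarchies and suppression budget are assumptions.

```python
import pandas as pd

def generalize(df: pd.DataFrame, hierarchies: dict) -> pd.DataFrame:
    """Apply one generalization step per quasi-identifier.

    `hierarchies` maps a column name to a function that coarsens a value,
    e.g. an exact age to a 10-year interval.
    """
    out = df.copy()
    for col, fn in hierarchies.items():
        out[col] = out[col].map(fn)
    return out

def k_anonymize(df, quasi_idents, k, hierarchies, max_supp_frac=0.1):
    """Generalize, then suppress rows in equivalence classes of size < k.

    Raises if suppression would exceed the allowed fraction of records.
    """
    gen = generalize(df, hierarchies)
    sizes = gen.groupby(quasi_idents)[quasi_idents[0]].transform("size")
    anon = gen[sizes >= k]
    if len(df) - len(anon) > max_supp_frac * len(df):
        raise ValueError("suppression budget exceeded; generalize further")
    return anon

# Hypothetical toy data: 'age' and 'zip' are quasi-identifiers,
# 'diagnosis' is the sensitive attribute (left untouched here).
df = pd.DataFrame({
    "age": [23, 27, 31, 36, 52, 58],
    "zip": ["39005", "39007", "39005", "39008", "39011", "39012"],
    "diagnosis": ["A", "B", "A", "C", "B", "A"],
})
hier = {"age": lambda a: f"{(a // 10) * 10}-{(a // 10) * 10 + 9}",
        "zip": lambda z: z[:3] + "**"}
print(k_anonymize(df, ["age", "zip"], k=2, hierarchies=hier, max_supp_frac=0.5))
```

With k=2, the toy table needs no suppression once ages are binned into decades and zip codes truncated, since every generalized combination then occurs at least twice.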
Related papers
- DATABench: Evaluating Dataset Auditing in Deep Learning from an Adversarial Perspective [59.66984417026933]
We introduce a novel taxonomy, classifying existing methods based on their reliance on internal features (IF) (inherent to the data) versus external features (EF) (artificially introduced for auditing).
We formulate two primary attack types: evasion attacks, designed to conceal the use of a dataset, and forgery attacks, intending to falsely implicate an unused dataset.
Building on the understanding of existing methods and attack objectives, we further propose systematic attack strategies: decoupling, removal, and detection for evasion; adversarial example-based methods for forgery.
Our benchmark, DATABench, comprises 17 evasion attacks, 5 forgery attacks, and 9
arXiv Detail & Related papers (2025-07-08T03:07:15Z)
- Tau-Eval: A Unified Evaluation Framework for Useful and Private Text Anonymization [5.239989658197324]
We present Tau-Eval, an open-source framework for benchmarking text anonymization methods.
A Python library, code, documentation and tutorials are publicly available.
arXiv Detail & Related papers (2025-06-06T10:59:35Z)
- SKALD: Scalable K-Anonymisation for Large Datasets [4.1034194672472575]
SKALD is a novel algorithm for performing k-anonymisation on large datasets with limited RAM.
Our algorithm offers multi-fold performance improvement over standard k-anonymisation methods.
arXiv Detail & Related papers (2025-05-06T13:38:53Z)
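SKALD's actual algorithm is not detailed in the summary above. As a hedged sketch of the general problem it addresses, k-anonymisation under a memory bound can be approximated by streaming a CSV twice: one pass to count equivalence-class sizes, one pass to suppress undersized classes. The file layout, column names and chunk size below are assumptions.

```python
from collections import Counter
import pandas as pd

def class_sizes(csv_path, quasi_idents, chunksize=100_000):
    """First pass: count equivalence-class sizes without loading the file."""
    sizes = Counter()
    for chunk in pd.read_csv(csv_path, usecols=quasi_idents, chunksize=chunksize):
        sizes.update(chunk[quasi_idents].itertuples(index=False, name=None))
    return sizes

def k_anonymise_stream(csv_path, out_path, quasi_idents, k, chunksize=100_000):
    """Second pass: drop rows whose class is smaller than k, writing incrementally."""
    sizes = class_sizes(csv_path, quasi_idents, chunksize)
    first = True
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        keys = list(chunk[quasi_idents].itertuples(index=False, name=None))
        keep = [sizes[key] >= k for key in keys]
        chunk[keep].to_csv(out_path, mode="w" if first else "a",
                           header=first, index=False)
        first = False
```

Memory then scales with the number of distinct quasi-identifier combinations rather than with the number of rows, which is the kind of trade-off a RAM-bounded method has to make.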
- Tabular Data Adapters: Improving Outlier Detection for Unlabeled Private Data [12.092540602813333]
We introduce Tabular Data Adapters (TDA), a novel method for generating soft labels for unlabeled data in outlier detection tasks.
Our approach offers a scalable, efficient, and cost-effective solution to bridge the gap between public research models and real-world industrial applications.
arXiv Detail & Related papers (2025-04-29T15:38:43Z)
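The TDA method itself is not reproduced here. As a loose illustration of producing soft labels for unlabeled tabular data, the sketch below averages normalized scores from an ensemble of off-the-shelf outlier detectors; scikit-learn is assumed to be available, and the choice of detectors and equal weighting are assumptions, not the paper's design.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

def soft_outlier_labels(X: np.ndarray) -> np.ndarray:
    """Return per-row outlier scores in [0, 1] from an ensemble of detectors."""
    iso = IsolationForest(random_state=0).fit(X)
    lof = LocalOutlierFactor()
    lof.fit(X)
    # Both detectors score "normal" high, so negate to score "outlier" high.
    raw = np.column_stack([-iso.score_samples(X), -lof.negative_outlier_factor_])
    # Min-max normalize each detector's scores, then average into a soft label.
    mins, maxs = raw.min(axis=0), raw.max(axis=0)
    return ((raw - mins) / (maxs - mins + 1e-12)).mean(axis=1)

# Toy data: 200 inliers plus one obvious outlier.
X = np.vstack([np.random.default_rng(0).normal(size=(200, 4)),
               [[8.0, 8.0, 8.0, 8.0]]])
scores = soft_outlier_labels(X)
print(scores.argmax())  # expected: 200, the planted outlier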
- Augmenting Anonymized Data with AI: Exploring the Feasibility and Limitations of Large Language Models in Data Enrichment [3.459382629188014]
Large Language Models (LLMs) have demonstrated advanced capabilities in both text generation and comprehension.
Their application to data archives might facilitate the privatization of sensitive information about the data subjects.
This data, if not safeguarded, may pose privacy risks in terms of both disclosure and identification.
arXiv Detail & Related papers (2025-04-03T13:26:59Z)
- OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models [70.72097493954067]
Large language models (LLMs) for code have become indispensable in various domains, including code generation, reasoning tasks and agent systems.
While open-access code LLMs are increasingly approaching the performance levels of proprietary models, high-quality open code LLMs suitable for rigorous scientific investigation remain limited.
We introduce OpenCoder, a top-tier code LLM that not only achieves performance comparable to leading models but also serves as an "open cookbook" for the research community.
arXiv Detail & Related papers (2024-11-07T17:47:25Z)
- Toxicity of the Commons: Curating Open-Source Pre-Training Data [6.137272725645159]
We propose a data curation pipeline to reduce harmful outputs by models trained on public domain data.
Current state-of-the-art approaches to toxicity filtering are often infeasible or inappropriate for open data models.
We create a custom training dataset, ToxicCommons, composed of texts classified along five different dimensions.
arXiv Detail & Related papers (2024-10-29T23:00:05Z)
- Introducing a Comprehensive, Continuous, and Collaborative Survey of Intrusion Detection Datasets [2.7082111912355877]
COMIDDS is an effort to comprehensively survey intrusion detection datasets with an unprecedented level of detail.
It provides structured and critical information on each dataset, including actual data samples and links to relevant publications.
arXiv Detail & Related papers (2024-08-05T14:40:41Z)
- Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenge of re-identification attacks enabled by Large Language Models.
This paper proposes a framework composed of three LLM-based components: a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
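The paper's three components are LLM-based; as a structural sketch only, the loop below shows how a privacy evaluator, a utility evaluator, and an optimizer might interact. The scoring functions here are crude stand-in heuristics, not the paper's models, and all names are assumptions.

```python
import re

def privacy_score(text: str) -> float:
    """Stand-in privacy evaluator: penalize capitalized tokens (possible names)."""
    tokens = text.split()
    named = sum(1 for t in tokens if re.match(r"[A-Z][a-z]+", t))
    return 1.0 - named / max(len(tokens), 1)

def utility_score(original: str, anonymized: str) -> float:
    """Stand-in utility evaluator: token overlap with the original text."""
    a, b = set(original.lower().split()), set(anonymized.lower().split())
    return len(a & b) / max(len(a), 1)

def optimize(original: str, candidates: list[str], alpha: float = 0.5) -> str:
    """Optimization component: pick the best privacy/utility trade-off."""
    return max(candidates,
               key=lambda c: alpha * privacy_score(c)
                             + (1 - alpha) * utility_score(original, c))

original = "Alice Smith visited the cardiology unit on Monday."
candidates = [
    "[PATIENT] visited the cardiology unit on [DATE].",
    "A patient visited a hospital department.",
]
print(optimize(original, candidates))  # keeps more utility at equal privacy
```

The point of the loop is that privacy and utility are scored separately and traded off explicitly, rather than baked into a single rewriting pass.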
- Privacy-Preserving Hierarchical Anonymization Framework over Encrypted Data [0.061446808540639365]
This study proposes a hierarchical k-anonymization framework, composed of two types of domains, that combines homomorphic encryption and secret sharing.
The experimental results show that connecting two domains can accelerate the anonymization process, indicating that the proposed secure hierarchical architecture is practical and efficient.
arXiv Detail & Related papers (2023-10-19T01:08:37Z)
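The paper's protocol is not detailed in this summary. As a small, hedged illustration of the homomorphic-encryption half of the idea, two domains can each encrypt a local equivalence-class count and let an untrusted aggregator add the ciphertexts without learning either count. This assumes the third-party `phe` (python-paillier) package; the counts and variable names are illustrative.

```python
# pip install phe  (python-paillier, assumed available)
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Each domain encrypts its local count of an equivalence class.
count_domain_a = public_key.encrypt(7)
count_domain_b = public_key.encrypt(5)

# An untrusted aggregator adds ciphertexts without seeing the plaintexts.
encrypted_total = count_domain_a + count_domain_b

# Only the key holder can recover the joint count.
print(private_key.decrypt(encrypted_total))  # 12
```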
- Stop Uploading Test Data in Plain Text: Practical Strategies for Mitigating Data Contamination by Evaluation Benchmarks [70.39633252935445]
Data contamination has become prevalent and challenging with the rise of models pretrained on large automatically-crawled corpora.
For closed models, the training data becomes a trade secret, and even for open models, it is not trivial to detect contamination.
We propose three strategies that can make a difference: (1) Test data made public should be encrypted with a public key and licensed to disallow derivative distribution; (2) demand training exclusion controls from closed API holders, and protect your test data by refusing to evaluate without them; and (3) avoid data which appears with its solution on the internet, and release the web-page context of internet-derived
arXiv Detail & Related papers (2023-05-17T12:23:38Z)
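Strategy (1) above is straightforward to act on. As a hedged sketch using the third-party `cryptography` package, a test file can be hybrid-encrypted so only the key holder reads it: a fresh symmetric key encrypts the data, and RSA-OAEP wraps that key. The payload and key size are illustrative assumptions.

```python
# pip install cryptography  (assumed available)
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# The benchmark owner generates a keypair once; the public key is published.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

# Hybrid encryption: the symmetric key handles the (possibly large) test set,
# RSA-OAEP encrypts only the small symmetric key.
sym_key = Fernet.generate_key()
ciphertext = Fernet(sym_key).encrypt(b"question\tanswer\n...")
wrapped_key = public_key.encrypt(sym_key, oaep)

# Decryption side: unwrap the symmetric key, then decrypt the data.
recovered = Fernet(private_key.decrypt(wrapped_key, oaep)).decrypt(ciphertext)
assert recovered == b"question\tanswer\n..."
```

Publishing only `ciphertext` and `wrapped_key` keeps the plain text out of crawled corpora while still letting anyone with the private key evaluate.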
- Reinforcement Learning on Encrypted Data [58.39270571778521]
We present a preliminary, experimental study of how a DQN agent trained on encrypted states performs in environments with discrete and continuous state spaces.
Our results highlight that the agent is still capable of learning in small state spaces even in presence of non-deterministic encryption, but performance collapses in more complex environments.
arXiv Detail & Related papers (2021-09-16T21:59:37Z)
- OpenCoS: Contrastive Semi-supervised Learning for Handling Open-set Unlabeled Data [65.19205979542305]
Unlabeled data may include out-of-class samples in practice.
OpenCoS is a method for handling this realistic semi-supervised learning scenario.
arXiv Detail & Related papers (2021-06-29T06:10:05Z)
- Secure Sum Outperforms Homomorphic Encryption in (Current) Collaborative Deep Learning [7.690774882108066]
We discuss methods for training neural networks on the joint data of different data owners while keeping each party's input confidential.
We show that a less complex and computationally less expensive secure sum protocol exhibits superior properties in terms of both collusion-resistance and runtime.
arXiv Detail & Related papers (2020-06-02T23:03:32Z)
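For context on the comparison above, a plain additive secure-sum protocol fits in a few lines: each party splits its private value into random shares modulo a large constant, so no single recipient ever sees more than masked noise. This is a generic textbook sketch with illustrative inputs, not the paper's exact protocol.

```python
import secrets

MOD = 2**61 - 1  # shares live in Z_MOD; must exceed any possible total

def make_shares(value: int, n_parties: int) -> list[int]:
    """Split `value` into n additive shares that sum to it modulo MOD."""
    shares = [secrets.randbelow(MOD) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MOD)
    return shares

# Three parties with private inputs (illustrative values).
inputs = [14, 27, 9]
n = len(inputs)

# Party i sends its j-th share to party j; each share alone looks random.
all_shares = [make_shares(v, n) for v in inputs]

# Each party sums the shares it received and publishes only that partial sum.
partials = [sum(all_shares[i][j] for i in range(n)) % MOD for j in range(n)]

# Anyone can combine the partial sums into the total without learning inputs.
print(sum(partials) % MOD)  # 50
```

No key material or ciphertext arithmetic is needed, which illustrates the simplicity argument the paper makes against homomorphic encryption; an honest party's input stays hidden unless all other parties collude.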
- TIPRDC: Task-Independent Privacy-Respecting Data Crowdsourcing Framework for Deep Learning with Anonymized Intermediate Representations [49.20701800683092]
We present TIPRDC, a task-independent privacy-respecting data crowdsourcing framework with anonymized intermediate representation.
The goal of this framework is to learn a feature extractor that hides private information from the intermediate representations, while maximally retaining the original information embedded in the raw data so that the data collector can accomplish unknown learning tasks.
arXiv Detail & Related papers (2020-05-23T06:21:26Z)
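As a schematic illustration of that objective (not TIPRDC's actual architecture or losses), the PyTorch sketch below alternates between training an adversary to recover a private attribute from the features and training the encoder to keep task accuracy while defeating that adversary. All dimensions and data are toy assumptions.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy setup: 16-d inputs, a binary task label, and a binary private attribute.
encoder = nn.Sequential(nn.Linear(16, 8), nn.ReLU())
task_head = nn.Linear(8, 2)   # stand-in for the collector's downstream task
adversary = nn.Linear(8, 2)   # tries to recover the private attribute
opt_enc = torch.optim.Adam(
    list(encoder.parameters()) + list(task_head.parameters()), lr=1e-2)
opt_adv = torch.optim.Adam(adversary.parameters(), lr=1e-2)
ce = nn.CrossEntropyLoss()

x = torch.randn(256, 16)
y_task = torch.randint(0, 2, (256,))
y_private = torch.randint(0, 2, (256,))

for step in range(200):
    # 1) Train the adversary on frozen features.
    z = encoder(x).detach()
    loss_adv = ce(adversary(z), y_private)
    opt_adv.zero_grad(); loss_adv.backward(); opt_adv.step()

    # 2) Train encoder + task head: keep utility, maximize the adversary's loss.
    z = encoder(x)
    loss = ce(task_head(z), y_task) - ce(adversary(z), y_private)
    opt_enc.zero_grad(); loss.backward(); opt_enc.step()
```

The minus sign in the second loss is the whole trick: the encoder is rewarded for representations from which the task label is predictable but the private attribute is not.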