Diverse Community Data for Benchmarking Data Privacy Algorithms
- URL: http://arxiv.org/abs/2306.13216v3
- Date: Tue, 31 Oct 2023 19:50:36 GMT
- Title: Diverse Community Data for Benchmarking Data Privacy Algorithms
- Authors: Aniruddha Sen, Christine Task, Dhruv Kapur, Gary Howarth, Karan Bhagat
- Abstract summary: The Collaborative Research Cycle (CRC) is a National Institute of Standards and Technology (NIST) benchmarking program.
Deidentification algorithms are vulnerable to the same bias and privacy issues that impact other data analytics and machine learning applications.
This paper summarizes four CRC contributions on the relationship between diverse populations and challenges for equitable deidentification.
- Score: 0.2999888908665658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Collaborative Research Cycle (CRC) is a National Institute of Standards
and Technology (NIST) benchmarking program intended to strengthen understanding
of tabular data deidentification technologies. Deidentification algorithms are
vulnerable to the same bias and privacy issues that impact other data analytics
and machine learning applications, and can even amplify those issues by
contaminating downstream applications. This paper summarizes four CRC
contributions: theoretical work on the relationship between diverse populations
and challenges for equitable deidentification; public benchmark data focused on
diverse populations and challenging features; a comprehensive open source suite
of evaluation metrology for deidentified datasets; and an archive of more than
450 deidentified data samples from a broad range of techniques. The initial set
of evaluation results demonstrate the value of these tools for investigations
in this field.
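To illustrate the kind of evaluation metrology for deidentified datasets that the abstract mentions, the sketch below computes a simple univariate fidelity measure: the total variation distance between a column's marginal distribution in the original and deidentified data. This is a minimal illustrative example, not the CRC's actual metric suite; the function name `marginal_tvd` is hypothetical.

```python
from collections import Counter

def marginal_tvd(orig_column, deid_column):
    """Total variation distance between the empirical distributions
    of one column in the original and deidentified data.

    Returns 0.0 when the marginals match exactly and 1.0 when the
    two columns share no values at all.
    """
    c_orig, c_deid = Counter(orig_column), Counter(deid_column)
    n_orig, n_deid = len(orig_column), len(deid_column)
    keys = set(c_orig) | set(c_deid)  # union of observed categories
    return 0.5 * sum(
        abs(c_orig[k] / n_orig - c_deid[k] / n_deid) for k in keys
    )
```

A real evaluation suite would aggregate such scores over many columns and column pairs; this sketch only shows the shape of a single-marginal comparison.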
Related papers
- Tabular Data Synthesis with Differential Privacy: A Survey [24.500349285858597]
Data sharing is a prerequisite for collaborative innovation, enabling organizations to leverage diverse datasets for deeper insights.
Data synthesis tackles this by generating artificial datasets that preserve the statistical characteristics of real data.
Differentially private data synthesis has emerged as a promising approach to privacy-aware data sharing.
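As a rough illustration of the differentially private synthesis idea summarized above, the sketch below releases noisy histogram counts via the Laplace mechanism, assuming each record contributes to exactly one bin (per-bin sensitivity 1). It is a minimal example, not the survey's methods; the names `dp_histogram` and `laplace_noise` are hypothetical.

```python
import random
from collections import Counter

def laplace_noise(scale):
    # A Laplace(0, scale) draw is the difference of two i.i.d.
    # exponential draws with mean `scale` (stdlib-only sampling).
    lam = 1.0 / scale
    return random.expovariate(lam) - random.expovariate(lam)

def dp_histogram(values, epsilon):
    """Release a histogram with epsilon-differential privacy.

    Each record changes one bin count by 1, so adding Laplace(1/epsilon)
    noise to every bin satisfies epsilon-DP for that sensitivity.
    """
    counts = Counter(values)
    scale = 1.0 / epsilon
    return {k: v + laplace_noise(scale) for k, v in counts.items()}
```

Noisy marginals like these are the building block of many DP synthesis methods, which then sample synthetic records consistent with the released counts.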
arXiv Detail & Related papers (2024-11-04T06:32:48Z)
- Comprehensive Review and Empirical Evaluation of Causal Discovery Algorithms for Numerical Data [3.9523536371670045]
Causal analysis has become an essential component in understanding the underlying causes of phenomena across various fields.
Existing literature on causal discovery algorithms is fragmented, with inconsistent methodologies.
Comprehensive evaluations are also lacking: data characteristics are rarely analyzed jointly when benchmarking algorithms.
arXiv Detail & Related papers (2024-07-17T23:47:05Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- On the Cross-Dataset Generalization of Machine Learning for Network Intrusion Detection [50.38534263407915]
Network Intrusion Detection Systems (NIDS) are a fundamental tool in cybersecurity.
Their ability to generalize across diverse networks is a critical factor in their effectiveness and a prerequisite for real-world applications.
In this study, we conduct a comprehensive analysis on the generalization of machine-learning-based NIDS through an extensive experimentation in a cross-dataset framework.
arXiv Detail & Related papers (2024-02-15T14:39:58Z)
- A Survey on Causal Discovery Methods for I.I.D. and Time Series Data [4.57769506869942]
Causal Discovery (CD) algorithms can identify the cause-effect relationships among the variables of a system from related observational data.
We present an extensive discussion on the methods designed to perform causal discovery from both independent and identically distributed (I.I.D.) data and time series data.
arXiv Detail & Related papers (2023-03-27T09:21:41Z)
- Human-Centric Multimodal Machine Learning: Recent Advances and Testbed on AI-based Recruitment [66.91538273487379]
There is a certain consensus about the need to develop AI applications with a Human-Centric approach.
Human-Centric Machine Learning needs to be developed based on four main requirements: (i) utility and social good; (ii) privacy and data ownership; (iii) transparency and accountability; and (iv) fairness in AI-driven decision-making processes.
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
arXiv Detail & Related papers (2023-02-13T16:44:44Z)
- Detection and Evaluation of Clusters within Sequential Data [58.720142291102135]
Clustering algorithms for Block Markov Chains possess theoretical optimality guarantees.
The sequential data studied is derived from human DNA, written text, animal movement data and financial markets.
It is found that the Block Markov Chain model assumption can indeed produce meaningful insights in exploratory data analyses.
arXiv Detail & Related papers (2022-10-04T15:22:39Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
- Algorithmic Fairness Datasets: the Story so Far [68.45921483094705]
Data-driven algorithms are studied in diverse domains to support critical decisions, directly impacting people's well-being.
A growing community of researchers has been investigating the equity of existing algorithms and proposing novel ones, advancing the understanding of risks and opportunities of automated decision-making for historically disadvantaged populations.
Progress in fair Machine Learning hinges on data, which can be appropriately used only if adequately documented.
Unfortunately, the algorithmic fairness community suffers from a collective data documentation debt caused by a lack of information on specific resources (opacity) and scatteredness of available information (sparsity).
arXiv Detail & Related papers (2022-02-03T17:25:46Z)
- Comparative Analysis of Extreme Verification Latency Learning Algorithms [3.3439097577935213]
This paper is a comprehensive survey and comparative analysis of Extreme Verification Latency (EVL) algorithms, pointing out the weaknesses and strengths of different approaches.
This work is a first effort to provide the research community with a review of existing algorithms in this field.
arXiv Detail & Related papers (2020-11-26T16:34:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.