Protecting Privacy and Transforming COVID-19 Case Surveillance Datasets
for Public Use
- URL: http://arxiv.org/abs/2101.05093v1
- Date: Wed, 13 Jan 2021 14:24:20 GMT
- Authors: Brian Lee, Brandi Dupervil, Nicholas P. Deputy, Wil Duck, Stephen
Soroka, Lyndsay Bottichio, Benjamin Silk, Jason Price, Patricia Sweeney,
Jennifer Fuld, Todd Weber, Dan Pollock
- Abstract summary: CDC has collected person-level, de-identified data from jurisdictions and currently has over 8 million records.
Data elements were included based on usefulness, public request, and privacy implications.
Specific field values were suppressed to reduce risk of reidentification and exposure of confidential information.
- Score: 0.4462475518267084
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Objectives: Federal open data initiatives that promote increased sharing of
federally collected data are important for transparency, data quality, trust,
and relationships with the public and state, tribal, local, and territorial
(STLT) partners. These initiatives advance understanding of health conditions
and diseases by making data available to more researchers, scientists, and
policymakers for analysis, collaboration, and other valuable uses beyond CDC
responders. This is particularly true for emerging conditions such as COVID-19
where we have much to learn and have evolving data needs. Since the beginning
of the outbreak, CDC has collected person-level, de-identified data from
jurisdictions and currently has over 8 million records, increasing each day.
This paper describes how CDC designed and produces two de-identified public
datasets from these collected data.
Materials and Methods: Data elements were included based on usefulness,
public request, and privacy implications; specific field values were suppressed
to reduce risk of reidentification and exposure of confidential information.
Datasets were created and verified for privacy and confidentiality using data
management platform analytic tools as well as R scripts.
Results: Unrestricted data are available to the public through Data.CDC.gov
and restricted data, with additional fields, are available with a data use
agreement through a private repository on GitHub.com.
Practice Implications: Enriched understanding of the available public data,
the methods used to create these data, and the algorithms used to protect
privacy of de-identified individuals allow for improved data use. Automating
data generation procedures allows greater and more timely sharing of data.
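The suppression step described under Materials and Methods can be illustrated with a small sketch. The abstract does not give CDC's actual rules, so everything here is an assumption made for illustration: the field names, the threshold `k=5`, the `"NA"` placeholder, and the helper names `suppress_small_cells` and `smallest_cell` are all hypothetical. The idea shown is generic small-cell suppression: when a combination of quasi-identifying values is shared by fewer than `k` records, those values are blanked, and a separate check confirms that no small group remains.

```python
from collections import Counter

SUPPRESSED = "NA"  # hypothetical placeholder for a suppressed value


def suppress_small_cells(records, quasi_identifiers, k=5):
    """Return copies of the records with quasi-identifier values replaced
    by SUPPRESSED when fewer than k records share the same combination.

    The threshold k and the field names are illustrative assumptions,
    not the rules used for the published CDC datasets.
    """
    def key(rec):
        return tuple(rec[field] for field in quasi_identifiers)

    counts = Counter(key(rec) for rec in records)
    result = []
    for rec in records:
        new_rec = dict(rec)  # leave the input record untouched
        if counts[key(rec)] < k:
            for field in quasi_identifiers:
                new_rec[field] = SUPPRESSED
        result.append(new_rec)
    return result


def smallest_cell(records, quasi_identifiers):
    """Verification helper: size of the smallest group among records
    whose quasi-identifiers are all unsuppressed (None if no such records)."""
    counts = Counter(
        tuple(rec[field] for field in quasi_identifiers)
        for rec in records
        if all(rec[field] != SUPPRESSED for field in quasi_identifiers)
    )
    return min(counts.values()) if counts else None


# Usage with made-up records: five share one combination, one is unique.
records = [{"age_group": "30-39", "sex": "F", "state": "GA"} for _ in range(5)]
records.append({"age_group": "80+", "sex": "M", "state": "GA"})
released = suppress_small_cells(records, ["age_group", "sex"], k=5)
```

In this sketch the unique record keeps its `state` value but loses `age_group` and `sex`, while the five matching records are released unchanged. A production pipeline would also need to consider whether the suppressed records themselves form an identifiable group, which this example does not address.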
Related papers
- DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing [0.8739101659113155]
We introduce an effective data publishing algorithm, DP-CDA.
Our proposed algorithm generates synthetic datasets by randomly mixing data in a class-specific manner, and inducing carefully-tuned randomness to ensure privacy guarantees.
Our results indicate that synthetic datasets produced using the DP-CDA can achieve superior utility compared to those generated by traditional data publishing algorithms, even when subject to the same privacy requirements.
(2024-11-25T06:14:06Z)
- Mitigating the Privacy Issues in Retrieval-Augmented Generation (RAG) via Pure Synthetic Data [51.41288763521186]
Retrieval-augmented generation (RAG) enhances the outputs of language models by integrating relevant information retrieved from external knowledge sources.
RAG systems may face severe privacy risks when retrieving private data.
We propose using synthetic data as a privacy-preserving alternative for the retrieval data.
(2024-06-20T22:53:09Z)
- Privacy-Preserving Data Sharing in Agriculture: Enforcing Policy Rules for Secure and Confidential Data Synthesis [0.0]
The use of Big Data in farming requires the collection and analysis of data from various sources such as sensors, satellites, and farmer surveys.
There is significant concern regarding the security of this data as well as the privacy of the participants.
Deep learning-based synthetic data generation has been proposed for privacy-preserving data sharing.
We propose a novel framework for enforcing privacy policy rules in privacy-preserving data generation algorithms.
(2023-11-27T00:12:47Z)
- Preserving The Safety And Confidentiality Of Data Mining Information In Health Care: A literature review [0.0]
PPDM techniques enable the extraction of actionable insight from enormous volumes of data.
Disclosing sensitive information infringes on patients' privacy.
This paper aims to conduct a review of related work on privacy-preserving mechanisms, data protection regulations, and mitigating tactics.
(2023-10-30T05:32:15Z)
- A Unified View of Differentially Private Deep Generative Modeling [60.72161965018005]
Data with privacy concerns comes with stringent regulations that frequently prohibit data access and data sharing.
Overcoming these obstacles is key for technological progress in many real-world application scenarios that involve privacy sensitive data.
Differentially private (DP) data publishing provides a compelling solution, where only a sanitized form of the data is publicly released.
(2023-09-27T14:38:16Z)
- More Data Types More Problems: A Temporal Analysis of Complexity, Stability, and Sensitivity in Privacy Policies [0.0]
Data brokers and data processors are part of a multi-billion-dollar industry that profits from collecting, buying, and selling consumer data.
Yet there is little transparency in the data collection industry, which makes it difficult to understand what types of data are being collected, used, and sold.
(2023-02-17T15:21:24Z)
- Certified Data Removal in Sum-Product Networks [78.27542864367821]
Deleting the collected data is often insufficient to guarantee data privacy.
UnlearnSPN is an algorithm that removes the influence of single data points from a trained sum-product network.
(2022-10-04T08:22:37Z)
- Releasing survey microdata with exact cluster locations and additional privacy safeguards [77.34726150561087]
We propose an alternative microdata dissemination strategy that leverages the utility of the original microdata with additional privacy safeguards.
Our strategy reduces the respondents' re-identification risk for any number of disclosed attributes by 60-80% even under re-identification attempts.
(2022-05-24T19:37:11Z)
- Distributed Machine Learning and the Semblance of Trust [66.1227776348216]
Federated Learning (FL) allows the data owner to maintain data governance and perform model training locally without having to share their data.
FL and related techniques are often described as privacy-preserving.
We explain why this term is not appropriate and outline the risks associated with over-reliance on protocols that were not designed with formal definitions of privacy in mind.
(2021-12-21T08:44:05Z)
- Second layer data governance for permissioned blockchains: the privacy management challenge [58.720142291102135]
In pandemic situations, such as the COVID-19 and Ebola outbreaks, sharing health data is crucial to avoid mass infection and decrease the number of deaths.
In this context, permissioned blockchain technology emerges to empower users to exercise their rights, providing data ownership, transparency, and security through an immutable, unified, and distributed database governed by smart contracts.
(2020-10-22T13:19:38Z)
- Utility-aware Privacy-preserving Data Releasing [7.462336024223669]
We propose a two-step perturbation-based privacy-preserving data releasing framework.
First, certain predefined privacy and utility problems are learned from the public domain data.
We then leverage the learned knowledge to precisely perturb the data owners' data into privatized data.
(2020-05-09T05:32:46Z)