Related papers: Continual Release of Differentially Private Synthetic Data from Longitudinal Data Collections

URL: http://arxiv.org/abs/2306.07884v2
Date: Fri, 24 May 2024 18:55:41 GMT
Title: Continual Release of Differentially Private Synthetic Data from Longitudinal Data Collections
Authors: Mark Bun, Marco Gaboardi, Marcel Neunhoeffer, Wanrong Zhang
Abstract summary: We study the problem of continually releasing differentially private synthetic data from longitudinal data collections. We introduce a model where, in every time step, each individual reports a new data element. We give continual synthetic data generation algorithms that preserve two basic types of queries.
Score: 19.148874215745135
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Motivated by privacy concerns in long-term longitudinal studies in medical and social science research, we study the problem of continually releasing differentially private synthetic data from longitudinal data collections. We introduce a model where, in every time step, each individual reports a new data element, and the goal of the synthesizer is to incrementally update a synthetic dataset in a consistent way to capture a rich class of statistical properties. We give continual synthetic data generation algorithms that preserve two basic types of queries: fixed time window queries and cumulative time queries. We show nearly tight upper bounds on the error rates of these algorithms and demonstrate their empirical performance on realistically sized datasets from the U.S. Census Bureau's Survey of Income and Program Participation.
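To make the two query types named in the abstract concrete, here is a minimal, non-private sketch. It assumes binary reports and one plausible reading of the terms: a fixed time window query counts individuals whose most recent reports match a pattern, and a cumulative time query counts individuals whose running total of reports reaches a threshold. The function names, the toy data, and these definitions are illustrative assumptions, not the paper's algorithms or notation, and no differential privacy noise is added.

```python
from typing import Sequence


def fixed_window_fraction(data: Sequence[Sequence[int]], t: int,
                          pattern: Sequence[int]) -> float:
    """Fraction of individuals whose reports in the window ending at
    time step t (1-indexed, inclusive) equal `pattern`.

    Hypothetical helper: one assumed reading of a fixed time window query.
    """
    w = len(pattern)
    if t < w or t > len(data[0]):
        raise ValueError("window does not fit inside the observed time steps")
    matches = sum(1 for row in data if list(row[t - w:t]) == list(pattern))
    return matches / len(data)


def cumulative_fraction(data: Sequence[Sequence[int]], t: int,
                        threshold: int) -> float:
    """Fraction of individuals with at least `threshold` ones among
    their first t reports.

    Hypothetical helper: one assumed reading of a cumulative time query.
    """
    if t < 1 or t > len(data[0]):
        raise ValueError("t must lie within the observed time steps")
    hits = sum(1 for row in data if sum(row[:t]) >= threshold)
    return hits / len(data)


# Toy longitudinal dataset (made up): 4 individuals, 5 time steps,
# one binary report per individual per step.
toy_data = [
    [0, 1, 1, 0, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]

window_answer = fixed_window_fraction(toy_data, t=3, pattern=[1, 1])
cumulative_answer = cumulative_fraction(toy_data, t=5, threshold=3)
```

A continual synthesizer in the abstract's setting would need synthetic records whose answers to such queries stay close to the true answers at every time step; this sketch only evaluates the queries on raw data.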

Related papers

Understanding the Influence of Synthetic Data for Text Embedders [52.04771455432998]
We first reproduce and publicly release the synthetic data proposed by Wang et al. We critically examine where exactly synthetic data improves model generalization. Our findings highlight the limitations of current synthetic data approaches for building general-purpose embedders.
arXiv Detail & Related papers (2025-09-07T19:28:52Z)
SynDelay: A Synthetic Dataset for Delivery Delay Prediction [50.56729406793283]
We present SynDelay, a synthetic dataset designed for delivery delay prediction. It is publicly available through the Supply Chain Data Hub, an open initiative promoting dataset sharing and benchmarking in supply chain AI.
arXiv Detail & Related papers (2025-08-30T21:54:37Z)
TimeGraph: Synthetic Benchmark Datasets for Robust Time-Series Causal Discovery [4.07304559469381]
We introduce TimeGraph, a comprehensive suite of synthetic time-series benchmark datasets. Each dataset is accompanied by a fully specified causal graph featuring varying densities and diverse noise distributions. We demonstrate the utility of TimeGraph through systematic evaluations of state-of-the-art causal discovery algorithms.
arXiv Detail & Related papers (2025-06-02T06:34:11Z)
Tackling Data Heterogeneity in Federated Time Series Forecasting [61.021413959988216]
Time series forecasting plays a critical role in various real-world applications, including energy consumption prediction, disease transmission monitoring, and weather forecasting. Most existing methods rely on a centralized training paradigm, where large amounts of data are collected from distributed devices to a central cloud server. We propose a novel framework, Fed-TREND, to address data heterogeneity by generating informative synthetic data as auxiliary knowledge carriers.
arXiv Detail & Related papers (2024-11-24T04:56:45Z)
Exploring the Landscape for Generative Sequence Models for Specialized Data Synthesis [0.0]
This paper introduces a novel approach that leverages three generative models of varying complexity to synthesize Malicious Network Traffic. Our approach transforms numerical data into text, re-framing data generation as a language modeling task. Our method surpasses state-of-the-art generative models in producing high-fidelity synthetic data.
arXiv Detail & Related papers (2024-11-04T09:51:10Z)
Are Synthetic Time-series Data Really not as Good as Real Data? [29.852306720544224]
Time-series data presents limitations stemming from data quality issues, bias and vulnerabilities, and generalization problems. We introduce InfoBoost -- a highly versatile cross-domain data synthesizing framework with time series representation learning capability. We have developed a method based on synthetic data that enables model training without the need for real data, surpassing the performance of models trained with real data.
arXiv Detail & Related papers (2024-02-01T13:59:04Z)
Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark [56.8042116967334]
Synthetic data serves as an alternative in training machine learning models. Ensuring that synthetic data mirrors the complex nuances of real-world data is a challenging task. This paper explores the potential of integrating data-centric AI techniques to guide the synthetic data generation process.
arXiv Detail & Related papers (2023-10-25T20:32:02Z)
Differentially Private Synthetic Data Using KD-Trees [11.96971298978997]
We exploit space partitioning techniques together with noise perturbation and thus achieve intuitive and transparent algorithms. We propose both data independent and data dependent algorithms for $\epsilon$-differentially private synthetic data generation. We show empirical utility improvements over the prior work, and discuss performance of our algorithm on a downstream classification task on a real dataset.
arXiv Detail & Related papers (2023-06-19T17:08:32Z)
TSGM: A Flexible Framework for Generative Modeling of Synthetic Time Series [61.436361263605114]
Time series data are often scarce or highly sensitive, which precludes the sharing of data between researchers and industrial organizations. We introduce Time Series Generative Modeling (TSGM), an open-source framework for the generative modeling of synthetic time series.
arXiv Detail & Related papers (2023-05-19T10:11:21Z)
Synthetic data generation for a longitudinal cohort study -- Evaluation, method extension and reproduction of published data analysis results [0.32593385688760446]
In the health sector, access to individual-level data is often challenging due to privacy concerns. A promising alternative is the generation of fully synthetic data. In this study, we use a state-of-the-art synthetic data generation method.
arXiv Detail & Related papers (2023-05-12T13:13:55Z)
Beyond Privacy: Navigating the Opportunities and Challenges of Synthetic Data [91.52783572568214]
Synthetic data may become a dominant force in the machine learning world, promising a future where datasets can be tailored to individual needs. We discuss which fundamental challenges the community needs to overcome for wider relevance and application of synthetic data.
arXiv Detail & Related papers (2023-04-07T16:38:40Z)
Grouped self-attention mechanism for a memory-efficient Transformer [64.0125322353281]
Real-world tasks such as forecasting weather, electricity consumption, and stock market involve predicting data that vary over time. Time-series data are generally recorded over a long period of observation with long sequences owing to their periodic characteristics and long-range dependencies over time. We propose two novel modules, Grouped Self-Attention (GSA) and Compressed Cross-Attention (CCA). Our proposed model exhibited reduced computational complexity and performance comparable to or better than existing methods.
arXiv Detail & Related papers (2022-10-02T06:58:49Z)
Private Synthetic Data with Hierarchical Structure [33.72123440111452]
We study the problem of differentially private synthetic data generation for hierarchical datasets in which individual data points are grouped together. In particular, to measure the similarity between the synthetic dataset and the underlying private one, we frame our objective under the problem of private query release. We introduce private synthetic data algorithms for hierarchical query release and evaluate them on hierarchical datasets.
arXiv Detail & Related papers (2022-06-13T07:22:21Z)
Hide-and-Seek Privacy Challenge [88.49671206936259]
The NeurIPS 2020 Hide-and-Seek Privacy Challenge is a novel two-tracked competition to accelerate progress in tackling both problems. In our head-to-head format, participants in the synthetic data generation track (i.e. "hiders") and the patient re-identification track (i.e. "seekers") are directly pitted against each other by way of a new, high-quality intensive care time-series dataset.
arXiv Detail & Related papers (2020-07-23T15:50:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.