Adaptive and Robust Watermark for Generative Tabular Data
- URL: http://arxiv.org/abs/2409.14700v1
- Date: Mon, 23 Sep 2024 04:37:30 GMT
- Title: Adaptive and Robust Watermark for Generative Tabular Data
- Authors: Dung Daniel Ngo, Daniel Scott, Saheed Obitayo, Vamsi K. Potluru, Manuela Veloso
- Abstract summary: We propose a flexible and robust watermarking mechanism for generative tabular data.
We show theoretically and empirically that the watermarked datasets have negligible impact on the data quality and downstream utility.
- Score: 8.566821590631907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent developments in generative models have demonstrated their ability to create high-quality synthetic data. However, the pervasiveness of synthetic content online also raises growing concerns that it can be used for malicious purposes. To ensure the authenticity of the data, watermarking techniques have recently emerged as a promising solution due to their strong statistical guarantees. In this paper, we propose a flexible and robust watermarking mechanism for generative tabular data. Specifically, a data provider with knowledge of the downstream tasks can partition the feature space into pairs of $(key, value)$ columns. Within each pair, the data provider first uses elements in the $key$ column to generate a randomized set of "green" intervals, then encourages elements of the $value$ column to fall in one of these "green" intervals. We show theoretically and empirically that the watermarked datasets (i) have negligible impact on the data quality and downstream utility, (ii) can be efficiently detected, and (iii) are robust against multiple attacks commonly observed in data science.
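The abstract describes the mechanism at a high level: hash each $key$ element into a pseudo-random set of "green" intervals, nudge the paired $value$ element into one of them, and detect the watermark by testing whether values land in green intervals more often than chance. The sketch below illustrates that idea on a pandas DataFrame; it is a minimal illustration based only on the abstract, and the function names (green_bins, embed_watermark, detect_watermark), the bin count, the seeding scheme, the nudging rule, and the z-test detector are all assumptions for this sketch, not the authors' implementation.

```python
# A minimal sketch of the (key, value) green-interval idea from the abstract.
# Bin count, seeding, the nudging rule, and the z-test detector are assumptions
# made for illustration; they are not the paper's reference implementation.
import hashlib

import numpy as np
import pandas as pd


def green_bins(key_elem, n_bins=20, green_frac=0.5, seed=0):
    """Derive a pseudo-random set of 'green' bins from a key element (and a secret seed)."""
    digest = hashlib.sha256(f"{seed}:{key_elem}".encode()).digest()
    rng = np.random.default_rng(int.from_bytes(digest[:8], "little"))
    return set(rng.choice(n_bins, size=int(n_bins * green_frac), replace=False))


def embed_watermark(df, key_col, value_col, n_bins=20, green_frac=0.5, seed=0):
    """Nudge each value element into the nearest green bin of its paired key element."""
    out = df.copy()
    lo, hi = df[value_col].min(), df[value_col].max()
    width = (hi - lo) / n_bins or 1.0
    col = out.columns.get_loc(value_col)
    for i, (k, v) in enumerate(zip(df[key_col], df[value_col])):
        greens = green_bins(k, n_bins, green_frac, seed)
        b = min(int((v - lo) / width), n_bins - 1)
        if b not in greens:
            b = min(greens, key=lambda g: abs(g - b))  # closest green bin
            out.iloc[i, col] = lo + (b + 0.5) * width  # small, bounded distortion
    return out


def detect_watermark(df, key_col, value_col, n_bins=20, green_frac=0.5, seed=0):
    """One-sided z-statistic: do values land in green bins more often than chance?"""
    # In practice the binning range would be fixed and shared with the embedder;
    # re-deriving it from the data keeps this sketch self-contained.
    lo, hi = df[value_col].min(), df[value_col].max()
    width = (hi - lo) / n_bins or 1.0
    hits = sum(
        min(int((v - lo) / width), n_bins - 1) in green_bins(k, n_bins, green_frac, seed)
        for k, v in zip(df[key_col], df[value_col])
    )
    n, p = len(df), green_frac
    return (hits - n * p) / np.sqrt(n * p * (1 - p))  # large z => watermark likely present


# Tiny usage example on synthetic data.
rng = np.random.default_rng(1)
data = pd.DataFrame({"key": rng.normal(size=500), "value": rng.normal(size=500)})
wm = embed_watermark(data, "key", "value")
print("z on watermarked data:", round(detect_watermark(wm, "key", "value"), 2))
print("z on clean data:      ", round(detect_watermark(data, "key", "value"), 2))
```

Under the null hypothesis of unwatermarked data, only about a green_frac fraction of values should fall in green bins, so a large z-statistic is the kind of statistical evidence the abstract's detection guarantee refers to.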
Related papers
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks, but they do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z) - Detecting Dataset Abuse in Fine-Tuning Stable Diffusion Models for Text-to-Image Synthesis [3.8809673918404246]
We present a dataset watermarking framework designed to detect unauthorized usage and trace data leaks.
arXiv Detail & Related papers (2024-09-27T16:34:48Z) - DREW : Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW).
DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster.
After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
arXiv Detail & Related papers (2024-06-05T01:19:44Z) - Watermarking Generative Tabular Data [39.31042783480766]
We show theoretically that the proposed watermark can be effectively detected, while faithfully preserving the data fidelity.
We also demonstrate appealing robustness against additive noise attacks.
arXiv Detail & Related papers (2024-05-22T21:52:12Z) - Scaling Laws for Data Filtering -- Data Curation cannot be Compute Agnostic [99.3682210827572]
Vision-language models (VLMs) are trained for thousands of GPU hours on carefully curated web datasets.
Data curation strategies are typically developed agnostic of the available compute for training.
We introduce neural scaling laws that account for the non-homogeneous nature of web data.
arXiv Detail & Related papers (2024-04-10T17:27:54Z) - FreqyWM: Frequency Watermarking for the New Data Economy [8.51675079658644]
We present a novel technique for modulating the appearance frequency of a few tokens within a dataset to encode an invisible watermark.
We develop optimal as well as fast algorithms for creating and verifying such watermarks.
arXiv Detail & Related papers (2023-12-27T12:17:59Z) - Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand [96.26251471253823]
Backdoor-based dataset ownership verification (DOV) is currently the only feasible approach to protect the copyright of open-source datasets.
We make watermarked models (trained on the protected dataset) correctly classify some 'hard' samples that will be misclassified by the benign model.
arXiv Detail & Related papers (2023-10-09T11:23:05Z) - Benchmarking and Analyzing Generative Data for Visual Recognition [66.55174903469722]
This work delves into the impact of generative images, primarily comparing paradigms that harness external data.
We devise GenBench, a benchmark comprising 22 datasets with 2548 categories, to appraise generative data across various visual recognition tasks.
Our exhaustive benchmark and analysis spotlight generative data's promise in visual recognition, while identifying key challenges for future investigation.
arXiv Detail & Related papers (2023-07-25T17:59:59Z) - HydraGAN A Multi-head, Multi-objective Approach to Synthetic Data Generation [8.260059020010454]
We introduce HydraGAN, a new approach to synthetic data generation that incorporates multiple generator and discriminator agents into the system.
We observe that HydraGAN outperforms baseline methods on three datasets across multiple criteria: maximizing data realism, maximizing model accuracy, and minimizing re-identification risk.
arXiv Detail & Related papers (2021-11-13T02:19:11Z) - Open-sourced Dataset Protection via Backdoor Watermarking [87.15630326131901]
We propose a backdoor-embedding-based dataset watermarking method to protect an open-sourced image-classification dataset.
We use a hypothesis-test-guided method for dataset verification based on the posterior probability generated by the suspicious third-party model.
arXiv Detail & Related papers (2020-10-12T16:16:27Z) - Synthesizing Property & Casualty Ratemaking Datasets using Generative Adversarial Networks [2.2649197740853677]
We show how to design three types of generative adversarial networks (GANs) that can build a synthetic insurance dataset from a confidential original dataset.
For transparency, the approaches are illustrated using a public dataset, the French motor third party liability data.
We find that the MC-WGAN-GP synthesizes the best data, the CTGAN is the easiest to use, and the MNCDP-GAN guarantees differential privacy.
arXiv Detail & Related papers (2020-08-13T21:02:44Z)