Watermarking Generative Tabular Data
- URL: http://arxiv.org/abs/2405.14018v1
- Date: Wed, 22 May 2024 21:52:12 GMT
- Title: Watermarking Generative Tabular Data
- Authors: Hengzhi He, Peiyu Yu, Junpeng Ren, Ying Nian Wu, Guang Cheng
- Abstract summary: We show theoretically that the proposed watermark can be effectively detected while faithfully preserving data fidelity.
We also demonstrate appealing robustness against additive noise attacks.
- Score: 39.31042783480766
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce a simple yet effective tabular data watermarking mechanism with statistical guarantees. We show theoretically that the proposed watermark can be effectively detected while faithfully preserving the data fidelity, and we also demonstrate appealing robustness against additive noise attacks. The general idea is to achieve the watermarking through a strategic embedding based on simple data binning. Specifically, it divides the feature's value range into finely segmented intervals and embeds watermarks into selected "green list" intervals. To detect the watermarks, we develop a principled statistical hypothesis-testing framework with minimal assumptions: it remains valid as long as the underlying data distribution has a continuous density function. The watermarking efficacy is demonstrated through rigorous theoretical analysis and empirical validation, highlighting its utility in enhancing the security of synthetic and real-world datasets.
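The green-list binning idea described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' exact algorithm: the bin count, the seeded half-green list, the snap-to-nearest-green embedding rule, and the z-test threshold are all assumptions made for the sake of the example.

```python
import math
import numpy as np

def _green_list(n_bins, seed):
    # Secret "green list": a seeded pseudorandom half of the bins
    # (assumed construction; the key here is the seed).
    rng = np.random.default_rng(seed)
    return rng.random(n_bins) < 0.5

def embed_watermark(x, value_range, n_bins=1000, seed=0):
    """Snap each value that falls in a 'red' bin to the centre of the
    nearest 'green' bin (hypothetical embedding rule)."""
    lo, hi = value_range
    edges = np.linspace(lo, hi, n_bins + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    green = _green_list(n_bins, seed)
    green_idx = np.flatnonzero(green)
    bins = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    out = np.asarray(x, dtype=float).copy()
    for i, b in enumerate(bins):
        if not green[b]:
            nearest = green_idx[np.argmin(np.abs(green_idx - b))]
            out[i] = centers[nearest]
    return out

def detect_watermark(x, value_range, n_bins=1000, seed=0, z_threshold=4.0):
    """One-sided z-test. Under H0 (no watermark, continuous density,
    fine bins) the green-bin hit rate is close to the green mass
    fraction; watermarked data concentrates in green bins."""
    lo, hi = value_range
    edges = np.linspace(lo, hi, n_bins + 1)
    green = _green_list(n_bins, seed)
    bins = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    hits = int(green[bins].sum())
    n, p0 = len(x), green.mean()
    z = (hits - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return bool(z > z_threshold), z
```

With finely segmented intervals, the snap distance stays within a few bin widths, which is why fidelity is largely preserved while the green-bin hit rate becomes statistically detectable.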
Related papers
- Watermarking Generative Categorical Data [9.087950471621653]
Our method embeds secret signals by splitting the data distribution into two components and modifying one distribution based on a deterministic relationship with the other.
To verify the watermark, we introduce an insertion inverse algorithm and detect its presence by measuring the total variation distance between the inverse-decoded data and the original distribution.
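The verification step above compares distributions via total variation distance. A generic empirical TV-distance helper for categorical samples can be sketched as below; this is a standard estimator, not that paper's full insertion-inverse algorithm.

```python
from collections import Counter

def empirical_tv_distance(sample_a, sample_b):
    """Total variation distance between the empirical distributions of
    two categorical samples: half the L1 distance between the two
    normalized frequency vectors over their joint support."""
    pa, pb = Counter(sample_a), Counter(sample_b)
    na, nb = len(sample_a), len(sample_b)
    support = set(pa) | set(pb)
    return 0.5 * sum(abs(pa[c] / na - pb[c] / nb) for c in support)
```

A small distance suggests the decoded data matches the reference distribution; a detector would compare it against a calibrated threshold.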
arXiv Detail & Related papers (2024-11-16T21:57:45Z) - Embedding Watermarks in Diffusion Process for Model Intellectual Property Protection [16.36712147596369]
We introduce a novel watermarking framework by embedding the watermark into the whole diffusion process.
Detailed theoretical analysis and experimental validation demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2024-10-29T18:27:10Z) - Inevitable Trade-off between Watermark Strength and Speculative Sampling Efficiency for Language Models [63.450843788680196]
We show that it is impossible to simultaneously maintain the highest watermark strength and the highest sampling efficiency.
We propose two methods that maintain either the sampling efficiency or the watermark strength, but not both.
Our work provides a rigorous theoretical foundation for understanding the inherent trade-off between watermark strength and sampling efficiency.
arXiv Detail & Related papers (2024-10-27T12:00:19Z) - Adaptive and Robust Watermark for Generative Tabular Data [8.566821590631907]
We propose a flexible and robust watermarking mechanism for generative tabular data.
We show theoretically and empirically that the watermarked datasets have negligible impact on the data quality and downstream utility.
arXiv Detail & Related papers (2024-09-23T04:37:30Z) - TabularMark: Watermarking Tabular Datasets for Machine Learning [20.978995194849297]
We propose a hypothesis testing-based watermarking scheme, TabularMark.
Data noise partitioning is utilized for data perturbation during embedding.
Experiments on real-world and synthetic datasets demonstrate the superiority of TabularMark in detectability, non-intrusiveness, and robustness.
arXiv Detail & Related papers (2024-06-21T02:58:45Z) - Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand [96.26251471253823]
Backdoor-based dataset ownership verification (DOV) is currently the only feasible approach to protect the copyright of open-source datasets.
We make watermarked models (trained on the protected dataset) correctly classify some 'hard' samples that will be misclassified by the benign model.
arXiv Detail & Related papers (2023-10-09T11:23:05Z) - Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking [54.40184736491652]
We propose a backdoor-based watermarking approach that serves as a general framework for safeguarding publicly available data.
By inserting a small number of watermarking samples into the dataset, our approach enables the learning model to implicitly learn a secret function set by defenders.
This hidden function can then be used as a watermark to track down third-party models that use the dataset illegally.
arXiv Detail & Related papers (2023-03-20T21:54:30Z) - WSSOD: A New Pipeline for Weakly- and Semi-Supervised Object Detection [75.80075054706079]
We propose a weakly- and semi-supervised object detection framework (WSSOD).
An agent detector is first trained on a joint dataset and then used to predict pseudo bounding boxes on weakly-annotated images.
The proposed framework demonstrates remarkable performance on the PASCAL-VOC and MSCOCO benchmarks, achieving performance comparable to that obtained in fully-supervised settings.
arXiv Detail & Related papers (2021-05-21T11:58:50Z) - Open-sourced Dataset Protection via Backdoor Watermarking [87.15630326131901]
We propose a backdoor-embedding-based dataset watermarking method to protect an open-sourced image-classification dataset.
We use a hypothesis test guided method for dataset verification based on the posterior probability generated by the suspicious third-party model.
arXiv Detail & Related papers (2020-10-12T16:16:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.