Related papers: Watermarking Generative Tabular Data

URL: http://arxiv.org/abs/2405.14018v1
Date: Wed, 22 May 2024 21:52:12 GMT
Title: Watermarking Generative Tabular Data
Authors: Hengzhi He, Peiyu Yu, Junpeng Ren, Ying Nian Wu, Guang Cheng
Abstract summary: We show theoretically that the proposed watermark can be effectively detected, while faithfully preserving the data fidelity. We also demonstrate appealing robustness against additive noise attack.
Score: 39.31042783480766
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In this paper, we introduce a simple yet effective tabular data watermarking mechanism with statistical guarantees. We show theoretically that the proposed watermark can be effectively detected while faithfully preserving data fidelity, and that it demonstrates appealing robustness against additive noise attacks. The general idea is to achieve the watermarking through a strategic embedding based on simple data binning. Specifically, the mechanism divides the feature's value range into finely segmented intervals and embeds watermarks into selected "green list" intervals. To detect the watermarks, we develop a principled statistical hypothesis-testing framework with minimal assumptions: it remains valid as long as the underlying data distribution has a continuous density function. The watermarking efficacy is demonstrated through rigorous theoretical analysis and empirical validation, highlighting its utility in enhancing the security of synthetic and real-world datasets.
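The binning idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the pairing of adjacent bins, the function names (make_green_list, embed, z_score), the seeds, and the bin count are all assumptions chosen for illustration. The sketch assumes a single feature already scaled to [0, 1), marks one bin of each adjacent pair as green, moves values out of red bins into the neighbouring green bin, and tests for the watermark with a one-proportion z-score under the null hypothesis that about half of unwatermarked values fall in green bins.

```python
import math
import random

def make_green_list(num_pairs, seed):
    """Choose one green bin per consecutive pair of bins (assumed scheme)."""
    rng = random.Random(seed)
    # Bins are indexed 0..2*num_pairs-1; pair k consists of bins 2k and 2k+1.
    return {2 * k + rng.randint(0, 1) for k in range(num_pairs)}

def bin_index(x, num_bins):
    """Index of the equal-width bin of [0, 1) that contains x."""
    return min(int(x * num_bins), num_bins - 1)

def embed(values, green, num_bins, seed):
    """Move each value lying in a red bin into the green bin of its pair."""
    rng = random.Random(seed)
    width = 1.0 / num_bins
    marked = []
    for x in values:
        b = bin_index(x, num_bins)
        if b not in green:
            partner = b ^ 1  # the other bin of the same pair is green
            x = (partner + rng.random()) * width
        marked.append(x)
    return marked

def z_score(values, green, num_bins):
    """z-statistic for H0: each value is in a green bin with probability 1/2."""
    n = len(values)
    hits = sum(bin_index(x, num_bins) in green for x in values)
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)

# Demo on synthetic data drawn from a continuous density on (0, 1).
NUM_PAIRS = 200
NUM_BINS = 2 * NUM_PAIRS
data_rng = random.Random(0)
clean = [data_rng.betavariate(2, 5) for _ in range(2000)]

green = make_green_list(NUM_PAIRS, seed=1234)  # the seed acts as a secret key
marked = embed(clean, green, NUM_BINS, seed=99)

z_clean = z_score(clean, green, NUM_BINS)
z_marked = z_score(marked, green, NUM_BINS)
max_shift = max(abs(a - b) for a, b in zip(clean, marked))
print(f"z (clean) = {z_clean:.2f}, z (marked) = {z_marked:.2f}")
print(f"largest change to any value = {max_shift:.5f}")
```

In this sketch every value moves by less than two bin widths, which is why finer binning preserves fidelity, while the z-score of the watermarked column grows with the number of rows. A detector would reject the null hypothesis when the z-score exceeds a chosen threshold.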

Related papers

Dataset Protection via Watermarked Canaries in Retrieval-Augmented LLMs [67.0310240737424]
We introduce a novel approach to safeguard the ownership of text datasets and effectively detect unauthorized use by RA-LLMs. Our approach preserves the original data completely unchanged while protecting it by inserting specifically designed canary documents into the IP dataset. During the detection process, unauthorized usage is identified by querying the canary documents and analyzing the responses of RA-LLMs.
arXiv Detail & Related papers (2025-02-15T04:56:45Z)
Watermarking Generative Categorical Data [9.087950471621653]
Our method embeds secret signals by splitting the data distribution into two components and modifying one distribution based on a deterministic relationship with the other. To verify the watermark, we introduce an insertion inverse algorithm and detect its presence by measuring the total variation distance between the inverse-decoded data and the original distribution.
arXiv Detail & Related papers (2024-11-16T21:57:45Z)
Embedding Watermarks in Diffusion Process for Model Intellectual Property Protection [16.36712147596369]
We introduce a novel watermarking framework by embedding the watermark into the whole diffusion process. Detailed theoretical analysis and experimental validation demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2024-10-29T18:27:10Z)
Inevitable Trade-off between Watermark Strength and Speculative Sampling Efficiency for Language Models [63.450843788680196]
We show that it is impossible to simultaneously maintain the highest watermark strength and the highest sampling efficiency. We propose two methods that maintain either the sampling efficiency or the watermark strength, but not both. Our work provides a rigorous theoretical foundation for understanding the inherent trade-off between watermark strength and sampling efficiency.
arXiv Detail & Related papers (2024-10-27T12:00:19Z)
Adaptive and Robust Watermark for Generative Tabular Data [8.566821590631907]
We propose a flexible and robust watermarking mechanism for generative tabular data. We show theoretically and empirically that the watermarked datasets have negligible impact on the data quality and downstream utility.
arXiv Detail & Related papers (2024-09-23T04:37:30Z)
TabularMark: Watermarking Tabular Datasets for Machine Learning [20.978995194849297]
We propose a hypothesis testing-based watermarking scheme, TabularMark. Data noise partitioning is utilized for data perturbation during embedding. Experiments on real-world and synthetic datasets demonstrate the superiority of TabularMark in detectability, non-intrusiveness, and robustness.
arXiv Detail & Related papers (2024-06-21T02:58:45Z)
TokenMark: A Modality-Agnostic Watermark for Pre-trained Transformers [67.57928750537185]
TokenMark is a robust, modality-agnostic watermarking system for pre-trained models. It embeds the watermark by fine-tuning the pre-trained model on a set of specifically permuted data samples. It significantly improves the robustness, efficiency, and universality of model watermarking.
arXiv Detail & Related papers (2024-03-09T08:54:52Z)
Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand [96.26251471253823]
Backdoor-based dataset ownership verification (DOV) is currently the only feasible approach to protect the copyright of open-source datasets. We make watermarked models (trained on the protected dataset) correctly classify some "hard" samples that will be misclassified by the benign model.
arXiv Detail & Related papers (2023-10-09T11:23:05Z)
Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking [54.40184736491652]
We propose a backdoor-based watermarking approach that serves as a general framework for safeguarding publicly available data. By inserting a small number of watermarking samples into the dataset, our approach enables the learning model to implicitly learn a secret function set by defenders. This hidden function can then be used as a watermark to track down third-party models that use the dataset illegally.
arXiv Detail & Related papers (2023-03-20T21:54:30Z)
WSSOD: A New Pipeline for Weakly- and Semi-Supervised Object Detection [75.80075054706079]
We propose a weakly- and semi-supervised object detection framework (WSSOD). An agent detector is first trained on a joint dataset and then used to predict pseudo bounding boxes on weakly-annotated images. The proposed framework demonstrates remarkable performance on the PASCAL-VOC and MSCOCO benchmarks, achieving performance comparable to that obtained in fully-supervised settings.
arXiv Detail & Related papers (2021-05-21T11:58:50Z)
Open-sourced Dataset Protection via Backdoor Watermarking [87.15630326131901]
We propose a backdoor-embedding-based dataset watermarking method to protect an open-sourced image-classification dataset. We use a hypothesis-test-guided method for dataset verification based on the posterior probability generated by the suspicious third-party model.
arXiv Detail & Related papers (2020-10-12T16:16:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.