Watermarking Generative Categorical Data
- URL: http://arxiv.org/abs/2411.10898v1
- Date: Sat, 16 Nov 2024 21:57:45 GMT
- Title: Watermarking Generative Categorical Data
- Authors: Bochao Gu, Hengzhi He, Guang Cheng
- Abstract summary: Our method embeds secret signals by splitting the data distribution into two components and modifying one distribution based on a deterministic relationship with the other.
To verify the watermark, we introduce an insertion inverse algorithm and detect its presence by measuring the total variation distance between the inverse-decoded data and the original distribution.
- Score: 9.087950471621653
- License:
- Abstract: In this paper, we propose a novel statistical framework for watermarking generative categorical data. Our method systematically embeds pre-agreed secret signals by splitting the data distribution into two components and modifying one distribution based on a deterministic relationship with the other, ensuring the watermark is embedded at the distribution-level. To verify the watermark, we introduce an insertion inverse algorithm and detect its presence by measuring the total variation distance between the inverse-decoded data and the original distribution. Unlike previous categorical watermarking methods, which primarily focus on embedding watermarks into a given dataset, our approach operates at the distribution-level, allowing for verification from a statistical distributional perspective. This makes it particularly well-suited for the modern paradigm of synthetic data generation, where the underlying data distribution, rather than specific data points, is of primary importance. The effectiveness of our method is demonstrated through both theoretical analysis and empirical validation.
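The detection step described in the abstract can be illustrated with a minimal sketch: for categorical data, the total variation distance between two probability vectors reduces to half their L1 distance, so detection amounts to comparing the empirical distribution of the inverse-decoded data against the reference distribution. The function names and the threshold below are illustrative assumptions, not the paper's actual implementation:

```python
from collections import Counter

def empirical_dist(samples, categories):
    """Empirical probability of each category in a sample."""
    counts = Counter(samples)
    n = len(samples)
    return {c: counts.get(c, 0) / n for c in categories}

def total_variation(p, q):
    """TV distance between two categorical distributions:
    half the sum of absolute probability differences."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def detect_watermark(decoded_samples, reference_dist, threshold=0.05):
    """Flag a watermark when the inverse-decoded data stays close
    to the reference distribution (small TV distance)."""
    emp = empirical_dist(decoded_samples, list(reference_dist))
    return total_variation(emp, reference_dist) <= threshold
```

For example, `detect_watermark(["a"] * 50 + ["b"] * 30 + ["c"] * 20, {"a": 0.5, "b": 0.3, "c": 0.2})` returns `True`, since the empirical distribution matches the reference exactly.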
Related papers
- Watermarking Generative Tabular Data [39.31042783480766]
We show theoretically that the proposed watermark can be effectively detected, while faithfully preserving the data fidelity.
We also demonstrate appealing robustness against additive noise attack.
arXiv Detail & Related papers (2024-05-22T21:52:12Z)
- Collaborative Heterogeneous Causal Inference Beyond Meta-analysis [68.4474531911361]
We propose a collaborative inverse propensity score estimator for causal inference with heterogeneous data.
Our method shows significant improvements over the methods based on meta-analysis when heterogeneity increases.
arXiv Detail & Related papers (2024-04-24T09:04:36Z)
- Dataset Distillation via the Wasserstein Metric [35.32856617593164]
We introduce the Wasserstein distance, a metric grounded in optimal transport theory, to enhance distribution matching in dataset distillation.
Our method achieves new state-of-the-art performance across a range of high-resolution datasets.
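For intuition on the metric this summary refers to: in one dimension, the Wasserstein-1 distance between two equal-size empirical samples has a closed form as the mean absolute difference of the sorted values. This is a sketch of the general distance, not the paper's distillation method:

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical
    samples: mean absolute difference of the sorted values."""
    assert len(xs) == len(ys), "sketch assumes equal sample sizes"
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)
```

Unlike pointwise moment matching, this distance accounts for how much probability mass must be transported, which is why it is attractive for distribution matching.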
arXiv Detail & Related papers (2023-11-30T13:15:28Z)
- Semantic Equivariant Mixup [54.734054770032934]
Mixup is a well-established data augmentation technique that extends the training distribution and regularizes neural networks.
Previous mixup variants tend to over-focus on label-related information.
We propose a semantic equivariant mixup (sem) to preserve richer semantic information in the input.
arXiv Detail & Related papers (2023-08-12T03:05:53Z)
- Restricted Generative Projection for One-Class Classification and Anomaly Detection [31.173234437065464]
We learn a mapping to transform the unknown distribution of training (normal) data into a known target distribution that is simple, compact, and informative.
Simplicity ensures that we can sample from the distribution easily.
Compactness ensures that the decision boundary between normal and abnormal data is clear.
Informativeness ensures that the transformed data preserve the important information of the original data.
arXiv Detail & Related papers (2023-07-09T04:59:10Z)
- Probabilistic Matching of Real and Generated Data Statistics in Generative Adversarial Networks [0.6906005491572401]
We propose a method to ensure that the distributions of certain generated data statistics coincide with the respective distributions of the real data.
We evaluate the method on a synthetic dataset and a real-world dataset and demonstrate improved performance of our approach.
arXiv Detail & Related papers (2023-06-19T14:03:27Z)
- Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking [54.40184736491652]
We propose a backdoor-based watermarking approach that serves as a general framework for safeguarding publicly available data.
By inserting a small number of watermarking samples into the dataset, our approach enables the learning model to implicitly learn a secret function set by defenders.
This hidden function can then be used as a watermark to track down third-party models that use the dataset illegally.
arXiv Detail & Related papers (2023-03-20T21:54:30Z)
- Self-Conditioned Generative Adversarial Networks for Image Editing [61.50205580051405]
Generative Adversarial Networks (GANs) are susceptible to bias, learned either from unbalanced data or through mode collapse.
We argue that this bias is responsible not only for fairness concerns, but also plays a key role in the collapse of latent-traversal editing methods when deviating from the distribution's core.
arXiv Detail & Related papers (2022-02-08T18:08:24Z)
- Towards an efficient framework for Data Extraction from Chart Images [27.114170963444074]
We adopt state-of-the-art computer vision techniques for the data extraction stage in a data mining system.
For building a robust point detector, a fully convolutional network with a feature fusion module is adopted.
For data conversion, we translate the detected element into data with semantic value.
arXiv Detail & Related papers (2021-05-05T13:18:53Z)
- Source-free Domain Adaptation via Distributional Alignment by Matching Batch Normalization Statistics [85.75352990739154]
We propose a novel domain adaptation method for the source-free setting.
We use batch normalization statistics stored in the pretrained model to approximate the distribution of unobserved source data.
Our method achieves competitive performance with state-of-the-art domain adaptation methods.
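One way to read the batch-normalization idea concretely: each channel's stored running mean and variance act as a Gaussian summary of the unobserved source features, so the per-channel mismatch with the target domain can be scored with a Gaussian KL divergence. This is a sketch of the general idea under that Gaussian assumption, not the authors' actual loss:

```python
import math

def bn_stat_divergence(feat_mean, feat_var, run_mean, run_var, eps=1e-5):
    """KL divergence KL(N(feat_mean, feat_var) || N(run_mean, run_var))
    as a proxy for how far target-domain feature statistics drift from
    the source statistics stored in a pretrained model's BN layers."""
    var_t, var_s = feat_var + eps, run_var + eps
    return 0.5 * (math.log(var_s / var_t)
                  + (var_t + (feat_mean - run_mean) ** 2) / var_s
                  - 1.0)
```

Minimizing such a divergence over the target features aligns them with the source distribution without ever accessing source data, which is the point of the source-free setting.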
arXiv Detail & Related papers (2021-01-19T14:22:33Z)
- Graph Embedding with Data Uncertainty [113.39838145450007]
Spectral-based subspace learning is a common data preprocessing step in many machine learning pipelines.
Most subspace learning methods do not account for possible measurement inaccuracies or artifacts that can lead to data with high uncertainty.
arXiv Detail & Related papers (2020-09-01T15:08:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.