FreqyWM: Frequency Watermarking for the New Data Economy
- URL: http://arxiv.org/abs/2312.16547v1
- Date: Wed, 27 Dec 2023 12:17:59 GMT
- Title: FreqyWM: Frequency Watermarking for the New Data Economy
- Authors: Devriş İşler, Elisa Cabana, Alvaro Garcia-Recuero, Georgia Koutrika, Nikolaos Laoutaris
- Abstract summary: We present a novel technique for modulating the appearance frequency of a few tokens within a dataset for encoding an invisible watermark.
We develop optimal as well as fast algorithms for creating and verifying such watermarks.
- Score: 8.51675079658644
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present a novel technique for modulating the appearance frequency of a few tokens within a dataset for encoding an invisible watermark that can be used to protect ownership rights upon data. We develop optimal as well as fast heuristic algorithms for creating and verifying such watermarks. We also demonstrate the robustness of our technique against various attacks and derive analytical bounds for the false positive probability of erroneously detecting a watermark on a dataset that does not carry it. Our technique is applicable to both single dimensional and multidimensional datasets, is independent of token type, allows for a fine control of the introduced distortion, and can be used in a variety of use cases that involve buying and selling data in contemporary data marketplaces.
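The paper formulates embedding as an optimization problem and also gives fast heuristics. Purely as a toy illustration of frequency-based watermarking (our own simplification, not the FreqyWM algorithm), one could derive secret token pairs from a key and encode one bit per pair in the ordering of their counts:
```python
import hashlib
import hmac
from collections import Counter

SECRET_KEY = b"owner-secret"  # hypothetical owner key

def select_pairs(tokens, key, n_pairs):
    """Deterministically derive secret, disjoint token pairs from the key."""
    vocab = sorted(set(tokens))
    # Rank tokens by a keyed hash so selection is secret but reproducible.
    ranked = sorted(vocab, key=lambda t: hmac.new(key, t.encode(), hashlib.sha256).hexdigest())
    return [(ranked[2 * i], ranked[2 * i + 1]) for i in range(n_pairs)]

def embed(tokens, key, bits):
    """Nudge a few counts so that count(a) > count(b) encodes bit 1."""
    counts = Counter(tokens)
    out = list(tokens)
    for (a, b), bit in zip(select_pairs(tokens, key, len(bits)), bits):
        hi, lo = (a, b) if bit else (b, a)
        out += [hi] * max(0, counts[lo] - counts[hi] + 1)  # minimal extra copies
    return out

def verify(tokens, key, n_bits):
    counts = Counter(tokens)
    return [int(counts[a] > counts[b]) for a, b in select_pairs(tokens, key, n_bits)]

data = ["sun", "rain", "wind", "snow", "fog", "hail", "mist", "heat"] * 3
marked = embed(data, SECRET_KEY, [1, 0, 1, 1])
assert verify(marked, SECRET_KEY, 4) == [1, 0, 1, 1]
```
The actual scheme controls distortion far more carefully; this sketch only shows why a keyed frequency relation is invisible without the key yet checkable with it.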
Related papers
- Adaptive and Robust Watermark for Generative Tabular Data [8.566821590631907]
We propose a flexible and robust watermarking mechanism for generative tabular data.
We show theoretically and empirically that the watermarked datasets have negligible impact on the data quality and downstream utility.
arXiv Detail & Related papers (2024-09-23T04:37:30Z)
- TabularMark: Watermarking Tabular Datasets for Machine Learning [20.978995194849297]
We propose a hypothesis testing-based watermarking scheme, TabularMark.
Data-noise partitioning is used to perturb the data during embedding.
Experiments on real-world and synthetic datasets demonstrate the superiority of TabularMark in detectability, non-intrusiveness, and robustness.
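As a rough sketch of this style of hypothesis-testing detection (a simplified stand-in, not TabularMark itself, and assuming the verifier keeps the original column and the perturbed cell indices):
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)  # stand-in for a secret key

def embed(col, n_cells=200, unit=0.01):
    """Perturb key-selected cells with noise drawn only from the 'green'
    half of a partitioned noise domain (here: positive offsets)."""
    idx = rng.choice(len(col), n_cells, replace=False)
    out = col.copy()
    out[idx] += rng.uniform(0, unit, n_cells)
    return out, idx

def detect(col, original, idx, alpha=0.01):
    """One-proportion z-test: under the no-watermark null hypothesis an
    offset lands in the green half with probability 1/2."""
    n = len(idx)
    green = np.sum(col[idx] - original[idx] > 0)
    z = (green - 0.5 * n) / np.sqrt(0.25 * n)
    return z > norm.ppf(1 - alpha)

original = rng.normal(size=10_000)
marked, idx = embed(original)
print(detect(marked, original, idx))    # True: watermark present
print(detect(original, original, idx))  # False: unmarked data
```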
arXiv Detail & Related papers (2024-06-21T02:58:45Z)
- EnTruth: Enhancing the Traceability of Unauthorized Dataset Usage in Text-to-image Diffusion Models with Minimal and Robust Alterations [73.94175015918059]
We introduce a novel approach, EnTruth, which Enhances Traceability of unauthorized dataset usage.
By strategically incorporating template memorization, EnTruth can trigger specific behavior in unauthorized models as evidence of infringement.
Our method is the first to investigate the positive application of memorization and use it for copyright protection, which turns a curse into a blessing.
arXiv Detail & Related papers (2024-06-20T02:02:44Z)
- DREW: Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW).
DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster.
After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
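A minimal sketch of the locate-then-retrieve flow, using a repetition code as a simple stand-in for the paper's error-controlled codes and omitting how the key is actually embedded in and extracted from the data:
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64)).astype(np.float32)  # reference embeddings

# 1. Randomly cluster the reference set; each cluster gets a unique key.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

def encode(cluster_id, rep=5):
    """Repetition code: each of the 3 key bits repeated `rep` times."""
    bits = [(cluster_id >> i) & 1 for i in range(3)]
    return np.repeat(bits, rep)

def decode(noisy, rep=5):
    """Majority vote per bit group corrects up to rep // 2 flips."""
    bits = (noisy.reshape(3, rep).sum(axis=1) > rep // 2).astype(int)
    return int(sum(int(b) << i for i, b in enumerate(bits)))

# 2. The key recovered from a query is noisy; decoding still locates the cluster.
noisy_key = encode(5)
noisy_key[[0, 7]] ^= 1       # two bit errors survive the repetition code
cluster = decode(noisy_key)  # == 5

# 3. Similarity retrieval runs only inside the located cluster.
members = X[labels == cluster]
query = members[0] + 0.01 * rng.normal(size=64).astype(np.float32)
sims = members @ query / (np.linalg.norm(members, axis=1) * np.linalg.norm(query))
best_match = members[np.argmax(sims)]
```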
arXiv Detail & Related papers (2024-06-05T01:19:44Z)
- Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand [96.26251471253823]
Backdoor-based dataset ownership verification (DOV) is currently the only feasible approach to protect the copyright of open-source datasets.
We make watermarked models (trained on the protected dataset) correctly classify some 'hard' samples that will be misclassified by the benign model.
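A toy version of this style of verification (our illustration, not the paper's procedure) flags a suspect model whose accuracy on the secret hard set is statistically incompatible with an independently trained model:
```python
import numpy as np
from scipy.stats import binomtest

def verify(suspect_predict, hard_x, hard_y, benign_acc=0.10, alpha=0.01):
    """Flag the suspect if its accuracy on the secret 'hard' set is
    significantly above what independently trained models achieve."""
    correct = int(np.sum(suspect_predict(hard_x) == hard_y))
    test = binomtest(correct, len(hard_y), benign_acc, alternative="greater")
    return test.pvalue < alpha

# Toy probe: a model that gets the hard samples right looks watermarked.
hard_x, hard_y = np.zeros((100, 8)), np.ones(100, dtype=int)
print(verify(lambda x: np.ones(len(x), dtype=int), hard_x, hard_y))  # True
```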
arXiv Detail & Related papers (2023-10-09T11:23:05Z)
- Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking [54.40184736491652]
We propose a backdoor-based watermarking approach that serves as a general framework for safeguarding publicly available data.
By inserting a small number of watermarking samples into the dataset, our approach enables the learning model to implicitly learn a secret function set by defenders.
This hidden function can then be used as a watermark to track down third-party models that use the dataset illegally.
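A minimal sketch of the clean-label idea, with a hypothetical corner-patch trigger and insertion rate (not the paper's construction): patch a few images of one class without touching their labels, then probe a suspect model with triggered inputs:
```python
import numpy as np

rng = np.random.default_rng(7)

def add_trigger(img, value=1.0):
    """Stamp a small corner patch onto an image."""
    img = img.copy()
    img[-3:, -3:] = value
    return img

def watermark_dataset(images, labels, target_class, rate=0.01):
    """Patch a few images of one class; labels stay unchanged (clean-label)."""
    idx = rng.choice(np.flatnonzero(labels == target_class),
                     max(1, int(rate * len(images))), replace=False)
    out = images.copy()
    for i in idx:
        out[i] = add_trigger(out[i])
    return out, labels

def verify(model_predict, probe_images, target_class, threshold=0.8):
    """A model trained on the released data should steer triggered probes
    toward the target class far more often than chance."""
    preds = model_predict(np.stack([add_trigger(x) for x in probe_images]))
    return float(np.mean(preds == target_class)) >= threshold

images = rng.normal(size=(500, 32, 32))
labels = rng.integers(0, 10, size=500)
released, released_labels = watermark_dataset(images, labels, target_class=3)
```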
arXiv Detail & Related papers (2023-03-20T21:54:30Z)
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters.
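Detection in this scheme reduces to counting "green-list" tokens and computing a z-score; a minimal sketch, with a toy hash standing in for the paper's pseudorandom vocabulary split:
```python
import hashlib
import math

GAMMA = 0.5  # fraction of the vocabulary placed on the green list

def is_green(prev_token: str, token: str, key: str = "secret") -> bool:
    """Pseudorandomly split the vocabulary per position, seeded by the
    previous token and a key (toy hash, not the paper's PRF)."""
    h = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return h[0] / 255.0 < GAMMA

def z_score(tokens):
    """Count green tokens; under the no-watermark null each token is green
    with probability GAMMA, so the count is approximately binomial."""
    t = len(tokens) - 1
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (greens - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))

# A watermarking sampler boosts green-token logits at generation time;
# detection needs only the key and the text, not the model:
print(z_score("the cat sat on the mat and then it ran away".split()))
```
Unwatermarked text should score near zero, while watermarked generations push the statistic far into the positive tail.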
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
- On the Effectiveness of Dataset Watermarking in Adversarial Settings [14.095584034871658]
We investigate a proposed data provenance method, radioactive data, to assess if it can be used to demonstrate ownership of (image) datasets used to train machine learning (ML) models.
We show that radioactive data can effectively survive model extraction attacks, suggesting it could also support ML model ownership verification that is robust to model extraction.
arXiv Detail & Related papers (2022-02-25T05:51:53Z)
- Anonymizing Sensor Data on the Edge: A Representation Learning and Transformation Approach [4.920145245773581]
In this paper, we aim to examine the tradeoff between utility and privacy loss by learning low-dimensional representations that are useful for data obfuscation.
We propose deterministic and probabilistic transformations in the latent space of a variational autoencoder to synthesize time series data.
We show that it can anonymize data in real time on resource-constrained edge devices.
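A minimal sketch of the encode-perturb-decode flow, with a stand-in linear encoder/decoder and a hypothetical split into private versus task-relevant latent dimensions (the paper learns these representations with a variational autoencoder):
```python
import numpy as np

rng = np.random.default_rng(0)

def anonymize(window, encode, decode, private_dims, noise=1.0):
    """Encode a sensor window, perturb only the latent dimensions assumed
    to carry private attributes, and decode the obfuscated window."""
    z = encode(window).copy()
    # Probabilistic transformation: resample the private coordinates.
    z[private_dims] += rng.normal(0, noise, size=len(private_dims))
    return decode(z)

# Stand-in linear encoder/decoder so the sketch runs end to end
# (a trained variational autoencoder would replace these):
W = rng.normal(size=(16, 64)) / 8.0
encode = lambda x: W @ x
decode = lambda z: W.T @ z

window = rng.normal(size=64)  # one window of sensor readings
safe = anonymize(window, encode, decode, private_dims=[0, 1, 2])
```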
arXiv Detail & Related papers (2020-11-16T22:32:30Z)
- Open-sourced Dataset Protection via Backdoor Watermarking [87.15630326131901]
We propose a backdoor-embedding-based dataset watermarking method to protect an open-sourced image-classification dataset.
We use a hypothesis test guided method for dataset verification based on the posterior probability generated by the suspicious third-party model.
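A sketch of such a posterior-based test (our simplification: a paired t-test with a hypothetical trigger-stamping function, not the paper's exact statistic):
```python
import numpy as np
from scipy.stats import ttest_rel

def dataset_was_used(posterior, images, stamp, target_class, alpha=0.01):
    """Paired t-test on the suspicious model's posteriors: if it trained on
    the watermarked data, stamping the trigger should raise the posterior
    probability of the target class on the very same images."""
    p_plain = np.array([posterior(x)[target_class] for x in images])
    p_stamped = np.array([posterior(stamp(x))[target_class] for x in images])
    _, pval = ttest_rel(p_stamped, p_plain, alternative="greater")
    return pval < alpha

# Toy suspicious model whose target-class posterior rises under the trigger:
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
posterior = lambda x: np.array([1 - sigmoid(4 * (x.mean() - 0.5)),
                                sigmoid(4 * (x.mean() - 0.5))])
stamp = lambda x: x + 1.0
probes = [np.full(8, 0.02 * i) for i in range(30)]
print(dataset_was_used(posterior, probes, stamp, target_class=1))  # True
```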
arXiv Detail & Related papers (2020-10-12T16:16:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.