FreqyWM: Frequency Watermarking for the New Data Economy
- URL: http://arxiv.org/abs/2312.16547v1
- Date: Wed, 27 Dec 2023 12:17:59 GMT
- Title: FreqyWM: Frequency Watermarking for the New Data Economy
- Authors: Devriş İşler, Elisa Cabana, Alvaro Garcia-Recuero, Georgia Koutrika, Nikolaos Laoutaris
- Abstract summary: We present a novel technique for modulating the appearance frequency of a few tokens within a dataset for encoding an invisible watermark.
We develop optimal as well as fast algorithms for creating and verifying such watermarks.
- Score: 8.51675079658644
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We present a novel technique for modulating the appearance frequency of a few tokens within a dataset for encoding an invisible watermark that can be used to protect ownership rights upon data. We develop optimal as well as fast heuristic algorithms for creating and verifying such watermarks. We also demonstrate the robustness of our technique against various attacks and derive analytical bounds for the false positive probability of erroneously detecting a watermark on a dataset that does not carry it. Our technique is applicable to both single dimensional and multidimensional datasets, is independent of token type, allows for a fine control of the introduced distortion, and can be used in a variety of use cases that involve buying and selling data in contemporary data marketplaces.
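The paper formulates embedding as an optimization problem and also gives fast heuristics. Purely as a toy illustration of frequency-based watermarking (our own simplification, not the FreqyWM algorithm), one could derive secret token pairs from a key and encode one bit per pair in the ordering of their counts:
```python
import hashlib
import hmac
from collections import Counter

SECRET_KEY = b"owner-secret"  # hypothetical owner key

def select_pairs(tokens, key, n_pairs):
    """Deterministically derive secret, disjoint token pairs from the key."""
    vocab = sorted(set(tokens))
    # Rank tokens by a keyed hash so selection is secret but reproducible.
    ranked = sorted(vocab, key=lambda t: hmac.new(key, t.encode(), hashlib.sha256).hexdigest())
    return [(ranked[2 * i], ranked[2 * i + 1]) for i in range(n_pairs)]

def embed(tokens, key, bits):
    """Nudge a few counts so that count(a) > count(b) encodes bit 1."""
    counts = Counter(tokens)
    out = list(tokens)
    for (a, b), bit in zip(select_pairs(tokens, key, len(bits)), bits):
        hi, lo = (a, b) if bit else (b, a)
        out += [hi] * max(0, counts[lo] - counts[hi] + 1)  # minimal extra copies
    return out

def verify(tokens, key, n_bits):
    counts = Counter(tokens)
    return [int(counts[a] > counts[b]) for a, b in select_pairs(tokens, key, n_bits)]

data = ["sun", "rain", "wind", "snow", "fog", "hail", "mist", "heat"] * 3
marked = embed(data, SECRET_KEY, [1, 0, 1, 1])
assert verify(marked, SECRET_KEY, 4) == [1, 0, 1, 1]
```
The actual scheme controls distortion far more carefully; this sketch only shows why a keyed frequency relation is invisible without the key yet checkable with it.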
Related papers
- Adaptive and Robust Watermark for Generative Tabular Data [8.566821590631907]
We propose a flexible and robust watermarking mechanism for generative tabular data.
We show theoretically and empirically that the watermarked datasets have negligible impact on the data quality and downstream utility.
arXiv Detail & Related papers (2024-09-23T04:37:30Z)
- TabularMark: Watermarking Tabular Datasets for Machine Learning [20.978995194849297]
We propose a hypothesis testing-based watermarking scheme, TabularMark.
Data-noise partitioning is used to perturb the data during embedding.
Experiments on real-world and synthetic datasets demonstrate the superiority of TabularMark in detectability, non-intrusiveness, and robustness.
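As a rough sketch of this style of hypothesis-testing detection (a simplified stand-in, not TabularMark itself, and assuming the verifier keeps the original column and the perturbed cell indices):
```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)  # stand-in for a secret key

def embed(col, n_cells=200, unit=0.01):
    """Perturb key-selected cells with noise drawn only from the 'green'
    half of a partitioned noise domain (here: positive offsets)."""
    idx = rng.choice(len(col), n_cells, replace=False)
    out = col.copy()
    out[idx] += rng.uniform(0, unit, n_cells)
    return out, idx

def detect(col, original, idx, alpha=0.01):
    """One-proportion z-test: under the no-watermark null hypothesis an
    offset lands in the green half with probability 1/2."""
    n = len(idx)
    green = np.sum(col[idx] - original[idx] > 0)
    z = (green - 0.5 * n) / np.sqrt(0.25 * n)
    return z > norm.ppf(1 - alpha)

original = rng.normal(size=10_000)
marked, idx = embed(original)
print(detect(marked, original, idx))    # True: watermark present
print(detect(original, original, idx))  # False: unmarked data
```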
arXiv Detail & Related papers (2024-06-21T02:58:45Z)
- EnTruth: Enhancing the Traceability of Unauthorized Dataset Usage in Text-to-image Diffusion Models with Minimal and Robust Alterations [73.94175015918059]
We introduce a novel approach, EnTruth, which Enhances Traceability of unauthorized dataset usage.
By strategically incorporating template memorization, EnTruth can trigger specific behavior in unauthorized models as evidence of infringement.
Our method is the first to investigate the positive application of memorization and use it for copyright protection, which turns a curse into a blessing.
arXiv Detail & Related papers (2024-06-20T02:02:44Z)
- DREW: Towards Robust Data Provenance by Leveraging Error-Controlled Watermarking [58.37644304554906]
We propose Data Retrieval with Error-corrected codes and Watermarking (DREW).
DREW randomly clusters the reference dataset and injects unique error-controlled watermark keys into each cluster.
After locating the relevant cluster, embedding vector similarity retrieval is performed within the cluster to find the most accurate matches.
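A minimal sketch of the locate-then-retrieve flow, using a repetition code as a simple stand-in for the paper's error-controlled codes and omitting how the key is actually embedded in and extracted from the data:
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64)).astype(np.float32)  # reference embeddings

# 1. Randomly cluster the reference set; each cluster gets a unique key.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

def encode(cluster_id, rep=5):
    """Repetition code: each of the 3 key bits repeated `rep` times."""
    bits = [(cluster_id >> i) & 1 for i in range(3)]
    return np.repeat(bits, rep)

def decode(noisy, rep=5):
    """Majority vote per bit group corrects up to rep // 2 flips."""
    bits = (noisy.reshape(3, rep).sum(axis=1) > rep // 2).astype(int)
    return int(sum(int(b) << i for i, b in enumerate(bits)))

# 2. The key recovered from a query is noisy; decoding still locates the cluster.
noisy_key = encode(5)
noisy_key[[0, 7]] ^= 1       # two bit errors survive the repetition code
cluster = decode(noisy_key)  # == 5

# 3. Similarity retrieval runs only inside the located cluster.
members = X[labels == cluster]
query = members[0] + 0.01 * rng.normal(size=64).astype(np.float32)
sims = members @ query / (np.linalg.norm(members, axis=1) * np.linalg.norm(query))
best_match = members[np.argmax(sims)]
```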
arXiv Detail & Related papers (2024-06-05T01:19:44Z)
- Domain Watermark: Effective and Harmless Dataset Copyright Protection is Closed at Hand [96.26251471253823]
Backdoor-based dataset ownership verification (DOV) is currently the only feasible approach to protect the copyright of open-source datasets.
We make watermarked models (trained on the protected dataset) correctly classify some 'hard' samples that will be misclassified by the benign model.
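A toy version of this style of verification (our illustration, not the paper's procedure) flags a suspect model whose accuracy on the secret hard set is statistically incompatible with an independently trained model:
```python
import numpy as np
from scipy.stats import binomtest

def verify(suspect_predict, hard_x, hard_y, benign_acc=0.10, alpha=0.01):
    """Flag the suspect if its accuracy on the secret 'hard' set is
    significantly above what independently trained models achieve."""
    correct = int(np.sum(suspect_predict(hard_x) == hard_y))
    test = binomtest(correct, len(hard_y), benign_acc, alternative="greater")
    return test.pvalue < alpha

# Toy probe: a model that gets the hard samples right looks watermarked.
hard_x, hard_y = np.zeros((100, 8)), np.ones(100, dtype=int)
print(verify(lambda x: np.ones(len(x), dtype=int), hard_x, hard_y))  # True
```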
arXiv Detail & Related papers (2023-10-09T11:23:05Z)
- Did You Train on My Dataset? Towards Public Dataset Protection with Clean-Label Backdoor Watermarking [54.40184736491652]
We propose a backdoor-based watermarking approach that serves as a general framework for safeguarding publicly available data.
By inserting a small number of watermarking samples into the dataset, our approach enables the learning model to implicitly learn a secret function set by defenders.
This hidden function can then be used as a watermark to track down third-party models that use the dataset illegally.
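A minimal sketch of the clean-label idea, with a hypothetical corner-patch trigger and insertion rate (not the paper's construction): patch a few images of one class without touching their labels, then probe a suspect model with triggered inputs:
```python
import numpy as np

rng = np.random.default_rng(7)

def add_trigger(img, value=1.0):
    """Stamp a small corner patch onto an image."""
    img = img.copy()
    img[-3:, -3:] = value
    return img

def watermark_dataset(images, labels, target_class, rate=0.01):
    """Patch a few images of one class; labels stay unchanged (clean-label)."""
    idx = rng.choice(np.flatnonzero(labels == target_class),
                     max(1, int(rate * len(images))), replace=False)
    out = images.copy()
    for i in idx:
        out[i] = add_trigger(out[i])
    return out, labels

def verify(model_predict, probe_images, target_class, threshold=0.8):
    """A model trained on the released data should steer triggered probes
    toward the target class far more often than chance."""
    preds = model_predict(np.stack([add_trigger(x) for x in probe_images]))
    return float(np.mean(preds == target_class)) >= threshold

images = rng.normal(size=(500, 32, 32))
labels = rng.integers(0, 10, size=500)
released, released_labels = watermark_dataset(images, labels, target_class=3)
```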
arXiv Detail & Related papers (2023-03-20T21:54:30Z)
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters.
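Detection in this scheme reduces to counting "green-list" tokens and computing a z-score; a minimal sketch, with a toy hash standing in for the paper's pseudorandom vocabulary split:
```python
import hashlib
import math

GAMMA = 0.5  # fraction of the vocabulary placed on the green list

def is_green(prev_token: str, token: str, key: str = "secret") -> bool:
    """Pseudorandomly split the vocabulary per position, seeded by the
    previous token and a key (toy hash, not the paper's PRF)."""
    h = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return h[0] / 255.0 < GAMMA

def z_score(tokens):
    """Count green tokens; under the no-watermark null each token is green
    with probability GAMMA, so the count is approximately binomial."""
    t = len(tokens) - 1
    greens = sum(is_green(a, b) for a, b in zip(tokens, tokens[1:]))
    return (greens - GAMMA * t) / math.sqrt(t * GAMMA * (1 - GAMMA))

# A watermarking sampler boosts green-token logits at generation time;
# detection needs only the key and the text, not the model:
print(z_score("the cat sat on the mat and then it ran away".split()))
```
Unwatermarked text should score near zero, while watermarked generations push the statistic far into the positive tail.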
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
- On the Effectiveness of Dataset Watermarking in Adversarial Settings [14.095584034871658]
We investigate a proposed data provenance method, radioactive data, to assess if it can be used to demonstrate ownership of (image) datasets used to train machine learning (ML) models.
We show that radioactive data can effectively survive model extraction attacks, suggesting it could also support ML model ownership verification that is robust to model extraction.
arXiv Detail & Related papers (2022-02-25T05:51:53Z)
- Anonymizing Sensor Data on the Edge: A Representation Learning and Transformation Approach [4.920145245773581]
In this paper, we aim to examine the tradeoff between utility and privacy loss by learning low-dimensional representations that are useful for data obfuscation.
We propose deterministic and probabilistic transformations in the latent space of a variational autoencoder to synthesize time series data.
We show that it can anonymize data in real time on resource-constrained edge devices.
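A minimal sketch of the encode-perturb-decode flow, with a stand-in linear encoder/decoder and a hypothetical split into private versus task-relevant latent dimensions (the paper learns these representations with a variational autoencoder):
```python
import numpy as np

rng = np.random.default_rng(0)

def anonymize(window, encode, decode, private_dims, noise=1.0):
    """Encode a sensor window, perturb only the latent dimensions assumed
    to carry private attributes, and decode the obfuscated window."""
    z = encode(window).copy()
    # Probabilistic transformation: resample the private coordinates.
    z[private_dims] += rng.normal(0, noise, size=len(private_dims))
    return decode(z)

# Stand-in linear encoder/decoder so the sketch runs end to end
# (a trained variational autoencoder would replace these):
W = rng.normal(size=(16, 64)) / 8.0
encode = lambda x: W @ x
decode = lambda z: W.T @ z

window = rng.normal(size=64)  # one window of sensor readings
safe = anonymize(window, encode, decode, private_dims=[0, 1, 2])
```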
arXiv Detail & Related papers (2020-11-16T22:32:30Z)
- Open-sourced Dataset Protection via Backdoor Watermarking [87.15630326131901]
We propose a backdoor-embedding-based dataset watermarking method to protect an open-sourced image-classification dataset.
We use a hypothesis test guided method for dataset verification based on the posterior probability generated by the suspicious third-party model.
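A sketch of such a posterior-based test (our simplification: a paired t-test with a hypothetical trigger-stamping function, not the paper's exact statistic):
```python
import numpy as np
from scipy.stats import ttest_rel

def dataset_was_used(posterior, images, stamp, target_class, alpha=0.01):
    """Paired t-test on the suspicious model's posteriors: if it trained on
    the watermarked data, stamping the trigger should raise the posterior
    probability of the target class on the very same images."""
    p_plain = np.array([posterior(x)[target_class] for x in images])
    p_stamped = np.array([posterior(stamp(x))[target_class] for x in images])
    _, pval = ttest_rel(p_stamped, p_plain, alternative="greater")
    return pval < alpha

# Toy suspicious model whose target-class posterior rises under the trigger:
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
posterior = lambda x: np.array([1 - sigmoid(4 * (x.mean() - 0.5)),
                                sigmoid(4 * (x.mean() - 0.5))])
stamp = lambda x: x + 1.0
probes = [np.full(8, 0.02 * i) for i in range(30)]
print(dataset_was_used(posterior, probes, stamp, target_class=1))  # True
```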
arXiv Detail & Related papers (2020-10-12T16:16:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.