Watermark Text Pattern Spotting in Document Images
- URL: http://arxiv.org/abs/2401.05167v2
- Date: Thu, 11 Jan 2024 10:32:49 GMT
- Title: Watermark Text Pattern Spotting in Document Images
- Authors: Mateusz Krubiński, Stefan Matcovici, Diana Grigore, Daniel Voinea, and Alin-Ionut Popa
- Abstract summary: In the wild, writing can come in various fonts, sizes and forms, making generic recognition a very difficult problem.
We propose a novel benchmark (K-Watermark) containing 65,447 data samples generated using Wrender.
A validity study using human raters yields an authenticity score of 0.51 against pre-generated watermarked documents.
- Score: 3.6298655794854464
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Watermark text spotting in document images can offer access to an often
unexplored source of information, providing crucial evidence about a record's
scope, audience and sometimes even authenticity. Stemming from the problem of
text spotting, detecting and understanding watermarks in documents inherits the
same hardships - in the wild, writing can come in various fonts, sizes and
forms, making generic recognition a very difficult problem. To address the lack
of resources in this field and propel further research, we propose a novel
benchmark (K-Watermark) containing 65,447 data samples generated using Wrender,
a watermark text patterns rendering procedure. A validity study using human
raters yields an authenticity score of 0.51 against pre-generated watermarked
documents. To prove the usefulness of the dataset and rendering technique, we
developed an end-to-end solution (Wextract) for detecting the bounding box
instances of watermark text, while predicting the depicted text. To deal with
this specific task, we introduce a variance minimization loss and a
hierarchical self-attention mechanism. To the best of our knowledge, we are the
first to propose an evaluation benchmark and a complete solution for retrieving
watermarks from documents, surpassing baselines by 5 AP points in detection and
4 points in character accuracy.
Related papers
- Efficiently Identifying Watermarked Segments in Mixed-Source Texts [35.437251393372954]
We propose two novel methods for partial watermark detection.
First, we develop a geometry cover detection framework aimed at determining whether there is a watermark segment in long text.
Second, we introduce an adaptive online learning algorithm to pinpoint the precise location of watermark segments within the text.
arXiv Detail & Related papers (2024-10-04T16:58:41Z)
- WaterSeeker: Pioneering Efficient Detection of Watermarked Segments in Large Documents [65.11018806214388]
WaterSeeker is a novel approach to efficiently detect and locate watermarked segments amid extensive natural text.
It achieves a superior balance between detection accuracy and computational efficiency.
WaterSeeker's localization ability supports the development of interpretable AI detection systems.
arXiv Detail & Related papers (2024-09-08T14:45:47Z)
- Duwak: Dual Watermarks in Large Language Models [49.00264962860555]
We propose Duwak to enhance the efficiency and quality of watermarking by embedding dual secret patterns in both the token probability distribution and the sampling scheme.
We evaluate Duwak extensively on Llama2, against four state-of-the-art watermarking techniques and combinations of them.
arXiv Detail & Related papers (2024-03-12T16:25:38Z)
- Multi-Bit Distortion-Free Watermarking for Large Language Models [4.7381853007029475]
We extend an existing zero-bit distortion-free watermarking method by embedding multiple bits of meta-information as part of the watermark.
We also develop a computationally efficient decoder that extracts the embedded information from the watermark with low bit error rate.
arXiv Detail & Related papers (2024-02-26T14:01:34Z)
- I Know You Did Not Write That! A Sampling Based Watermarking Method for Identifying Machine Generated Text [0.0]
We propose a new watermarking method to detect machine-generated texts.
Our method embeds a unique pattern within the generated text.
We show how watermarking affects textual quality and compare our proposed method with a state-of-the-art watermarking method.
arXiv Detail & Related papers (2023-11-29T20:04:57Z)
- Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy [52.765898203824975]
We introduce a semantic-aware watermarking algorithm that considers the characteristics of conditional text generation and the input context.
Experimental results demonstrate that our proposed method yields substantial improvements across various text generation models.
arXiv Detail & Related papers (2023-07-25T20:24:22Z)
- On the Reliability of Watermarks for Large Language Models [95.87476978352659]
We study the robustness of watermarked text after it is re-written by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document.
We find that watermarks remain detectable even after human and machine paraphrasing.
We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document.
arXiv Detail & Related papers (2023-06-07T17:58:48Z)
- Watermarking Text Generated by Black-Box Language Models [103.52541557216766]
A watermark-based method was proposed for white-box LLMs, allowing them to embed watermarks during text generation.
A detection algorithm aware of the list can identify the watermarked text.
We develop a watermarking framework for black-box language model usage scenarios.
arXiv Detail & Related papers (2023-05-14T07:37:33Z)
- A Watermark for Large Language Models [84.95327142027183]
We propose a watermarking framework for proprietary language models.
The watermark can be embedded with negligible impact on text quality.
It can be detected using an efficient open-source algorithm without access to the language model API or parameters.
arXiv Detail & Related papers (2023-01-24T18:52:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.