Innamark: A Whitespace Replacement Information-Hiding Method
- URL: http://arxiv.org/abs/2502.12710v2
- Date: Mon, 28 Apr 2025 19:26:36 GMT
- Title: Innamark: A Whitespace Replacement Information-Hiding Method
- Authors: Malte Hellmeier, Hendrik Norkowski, Ernst-Christoph Schrewe, Haydar Qarawlus, Falk Howar
- Abstract summary: We introduce a novel method for information hiding called Innamark. Innamark can conceal any byte-encoded sequence within a sufficiently long cover text. We propose a specified structure for secret messages that enables compression, encryption, hashing, and error correction.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have gained significant popularity in recent years. Differentiating between a text written by a human and one generated by an LLM has become almost impossible. Information-hiding techniques such as digital watermarking or steganography can help by embedding information inside text in a form that is unlikely to be noticed. However, existing techniques, such as linguistic-based or format-based methods, change the semantics or cannot be applied to pure, unformatted text. In this paper, we introduce a novel method for information hiding called Innamark, which can conceal any byte-encoded sequence within a sufficiently long cover text. This method is implemented as a multi-platform library in the Kotlin programming language, accompanied by a command-line tool and a web interface. By substituting conventional whitespace characters with visually similar Unicode whitespace characters, our proposed scheme preserves the semantics of the cover text without changing the number of characters. Furthermore, we propose a specified structure for secret messages that enables configurable compression, encryption, hashing, and error correction. An experimental benchmark on a dataset of 1,000,000 Wikipedia articles compares ten algorithms. The results demonstrate the robustness of our proposed Innamark method in various applications and the imperceptibility of its watermarks to humans. We discuss the limits to the embedding capacity and robustness of the algorithm and how these could be addressed in future work.
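To make the whitespace-substitution idea concrete, here is a minimal Kotlin sketch of the technique the abstract describes. It is not the Innamark library's actual API: the four-character whitespace alphabet, the 2-bits-per-space packing, and the one-byte length header below are illustrative assumptions standing in for the paper's specified message structure (which additionally supports compression, encryption, hashing, and error correction).

```kotlin
// Minimal sketch of whitespace-substitution information hiding.
// Assumption: the cover text uses only plain U+0020 spaces and contains
// none of the other alphabet characters before embedding.

// Four visually similar Unicode whitespace characters; each occurrence
// in the stego text encodes one 2-bit symbol.
val ALPHABET = listOf(
    '\u0020', // SPACE              -> 00
    '\u00A0', // NO-BREAK SPACE     -> 01
    '\u2004', // THREE-PER-EM SPACE -> 10
    '\u2009', // THIN SPACE         -> 11
)

/** Embeds [secret] into [cover], prefixed with a 1-byte length header. */
fun embed(cover: String, secret: ByteArray): String {
    require(secret.size <= 0xFF) { "This sketch supports payloads up to 255 bytes" }
    val framed = byteArrayOf(secret.size.toByte()) + secret
    // Split each byte into four 2-bit symbols, most significant bits first.
    val symbols = framed.flatMap { b ->
        (6 downTo 0 step 2).map { shift -> (b.toInt() shr shift) and 0b11 }
    }
    require(cover.count { it == ' ' } >= symbols.size) {
        "Cover text needs at least ${symbols.size} spaces"
    }
    var i = 0
    // Replace the first symbols.size spaces; the character count is unchanged.
    return buildString {
        for (ch in cover) {
            if (ch == ' ' && i < symbols.size) append(ALPHABET[symbols[i++]])
            else append(ch)
        }
    }
}

/** Reads the length header, then decodes that many payload bytes. */
fun extract(stego: String): ByteArray {
    // Every alphabet character (including plain spaces, which encode 00)
    // contributes one 2-bit symbol, in order of appearance.
    val symbols = stego.mapNotNull { ch -> ALPHABET.indexOf(ch).takeIf { it >= 0 } }
    fun byteAt(k: Int): Int =
        (0..3).fold(0) { acc, j -> (acc shl 2) or symbols[4 * k + j] }
    return ByteArray(byteAt(0)) { i -> byteAt(i + 1).toByte() }
}

fun main() {
    // 13-word sentence -> 12 spaces, enough for 3 framed bytes (12 symbols).
    val cover = "The quick brown fox jumps over the lazy dog and then keeps running."
    val stego = embed(cover, "Hi".toByteArray())
    println(extract(stego).decodeToString()) // prints "Hi"
}
```

With four alphabet characters the capacity is two bits per space, so a cover needs four spaces per hidden byte; a larger whitespace alphabet raises capacity at the cost of conspicuousness. The paper's message structure further layers hashing and error correction on top of such a frame, so that corrupted or truncated stego text can be detected or repaired.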
Related papers
- MoCE: Adaptive Mixture of Contextualization Experts for Byte-based Neural Machine Translation [13.70446799743065]
Byte-based machine translation systems have shown significant potential in massively multilingual settings. Unicode encoding maps each character to specific byte(s), eliminating the emergence of unknown words, even in new languages. Local contextualization has proven effective in assigning initial semantics to tokens, improving sentence comprehension. We propose Mixture of Contextualization Experts (MoCE), adaptively selecting and mixing attention heads, which are treated as contextualization experts.
arXiv Detail & Related papers (2024-11-03T08:15:43Z) - Variables are a Curse in Software Vulnerability Prediction [4.453430599945387]
We introduce a new type of edge called name dependence, a type of abstract syntax graph based on the name dependence, and an efficient node representation method named the 3-property encoding scheme.
These techniques will allow us to remove the concrete variable names from code, and facilitate deep learning models to learn the functionality of software hidden in diverse code expressions.
arXiv Detail & Related papers (2024-06-18T16:02:29Z) - Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery [50.564146730579424]
We propose a Text Embedding Synthesizer (TES) to generate pseudo text embeddings for unlabelled samples.
Our method unlocks the multi-modal potentials of CLIP and outperforms the baseline methods by a large margin on all GCD benchmarks.
arXiv Detail & Related papers (2024-03-15T02:40:13Z) - Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z) - Provably Robust Multi-bit Watermarking for AI-generated Text [37.21416140194606]
Large Language Models (LLMs) have demonstrated remarkable capabilities of generating texts resembling human language.
They can be misused by criminals to create deceptive content, such as fake news and phishing emails.
Watermarking is a key technique to address these concerns, which embeds a message into a text.
arXiv Detail & Related papers (2024-01-30T08:46:48Z) - Towards Codable Watermarking for Injecting Multi-bits Information to LLMs [86.86436777626959]
Large language models (LLMs) generate texts with increasing fluency and realism.
Existing watermarking methods are encoding-inefficient and cannot flexibly meet the diverse information encoding needs.
We propose Codable Text Watermarking for LLMs (CTWL) that allows text watermarks to carry multi-bit customizable information.
arXiv Detail & Related papers (2023-07-29T14:11:15Z) - Copy Is All You Need [66.00852205068327]
We formulate text generation as progressively copying text segments from an existing text collection.
Our approach achieves better generation quality according to both automatic and human evaluations.
Our approach attains additional performance gains by simply scaling up to larger text collections.
arXiv Detail & Related papers (2023-07-13T05:03:26Z) - Watermarking Text Generated by Black-Box Language Models [103.52541557216766]
A watermark-based method was proposed for white-box LLMs, allowing them to embed watermarks by favoring tokens from a secret list during text generation.
A detection algorithm aware of that list can identify the watermarked text.
We develop a watermarking framework for black-box language model usage scenarios.
arXiv Detail & Related papers (2023-05-14T07:37:33Z) - Enhancing Indic Handwritten Text Recognition Using Global Semantic Information [36.01828106385858]
We use a semantic module in an encoder-decoder framework for extracting global semantic information to recognize the Indic handwritten texts.
The proposed framework achieves state-of-the-art results on handwritten texts of ten Indic languages.
arXiv Detail & Related papers (2022-12-15T12:53:26Z) - Autoregressive Linguistic Steganography Based on BERT and Consistency Coding [17.881686153284267]
Linguistic steganography (LS) conceals the presence of communication by embedding secret information into a text.
Recent algorithms use a language model (LM) to generate the steganographic text, which provides a higher payload compared with many previous arts.
We propose a novel autoregressive LS algorithm based on BERT and consistency coding, which achieves a better trade-off between embedding payload and system security.
arXiv Detail & Related papers (2022-03-26T02:36:55Z) - SLOGAN: Handwriting Style Synthesis for Arbitrary-Length and Out-of-Vocabulary Text [35.83345711291558]
We propose a novel method that can synthesize parameterized and controllable handwriting styles for arbitrary-length and out-of-vocabulary text.
We embed the text content by providing an easily obtainable printed style image, so that the diversity of the content can be flexibly achieved.
Our method can synthesize words that are not included in the training vocabulary and with various new styles.
arXiv Detail & Related papers (2022-02-23T12:13:27Z) - A Simple and Efficient Probabilistic Language model for Code-Mixed Text [0.0]
We present a simple probabilistic approach for building efficient word embedding for code-mixed text.
We examine its efficacy for the classification task using bidirectional LSTMs and SVMs.
arXiv Detail & Related papers (2021-06-29T05:37:57Z) - Near-imperceptible Neural Linguistic Steganography via Self-Adjusting Arithmetic Coding [88.31226340759892]
We present a new linguistic steganography method which encodes secret messages using self-adjusting arithmetic coding based on a neural language model.
Human evaluations show that 51% of generated cover texts can indeed fool eavesdroppers.
arXiv Detail & Related papers (2020-10-01T20:40:23Z) - Improving Disentangled Text Representation Learning with Information-Theoretic Guidance [99.68851329919858]
The discrete nature of natural language makes disentangling textual representations more challenging.
Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text.
Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation.
arXiv Detail & Related papers (2020-06-01T03:36:01Z)