A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model
- URL: http://arxiv.org/abs/2601.07291v1
- Date: Mon, 12 Jan 2026 07:55:13 GMT
- Title: A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model
- Authors: Qi Zheng, Shuliang Liu, Yu Huang, Sihang Jia, Jungang Li, Lyuhao Chen, Junhao Chen, Hanqian Li, Aiwei Liu, Yibo Yan, Xuming Hu
- Abstract summary: VIsual Semantic Adaptive Watermark (VISA-Mark) is a novel framework that embeds detectable signals while strictly preserving visual fidelity. Our approach employs a lightweight, efficiently trained prefix-tuner to extract dynamic Visual-Evidence Weights. Empirical results confirm that VISA-Mark outperforms conventional methods with a 7.8% improvement in visual consistency.
- Score: 48.79816664229285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases, while some semantic-aware methods incur prohibitive inference latency due to rejection sampling. In this paper, we propose the VIsual Semantic Adaptive Watermark (VISA-Mark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. Our approach employs a lightweight, efficiently trained prefix-tuner to extract dynamic Visual-Evidence Weights, which quantify the evidentiary support for candidate tokens based on the visual input. These weights guide an adaptive vocabulary partitioning and logits perturbation mechanism, concentrating watermark strength specifically on visually-supported tokens. By actively aligning the watermark with visual evidence, VISA-Mark effectively maintains visual fidelity. Empirical results confirm that VISA-Mark outperforms conventional methods with a 7.8% improvement in visual consistency (CHAIR-I) and superior semantic fidelity. The framework maintains highly competitive detection accuracy (96.88% AUC) and robust attack resilience (99.3%) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multimodal watermarking.
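The abstract describes a decoding-time mechanism: a keyed partition of the vocabulary plus a logits perturbation whose strength follows per-token Visual-Evidence Weights. Below is a minimal Python sketch of that idea; the green-list construction, the `gamma`/`delta` values, and the stubbed evidence weights are assumptions on my part, since the paper's prefix-tuner and exact partitioning rule are not reproduced here.

```python
import hashlib
import numpy as np

def keyed_green_mask(context_ids, key, vocab_size, gamma=0.5):
    """Pseudo-randomly split the vocabulary into green/red lists from the
    decoding context and a secret key (standard green-list construction)."""
    material = f"{key}:{tuple(context_ids)}".encode()
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    mask = np.zeros(vocab_size, dtype=bool)
    mask[rng.permutation(vocab_size)[: int(gamma * vocab_size)]] = True
    return mask

def visa_style_perturb(logits, evidence_w, context_ids, key, delta=2.0):
    """Scale the green-list bias per token by its Visual-Evidence Weight,
    concentrating watermark strength on visually-supported tokens."""
    green = keyed_green_mask(context_ids, key, len(logits))
    return logits + delta * evidence_w * green

# Toy usage with a 10-token vocabulary; a real prefix-tuner would produce
# `evidence_w`, here it is just random numbers in [0, 1].
rng = np.random.default_rng(0)
logits = rng.normal(size=10)
evidence_w = rng.random(10)
watermarked = visa_style_perturb(logits, evidence_w, [3, 7, 2], key="secret")
```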
Related papers
- AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models [28.393476667026523]
Vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding. We propose Attention-Guided Dynamic Watermarking (AGMark). AGMark embeds detectable signals while strictly preserving visual fidelity.
arXiv Detail & Related papers (2026-02-10T10:02:29Z)
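One guess at what "attention-guided" could mean in code: modulate the watermark bias at each decoding step by how much cross-attention mass falls on the image tokens, so strongly grounded steps are perturbed less. The direction of the modulation and `attention_guided_delta` are my assumptions, not AGMark's published rule.

```python
def attention_guided_delta(attn_to_image: float, base_delta: float = 2.0) -> float:
    """Shrink the green-list bias when the current step attends heavily to
    the image (hypothetical reading of attention-guided watermarking)."""
    return base_delta * (1.0 - min(max(attn_to_image, 0.0), 1.0))

# e.g. a step with 80% of its attention on image tokens gets a weak bias:
print(attention_guided_delta(0.8))  # 0.4
```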
- X-Mark: Saliency-Guided Robust Dataset Ownership Verification for Medical Imaging [67.85884025186755]
High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises serious copyright and ethical concerns. Medical imaging presents a unique challenge for existing dataset ownership verification methods designed for natural images. We propose X-Mark, a sample-specific clean-label watermarking method for chest X-ray copyright protection.
arXiv Detail & Related papers (2026-02-10T00:03:43Z)
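Only the title hints at the mechanism (saliency-guided, clean-label). As a heavily hedged illustration, the sketch below confines a keyed, sample-specific perturbation to low-saliency pixels so labels and diagnostic content stay untouched; the quantile threshold, `eps` budget, and seeding are invented for the example, not taken from X-Mark.

```python
import hashlib
import numpy as np

def embed_clean_label_mark(image, saliency, key, eps=2.0 / 255):
    """Add a keyed +/-eps pattern only where saliency is low; the label is
    never changed, hence 'clean-label' (illustrative, not X-Mark itself)."""
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    pattern = rng.choice([-eps, eps], size=image.shape)
    low_saliency = saliency < np.quantile(saliency, 0.5)
    return np.clip(image + pattern * low_saliency, 0.0, 1.0)

# toy 8x8 grayscale image and a random stand-in saliency map
img = np.random.default_rng(1).random((8, 8))
sal = np.random.default_rng(2).random((8, 8))
marked = embed_clean_label_mark(img, sal, key="owner-key")
```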
- Distilling the Thought, Watermarking the Answer: A Principle Semantic Guided Watermark for Large Reasoning Models [46.12198035083885]
This paper introduces ReasonMark, a novel watermarking framework specifically designed for reasoning-intensive LLMs. Our approach decouples generation into an undisturbed Thinking Phase and a watermarked Answering Phase. Experiments show ReasonMark surpasses state-of-the-art methods by reducing text perplexity by 0.35, increasing translation BLEU score by 0.164, and raising mathematical accuracy by 0.67 points.
arXiv Detail & Related papers (2026-01-08T17:32:22Z)
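The phase decoupling reduces to a one-line check at decoding time: apply the watermark bias only once the end-of-thinking delimiter has been emitted. A minimal sketch, assuming a `</think>`-style delimiter token id; ReasonMark's semantic guidance within the answer phase is not modeled here.

```python
import numpy as np

def phase_aware_logits(logits, generated_ids, green_bias, think_end_id):
    """Leave the Thinking Phase undisturbed; watermark only tokens emitted
    after the end-of-thinking delimiter (the Answering Phase)."""
    if think_end_id in generated_ids:   # answering phase: bias applies
        return logits + green_bias
    return logits                       # thinking phase: untouched

# toy step: delimiter (id 99) already generated, so the bias applies
logits = np.zeros(10)
bias = np.where(np.arange(10) % 2 == 0, 2.0, 0.0)  # stand-in green bias
print(phase_aware_logits(logits, [5, 99, 3], bias, think_end_id=99))
```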
- An Ensemble Framework for Unbiased Language Model Watermarking [60.99969104552168]
We propose ENS, a novel ensemble framework that enhances the detectability and robustness of unbiased watermarks. ENS sequentially composes multiple independent watermark instances, each governed by a distinct key, to amplify the watermark signal. Empirical evaluations show that ENS substantially reduces the number of tokens needed for reliable detection and increases resistance to smoothing and paraphrasing attacks.
arXiv Detail & Related papers (2025-09-28T19:37:44Z)
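Sequential composition is easy to picture: run k independently keyed watermark steps on the same logits so each key leaves its own detectable trace. The sketch below composes simple green-list biases as a stand-in; ENS actually composes unbiased reweighting rules, which this simplification does not capture.

```python
import hashlib
import numpy as np

def keyed_bias(context_ids, key, vocab_size, delta=1.0, gamma=0.5):
    """One watermark instance: a keyed green-list bias over the vocabulary."""
    material = f"{key}:{tuple(context_ids)}".encode()
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    bias = np.zeros(vocab_size)
    bias[rng.permutation(vocab_size)[: int(gamma * vocab_size)]] = delta
    return bias

def ensemble_watermark(logits, context_ids, keys):
    """Sequentially compose one instance per key so their signals accumulate."""
    for key in keys:
        logits = logits + keyed_bias(context_ids, key, len(logits))
    return logits

stacked = ensemble_watermark(np.zeros(16), [1, 4, 2], ["k1", "k2", "k3"])
```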
- Diversity Boosts AI-Generated Text Detection [51.56484100374058]
DivEye is a novel framework that captures how unpredictability fluctuates across a text using surprisal-based features. Our method outperforms existing zero-shot detectors by up to 33.2% and achieves competitive performance with fine-tuned baselines.
arXiv Detail & Related papers (2025-09-23T10:21:22Z)
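The core signal is how token-level surprisal fluctuates across a text. Given per-token log-probabilities from any scoring model, features like the ones below could feed a downstream classifier; the specific statistics are my choices, not DivEye's exact feature set.

```python
import numpy as np

def surprisal_fluctuation_features(token_logprobs):
    """Summarize how unpredictability varies across a text: mean/spread of
    per-token surprisal plus step-to-step fluctuation statistics."""
    s = -np.asarray(token_logprobs, dtype=float)   # surprisal per token
    ds = np.diff(s)                                # local fluctuation
    return {
        "mean_surprisal": float(s.mean()),
        "std_surprisal": float(s.std()),
        "mean_abs_step": float(np.abs(ds).mean()),
        "lag1_autocorr": float(np.corrcoef(s[:-1], s[1:])[0, 1]),
    }

# human text tends to fluctuate more than typical model samples
print(surprisal_fluctuation_features([-1.2, -0.3, -4.1, -0.8, -2.6]))
```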
- StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models [55.05404953041403]
We propose a novel framework that seamlessly integrates a binary watermark into the diffusion generation process. We show that StableGuard consistently outperforms state-of-the-art methods in image fidelity, watermark verification, and tampering localization.
arXiv Detail & Related papers (2025-09-22T16:35:19Z)
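The summary only says a binary watermark is woven into generation. Purely as an illustration of embedding bits into a latent, the sketch below spreads a message over the initial diffusion latent with keyed carrier patterns; this is generic spread-spectrum embedding, not StableGuard's actual design.

```python
import hashlib
import numpy as np

def spread_bits_into_latent(latent, bits, key, strength=0.05):
    """Add one keyed carrier per message bit, signed by the bit value; a
    keyed decoder could later correlate against the same carriers."""
    seed = int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    carriers = rng.standard_normal((len(bits),) + latent.shape)
    signs = np.where(np.asarray(bits) > 0, 1.0, -1.0)
    return latent + strength * np.tensordot(signs, carriers, axes=1)

z = np.random.default_rng(0).standard_normal((4, 8, 8))   # toy latent
z_marked = spread_bits_into_latent(z, [1, 0, 1, 1], key="secret")
```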
- Semantic Watermarking Reinvented: Enhancing Robustness and Generation Quality with Fourier Integrity [31.666430190864947]
We propose a novel embedding method called Hermitian Symmetric Fourier Watermarking (SFW). SFW maintains frequency integrity by enforcing Hermitian symmetry. We introduce a center-aware embedding strategy that reduces the vulnerability of semantic watermarking to cropping attacks.
arXiv Detail & Related papers (2025-09-09T12:15:16Z)
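The Hermitian constraint is concrete: a spectrum satisfying F[k] = conj(F[-k mod N]) inverts to an exactly real signal, so a watermark added in the Fourier domain should be projected back onto that subspace. A minimal numpy sketch; the paper's keyed pattern and center-aware ring placement are not reproduced.

```python
import numpy as np

def hermitian_project(F):
    """Project a 2-D spectrum onto the Hermitian subspace
    F[k] = conj(F[-k mod N]), so that ifft2 returns a real signal."""
    mirror = np.conj(np.roll(np.flip(F), 1, axis=(0, 1)))
    return 0.5 * (F + mirror)

def embed_fourier_mark(signal, mark_spectrum, alpha=0.1):
    """Add a pattern in the frequency domain, then restore Hermitian
    symmetry before inverting (SFW's core integrity constraint)."""
    F = np.fft.fft2(signal) + alpha * mark_spectrum
    return np.fft.ifft2(hermitian_project(F)).real

x = np.random.default_rng(0).random((16, 16))
mark = np.random.default_rng(1).standard_normal((16, 16)) * 1j
x_marked = embed_fourier_mark(x, mark)
```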
- VLA-Mark: A cross modal watermark for large vision-language alignment model [44.59029116115437]
VLA-Mark is a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection.
arXiv Detail & Related papers (2025-07-18T16:44:41Z)
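One plausible shape for the multiscale metric: score each candidate token by its best patch affinity (local), its similarity to a pooled image embedding (global), and its contextual attention mass, then combine linearly. The weights and the cosine choice here are placeholders; the paper's exact metrics differ.

```python
import numpy as np

def _cos(a, b):
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def multiscale_alignment(cand_embs, patch_embs, global_emb, attn_mass,
                         w=(0.4, 0.4, 0.2)):
    """Per-candidate alignment = weighted local + global + attention terms;
    high-scoring tokens would be favored when placing the watermark."""
    local_aff = _cos(cand_embs, patch_embs).max(axis=1)        # (V,)
    global_coh = _cos(cand_embs, global_emb[None, :])[:, 0]    # (V,)
    return w[0] * local_aff + w[1] * global_coh + w[2] * attn_mass

rng = np.random.default_rng(0)
scores = multiscale_alignment(rng.normal(size=(50, 64)),   # 50 candidates
                              rng.normal(size=(196, 64)),  # 196 patches
                              rng.normal(size=64),         # pooled image
                              rng.random(50))              # attention mass
```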
- Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models [31.062753031312006]
Large language models generate high-quality responses but can also produce misinformation. Watermarking is pivotal in this context: it embeds hidden markers in generated text. We introduce a novel multi-objective optimization (MOO) approach for watermarking that simultaneously achieves detectability and semantic integrity.
arXiv Detail & Related papers (2024-02-28T05:43:22Z)
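The scheme makes the green-list split ratio gamma and bias delta token-specific, produced by small networks trained with multi-objective optimization against detectability and semantic objectives. A sketch with the two generators stubbed out by arbitrary squashings; only the decoding plumbing is shown, not the MOO training loop.

```python
import hashlib
import numpy as np

def token_specific_params(context_emb):
    """Stub for the two learned generators: map a context embedding to a
    per-step split ratio gamma_t and bias delta_t (untrained placeholders)."""
    gamma = 1.0 / (1.0 + np.exp(-context_emb.mean()))   # in (0, 1)
    delta = 4.0 / (1.0 + np.exp(-context_emb.std()))    # in (0, 4)
    return gamma, delta

def watermark_step(logits, context_ids, context_emb, key):
    gamma, delta = token_specific_params(context_emb)
    material = f"{key}:{tuple(context_ids)}".encode()
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    green = rng.permutation(len(logits))[: int(gamma * len(logits))]
    out = logits.copy()
    out[green] += delta
    return out

step = watermark_step(np.zeros(32), [9, 1],
                      np.random.default_rng(0).normal(size=8), "k")
```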
- A Resilient and Accessible Distribution-Preserving Watermark for Large Language Models [65.40460716619772]
Our research focuses on the importance of a Distribution-Preserving (DiP) watermark. Contrary to current strategies, our proposed DiPmark simultaneously preserves the original token distribution during watermarking (distribution-preserving), is detectable without access to the language model API and prompts (accessible), and is provably robust to moderate changes of tokens (resilient).
arXiv Detail & Related papers (2023-10-11T17:57:35Z)
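Distribution preservation has a crisp meaning: averaged over the secret key, the watermarked sampler outputs exactly the original token distribution. The sketch below shows one classic way to get that property (keyed inverse-transform sampling over a keyed vocabulary permutation); DiPmark's actual reweighting function differs but satisfies the same invariant.

```python
import hashlib
import numpy as np

def dp_sample(probs, context_ids, key):
    """Keyed inverse-transform sampling: for a uniformly random key the
    returned token is an exact draw from `probs`, yet a detector holding
    the key can recompute `u` and the permutation to test for the mark."""
    material = f"{key}:{tuple(context_ids)}".encode()
    seed = int.from_bytes(hashlib.sha256(material).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(probs))   # keyed vocabulary order
    u = rng.random()                      # keyed uniform in [0, 1)
    cdf = np.cumsum(np.asarray(probs)[order])
    idx = min(int(np.searchsorted(cdf, u)), len(probs) - 1)
    return int(order[idx])

tok = dp_sample([0.5, 0.2, 0.2, 0.1], context_ids=[42, 7], key="k0")
```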
- ARBEx: Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning [5.648318448953635]
ARBEx is a novel attentive feature extraction framework driven by a Vision Transformer. We employ learnable anchor points in the embedding space, together with label distributions and a multi-head self-attention mechanism, to improve performance on weak predictions. Extensive experiments across a variety of contexts show that our strategy outperforms current state-of-the-art methodologies.
arXiv Detail & Related papers (2023-05-02T15:10:01Z)
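A rough sketch of the reliability-balancing idea from the ARBEx entry above: attach a label distribution to each learnable anchor, weight anchors by similarity to the current feature, and blend the result with the classifier's weak softmax prediction. `tau`, `lam`, and the cosine weighting are placeholders rather than ARBEx's trained components.

```python
import numpy as np

def reliability_balanced_probs(feat, anchors, anchor_label_dists,
                               logits, tau=0.1, lam=0.5):
    """Blend a weak softmax prediction with label distributions carried by
    nearby anchor points in the embedding space."""
    sims = anchors @ feat / (np.linalg.norm(anchors, axis=1)
                             * np.linalg.norm(feat) + 1e-9)
    w = np.exp(sims / tau)
    w /= w.sum()
    anchor_probs = w @ anchor_label_dists        # (num_classes,)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return lam * p + (1.0 - lam) * anchor_probs

rng = np.random.default_rng(0)
probs = reliability_balanced_probs(rng.normal(size=16),        # feature
                                   rng.normal(size=(5, 16)),   # 5 anchors
                                   np.full((5, 7), 1 / 7),     # label dists
                                   rng.normal(size=7))         # weak logits
```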