AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models
- URL: http://arxiv.org/abs/2602.09611v1
- Date: Tue, 10 Feb 2026 10:02:29 GMT
- Title: AGMark: Attention-Guided Dynamic Watermarking for Large Vision-Language Models
- Authors: Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang,
- Abstract summary: Vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding. We propose Attention-Guided Dynamic Watermarking (AGMark). AGMark embeds detectable signals while strictly preserving visual fidelity.
- Score: 28.393476667026523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Watermarking has emerged as a pivotal solution for content traceability and intellectual property protection in Large Vision-Language Models (LVLMs). However, vision-agnostic watermarks may introduce visually irrelevant tokens and disrupt visual grounding by enforcing indiscriminate pseudo-random biases. Additionally, current vision-specific watermarks rely on a static, one-time estimation of vision-critical weights and ignore the weight distribution density when determining the proportion of protected tokens. This design fails to account for dynamic changes in visual dependence during generation and may introduce low-quality tokens in the long tail. To address these challenges, we propose Attention-Guided Dynamic Watermarking (AGMark), a novel framework that embeds detectable signals while strictly preserving visual fidelity. At each decoding step, AGMark first dynamically identifies semantic-critical evidence based on attention weights for visual relevance, together with context-aware coherence cues, resulting in a more adaptive and well-calibrated evidence-weight distribution. It then determines the proportion of semantic-critical tokens by jointly considering uncertainty awareness (token entropy) and evidence calibration (weight density), thereby enabling adaptive vocabulary partitioning to avoid irrelevant tokens. Empirical results confirm that AGMark outperforms conventional methods, observably improving generation quality and yielding particularly strong gains in visual semantic fidelity in the later stages of generation. The framework maintains highly competitive detection accuracy (at least 99.36% AUC) and robust attack resilience (at least 88.61% AUC) without sacrificing inference efficiency, effectively establishing a new standard for reliability-preserving multi-modal watermarking.
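The per-step logic described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the evidence weights are assumed to be supplied by the LVLM's cross-attention over image tokens, the helper names (`agmark_step`, `base_frac`) are hypothetical, and the exact entropy/density combination is an illustrative placeholder; only the overall pattern (keep high-evidence tokens untouched, bias a keyed green list over the rest) follows the abstract.

```python
import torch

def entropy(probs: torch.Tensor) -> torch.Tensor:
    """Shannon entropy (nats) of a next-token distribution."""
    return -(probs * probs.clamp_min(1e-12).log()).sum()

def agmark_step(logits: torch.Tensor,     # [V] next-token logits
                evidence: torch.Tensor,   # [V] visual-evidence weight per candidate token (assumed given)
                key: int,
                step: int,
                delta: float = 2.0,       # watermark bias strength (assumption)
                base_frac: float = 0.25,  # baseline protected fraction (assumption)
                gamma: float = 0.5):      # green-list ratio over non-protected tokens
    """One decoding step of an attention-guided, dynamically partitioned watermark (sketch)."""
    probs = logits.softmax(dim=-1)
    vocab = logits.numel()

    # Uncertainty awareness: normalised token entropy in [0, 1].
    h = entropy(probs) / torch.log(torch.tensor(float(vocab)))

    # Evidence calibration: peakedness of the evidence-weight distribution;
    # a flat, long-tailed distribution means few tokens are genuinely vision-critical.
    density = (evidence / evidence.sum().clamp_min(1e-12)).max()

    # Proportion of semantic-critical tokens to protect at this step
    # (an illustrative combination of the two signals, not the paper's formula).
    frac = float(base_frac * (1.0 - h.item()) * (1.0 + density.item()))
    k = max(1, min(vocab, int(frac * vocab)))

    # Protected tokens: highest visual evidence; their logits are left untouched.
    protected = torch.topk(evidence, k).indices

    # Keyed pseudo-random green list over the remaining tokens (KGW-style bias).
    gen = torch.Generator().manual_seed(hash((key, step)) & 0x7FFFFFFF)
    green = torch.rand(vocab, generator=gen) < gamma
    green[protected] = False

    biased = logits.clone()
    biased[green] += delta
    return biased, protected
```

Detection would re-derive the same keyed green lists and count hits among non-protected tokens, as in standard green-list detectors; the AUC figures reported in the abstract refer to the paper's own detector, not to this sketch.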
Related papers
- X-Mark: Saliency-Guided Robust Dataset Ownership Verification for Medical Imaging [67.85884025186755]
High-quality medical imaging datasets are essential for training deep learning models, but their unauthorized use raises serious copyright and ethical concerns. Medical imaging presents a unique challenge for existing dataset ownership verification methods designed for natural images. We propose X-Mark, a sample-specific clean-label watermarking method for chest x-ray copyright protection.
arXiv Detail & Related papers (2026-02-10T00:03:43Z)
- A Visual Semantic Adaptive Watermark grounded by Prefix-Tuning for Large Vision-Language Model [48.79816664229285]
VIsual Semantic Adaptive Watermark (VISA-Mark) is a novel framework that embeds detectable signals while strictly preserving visual fidelity. Our approach employs a lightweight, efficiently trained prefix-tuner to extract dynamic Visual-Evidence Weights. Empirical results confirm that VISA-Mark outperforms conventional methods with a 7.8% improvement in visual consistency.
arXiv Detail & Related papers (2026-01-12T07:55:13Z)
- TransFIRA: Transfer Learning for Face Image Recognizability Assessment [73.61309363885552]
TransFIRA is a lightweight and annotation-free framework that grounds recognizability directly in embedding space. New extensions beyond faces include encoder-grounded explainability that reveals how degradations and subject-specific factors affect recognizability. Experiments confirm state-of-the-art results on faces, strong robustness on body recognition, and robustness under cross-dataset shifts.
arXiv Detail & Related papers (2025-10-07T18:16:21Z)
- An Ensemble Framework for Unbiased Language Model Watermarking [60.99969104552168]
We propose ENS, a novel ensemble framework that enhances the detectability and robustness of unbiased watermarks. ENS sequentially composes multiple independent watermark instances, each governed by a distinct key, to amplify the watermark signal. Empirical evaluations show that ENS substantially reduces the number of tokens needed for reliable detection and increases resistance to smoothing and paraphrasing attacks.
arXiv Detail & Related papers (2025-09-28T19:37:44Z)
- StableGuard: Towards Unified Copyright Protection and Tamper Localization in Latent Diffusion Models [55.05404953041403]
We propose a novel framework that seamlessly integrates a binary watermark into the diffusion generation process. We show that StableGuard consistently outperforms state-of-the-art methods in image fidelity, watermark verification, and tampering localization.
arXiv Detail & Related papers (2025-09-22T16:35:19Z)
- VLA-Mark: A cross modal watermark for large vision-language alignment model [44.59029116115437]
VLA-Mark is a vision-aligned framework that embeds detectable watermarks while preserving semantic fidelity through cross-modal coordination. Our approach integrates multiscale visual-textual alignment metrics, combining localized patch affinity, global semantic coherence, and contextual attention patterns. Experiments show 7.4% lower PPL and 26.6% higher BLEU than conventional methods, with near-perfect detection.
arXiv Detail & Related papers (2025-07-18T16:44:41Z)
- CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching [31.42896369011162]
CoMatch is a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy. A covisibility-guided token condenser is introduced to adaptively aggregate tokens in light of their covisibility scores. A fine correlation module is developed to refine the matching candidates in both source and target views to subpixel level.
arXiv Detail & Related papers (2025-03-31T10:17:01Z)
- ARBEx: Attentive Feature Extraction with Reliability Balancing for Robust Facial Expression Learning [5.648318448953635]
ARBEx is a novel attentive feature extraction framework driven by a Vision Transformer.
We employ learnable anchor points in the embedding space with label distributions and a multi-head self-attention mechanism to optimize performance against weak predictions.
Our strategy outperforms current state-of-the-art methodologies, according to extensive experiments conducted in a variety of contexts.
arXiv Detail & Related papers (2023-05-02T15:10:01Z)