VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models
- URL: http://arxiv.org/abs/2505.01406v1
- Date: Fri, 02 May 2025 17:35:03 GMT
- Title: VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models
- Authors: Mohammadreza Teymoorianfard, Shiqing Ma, Amir Houmansadr
- Abstract summary: VIDSTAMP is a watermarking framework that embeds messages directly into the latent space of temporally-aware video diffusion models. Our method imposes no additional inference cost and offers better perceptual quality than prior methods.
- Score: 32.0365189539138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid rise of video diffusion models has enabled the generation of highly realistic and temporally coherent videos, raising critical concerns about content authenticity, provenance, and misuse. Existing watermarking approaches, whether passive, post-hoc, or adapted from image-based techniques, often struggle to withstand video-specific manipulations such as frame insertion, dropping, or reordering, and typically degrade visual quality. In this work, we introduce VIDSTAMP, a watermarking framework that embeds per-frame or per-segment messages directly into the latent space of temporally-aware video diffusion models. By fine-tuning the model's decoder through a two-stage pipeline, first on static image datasets to promote spatial message separation, and then on synthesized video sequences to restore temporal consistency, VIDSTAMP learns to embed high-capacity, flexible watermarks with minimal perceptual impact. Leveraging architectural components such as 3D convolutions and temporal attention, our method imposes no additional inference cost and offers better perceptual quality than prior methods, while maintaining comparable robustness against common distortions and tampering. VIDSTAMP embeds 768 bits per video (48 bits per frame) with a bit accuracy of 95.0%, achieves a log P-value of -166.65 (lower is better), and maintains a video quality score of 0.836, comparable to unwatermarked outputs (0.838) and surpassing prior methods in capacity-quality tradeoffs. Code: \url{https://github.com/SPIN-UMass/VidStamp}
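Two of the reported numbers can be unpacked. The capacity figure, 768 bits per video at 48 bits per frame, corresponds to 16 generated frames. Below is a minimal sketch of the two-stage decoder fine-tuning the abstract describes; how the message conditions the decoder, the choice of losses, and every name here are assumptions for illustration, not the released VidStamp code:
```python
# Hedged sketch of two-stage decoder fine-tuning for watermark embedding.
# The message-conditioning interface and the losses are assumptions.
import torch
import torch.nn.functional as F

def finetune_stage(decoder, extractor, batches, msg_bits=48, w_img=1.0):
    opt = torch.optim.Adam(decoder.parameters(), lr=1e-5)
    for latents, frames_ref in batches:
        msg = torch.randint(0, 2, (latents.shape[0], msg_bits)).float()
        frames = decoder(latents, msg)  # hypothetical message-conditioned decode
        # Recover the message from the decoded frames while staying
        # close to the unwatermarked reference frames.
        loss = F.binary_cross_entropy_with_logits(extractor(frames), msg)
        loss = loss + w_img * F.mse_loss(frames, frames_ref)
        opt.zero_grad()
        loss.backward()
        opt.step()

# Stage 1: static image batches, promoting spatial message separation.
# Stage 2: synthesized video sequences, restoring temporal consistency.
```
The detection statistic is easier to pin down: a log P-value of -166.65 at 95.0% bit accuracy is close to what a one-sided binomial test against random 50/50 bits predicts, assuming that is the null hypothesis used:
```python
# Hedged sketch: derive the order of magnitude of the reported log
# p-value from the bit accuracy alone, under a Binomial(768, 0.5) null.
import numpy as np
from scipy.stats import binom

n_bits = 768               # 16 frames x 48 bits per frame
k = round(0.95 * n_bits)   # ~730 correctly recovered bits

# P(X >= k) under H0; logsf is the exclusive tail, so shift by one.
log10_p = binom.logsf(k - 1, n_bits, 0.5) / np.log(10)
print(f"log10 p-value ~ {log10_p:.1f}")  # about -165, near the reported -166.65
```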
Related papers
- Video Signature: In-generation Watermarking for Latent Video Diffusion Models [19.648332041264474]
Video Signature (VidSig) is an in-generation watermarking method for latent video diffusion models. We achieve this by partially fine-tuning the latent decoder, where Perturbation-Aware Suppression (PAS) pre-identifies and freezes perceptually sensitive layers to preserve visual quality (a sketch of this selection step follows this entry). Experimental results show that VidSig achieves the best overall performance in watermark extraction, visual quality, and generation efficiency.
arXiv Detail & Related papers (2025-05-31T17:43:54Z)
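The PAS step can be pictured as a perturbation screen over the decoder: score each layer by how much a small weight perturbation disturbs the output, then freeze the most sensitive layers and fine-tune only the rest. A minimal PyTorch sketch, where the scoring rule, the frozen fraction, and all names are illustrative assumptions rather than the VidSig implementation:
```python
# Hypothetical PAS-style screen: freeze perceptually sensitive layers.
import copy
import torch

@torch.no_grad()
def sensitivity(decoder, z, name, eps=1e-3):
    """Mean output change when one layer's weights are slightly perturbed."""
    ref = decoder(z)
    probe = copy.deepcopy(decoder)  # perturb a throwaway copy per layer
    for p in dict(probe.named_modules())[name].parameters(recurse=False):
        p.add_(eps * torch.randn_like(p))
    return (probe(z) - ref).abs().mean().item()

@torch.no_grad()
def freeze_most_sensitive(decoder, z, frac=0.5):
    """Freeze the top `frac` of parameterized layers, ranked by sensitivity."""
    modules = dict(decoder.named_modules())
    names = [n for n, m in modules.items() if list(m.parameters(recurse=False))]
    ranked = sorted(names, key=lambda n: sensitivity(decoder, z, n), reverse=True)
    for n in ranked[: int(len(ranked) * frac)]:
        for p in modules[n].parameters(recurse=False):
            p.requires_grad_(False)  # protected to preserve visual quality
```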
- Safe-Sora: Safe Text-to-Video Generation via Graphical Watermarking [53.434260110195446]
Safe-Sora is the first framework to embed graphical watermarks directly into the video generation process. We develop a 3D wavelet transform-enhanced Mamba architecture with an adaptive spatiotemporal local scanning strategy (a decomposition sketch follows this entry). Experiments demonstrate Safe-Sora achieves state-of-the-art performance in terms of video quality, watermark fidelity, and robustness.
arXiv Detail & Related papers (2025-05-19T03:31:31Z)
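Safe-Sora's wavelet front end is particular to its architecture, but the transform itself is standard. A minimal sketch of a single-level 3D Haar decomposition of a video clip with PyWavelets; how the eight subbands are scanned by the Mamba blocks is not shown and is left as an assumption:
```python
# Hedged sketch: single-level 3D Haar wavelet decomposition of a clip.
import numpy as np
import pywt

video = np.random.rand(16, 64, 64)   # (frames, height, width), toy data
coeffs = pywt.dwtn(video, "haar")    # dict of 2^3 = 8 subbands

print(sorted(coeffs))                # ['aaa', 'aad', ..., 'ddd']
print(coeffs["aaa"].shape)           # (8, 32, 32): halved along every axis
```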
- VideoPASTA: 7K Preference Pairs That Matter for Video-LLM Alignment [0.6854849895338531]
Video-language models (Video-LLMs) excel at understanding video content but struggle with spatial relationships, temporal ordering, and cross-frame continuity. We introduce VideoPASTA, a framework that enhances Video-LLMs through targeted preference optimization.
arXiv Detail & Related papers (2025-04-18T22:28:03Z)
- Video-Panda: Parameter-efficient Alignment for Encoder-free Video-Language Models [26.866184981409607]
We present an efficient encoder-free approach for video-language understanding that achieves competitive performance while significantly reducing computational overhead. Our method introduces a novel Spatio-Temporal Alignment Block (STAB) that directly processes video inputs without requiring pre-trained encoders. Our model achieves comparable or superior performance to encoder-based approaches for open-ended video question answering on standard benchmarks.
arXiv Detail & Related papers (2024-12-24T18:59:56Z)
- Large Motion Video Autoencoding with Cross-modal Video VAE [52.13379965800485]
A video Variational Autoencoder (VAE) is essential for reducing video redundancy and facilitating efficient video generation. Existing video VAEs have begun to address temporal compression; however, they often suffer from inadequate reconstruction performance. We present a novel and powerful video autoencoder capable of high-fidelity video encoding.
arXiv Detail & Related papers (2024-12-23T18:58:24Z)
- Video Seal: Open and Efficient Video Watermarking [47.40833588157406]
Video watermarking addresses such challenges by embedding imperceptible signals into videos, allowing for identification. Video Seal is a comprehensive framework for neural video watermarking and a competitive open-source model. We present experimental results demonstrating the effectiveness of the approach in terms of speed, imperceptibility, and robustness.
arXiv Detail & Related papers (2024-12-12T17:41:49Z)
- LVMark: Robust Watermark for Latent Video Diffusion Models [13.85241328100336]
We introduce LVMark, a novel watermarking method for video diffusion models. We propose a new watermark decoder tailored to generated videos that learns the consistency between adjacent frames. We optimize both the watermark decoder and the latent decoder of the diffusion model, effectively balancing the trade-off between visual quality and bit accuracy (a sketch of such a joint objective follows this entry).
arXiv Detail & Related papers (2024-12-12T09:57:20Z)
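The trade-off LVMark tunes typically reduces to a weighted two-term objective shared by the jointly trained decoders. A minimal sketch; both terms and their weights are illustrative assumptions, not LVMark's actual loss:
```python
# Hedged sketch of a joint visual-quality / bit-accuracy objective.
import torch.nn.functional as F

def joint_loss(frames_wm, frames_ref, bit_logits, bits_true,
               w_quality=1.0, w_bits=0.1):
    quality = F.mse_loss(frames_wm, frames_ref)  # stay close to clean frames
    recovery = F.binary_cross_entropy_with_logits(bit_logits, bits_true)
    return w_quality * quality + w_bits * recovery
```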
- Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors [54.8852848659663]
Buffer Anytime is a framework for estimating depth and normal maps (which we call geometric buffers) from video.
We demonstrate high-quality video buffer estimation by leveraging single-image priors with temporal consistency constraints.
arXiv Detail & Related papers (2024-11-26T09:28:32Z)
- Blurry Video Compression: A Trade-off between Visual Enhancement and Data Compression [65.8148169700705]
Existing video compression (VC) methods primarily aim to reduce the spatial and temporal redundancies between consecutive frames in a video.
Previous works have achieved remarkable results on videos acquired under specific settings such as instant (known) exposure time and shutter speed.
In this work, we tackle the VC problem in a general scenario where a given video can be blurry due to predefined camera settings or dynamics in the scene.
arXiv Detail & Related papers (2023-11-08T02:17:54Z)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt (a sketch of this step follows this entry).
arXiv Detail & Related papers (2023-09-01T11:14:43Z)
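VideoGen's first stage is a plain text-to-image call. A minimal sketch using the diffusers library; the checkpoint name and prompt are assumptions, and the reference-guided video stage that follows it is not shown:
```python
# Hedged sketch of generating VideoGen's reference image with an
# off-the-shelf Stable Diffusion checkpoint (name is illustrative).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

reference = pipe("a red fox running through snow").images[0]
reference.save("reference.png")  # conditions the latent video diffusion stage
```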
- Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
- Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moiré patterns, appearing as color distortions, severely degrade image and video quality when filming a screen with digital cameras. We study how to remove such undesirable moiré patterns in videos, namely video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z)