WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
- URL: http://arxiv.org/abs/2512.19048v1
- Date: Mon, 22 Dec 2025 05:33:59 GMT
- Title: WaTeRFlow: Watermark Temporal Robustness via Flow Consistency
- Authors: Utae Jeong, Sumin In, Hyunju Ryu, Jaewan Choi, Feng Yang, Jongheon Jeong, Seungryong Kim, Sangpil Kim
- Abstract summary: We present WaTeRFlow, a framework tailored for robustness under I2V. It exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience to distortions applied before or after video generation.
- Score: 46.206343565195375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image watermarking supports authenticity and provenance, yet many schemes are still easy to bypass with various distortions and powerful generative edits. Deep learning-based watermarking has improved robustness to diffusion-based image editing, but a gap remains when a watermarked image is converted to video by image-to-video (I2V), in which per-frame watermark detection weakens. I2V has quickly advanced from short, jittery clips to multi-second, temporally coherent scenes, and it now serves not only content creation but also world-modeling and simulation workflows, making cross-modal watermark recovery crucial. We present WaTeRFlow, a framework tailored for robustness under I2V. It consists of (i) FUSE (Flow-guided Unified Synthesis Engine), which exposes the encoder-decoder to realistic distortions via instruction-driven edits and a fast video diffusion proxy during training, (ii) optical-flow warping with a Temporal Consistency Loss (TCL) that stabilizes per-frame predictions, and (iii) a semantic preservation loss that maintains the conditioning signal. Experiments across representative I2V models show accurate watermark recovery from frames, with higher first-frame and per-frame bit accuracy and resilience when various distortions are applied before or after video generation.
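The abstract's component (ii) is its most mechanistic claim: the decoder's per-frame watermark predictions are stabilized by warping the previous frame's prediction along the optical flow and penalizing disagreement with the current one. Below is a minimal PyTorch sketch of such a flow-warped consistency penalty; the function names, the L1 form of the penalty, and the backward-flow convention are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch (assumption, not the WaTeRFlow release): a flow-warped
# temporal consistency penalty in the spirit of the paper's TCL.
import torch
import torch.nn.functional as F


def warp_with_flow(x: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a map x of shape (B, C, H, W) using flow (B, 2, H, W),
    where flow points from the current frame back to the previous one."""
    _, _, h, w = x.shape
    # Pixel-coordinate sampling grid, displaced by the flow field.
    ys, xs = torch.meshgrid(
        torch.arange(h, device=x.device, dtype=x.dtype),
        torch.arange(w, device=x.device, dtype=x.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # grid_sample expects coordinates normalized to [-1, 1].
    grid = torch.stack(
        (2.0 * grid_x / (w - 1) - 1.0, 2.0 * grid_y / (h - 1) - 1.0), dim=-1
    )
    return F.grid_sample(x, grid, align_corners=True)


def temporal_consistency_loss(pred_t, pred_prev, flow_t_to_prev):
    """L1 gap between frame-t watermark logits and the flow-warped logits
    from frame t-1; a small value means the decoder is temporally stable."""
    return F.l1_loss(pred_t, warp_with_flow(pred_prev, flow_t_to_prev))
```

In training, a term like this would be averaged over adjacent frame pairs and combined with the bit-recovery and fidelity objectives the abstract describes.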
Related papers
- SKeDA: A Generative Watermarking Framework for Text-to-video Diffusion Models [40.540302276054376]
We propose a generative watermarking framework tailored for text-to-video diffusion models. SKeDA consists of two components: (1) Shuffle-Key-based Distribution-preserving Sampling (SKe), which employs a single base pseudo-random binary sequence for watermark encryption and derives frame-level encryption sequences through permutation (a toy sketch of this permutation step appears after this list). Extensive experiments demonstrate that SKeDA preserves high video generation quality and watermark robustness.
arXiv Detail & Related papers (2026-02-27T06:18:03Z)
- SPDMark: Selective Parameter Displacement for Robust Video Watermarking [30.398519705830264]
This work introduces a novel framework for in-generation video watermarking called SPDMark. Watermarks are embedded into the generated videos by modifying a subset of parameters in the generative model. Evaluations on both text-to-video and image-to-video generation models demonstrate the ability of SPDMark to generate imperceptible watermarks.
arXiv Detail & Related papers (2025-12-12T23:35:13Z)
- VTinker: Guided Flow Upsampling and Texture Mapping for High-Resolution Video Frame Interpolation [55.93266219195357]
We propose a novel Video Frame Interpolation (VFI) pipeline, VTinker, which consists of two core components: guided flow upsampling (GFU) and Texture Mapping.
arXiv Detail & Related papers (2025-11-20T07:30:16Z)
- I2VWM: Robust Watermarking for Image to Video Generation [41.34965301146522]
I2VWM is a cross-modal watermarking framework designed to enhance watermark robustness across time. Experiments on both open-source and commercial I2V models demonstrate that I2VWM significantly improves robustness while maintaining imperceptibility.
arXiv Detail & Related papers (2025-09-22T13:37:37Z)
- VIDSTAMP: A Temporally-Aware Watermark for Ownership and Integrity in Video Diffusion Models [32.0365189539138]
VIDSTAMP is a watermarking framework that embeds messages directly into the latent space of temporally-aware video diffusion models. Our method imposes no additional inference cost and offers better perceptual quality than prior methods.
arXiv Detail & Related papers (2025-05-02T17:35:03Z)
- WaterFlow: Learning Fast & Robust Watermarks using Stable Diffusion [46.10882190865747]
WaterFlow is a fast and extremely robust approach for high fidelity visual watermarking based on a learned latent-dependent watermark. WaterFlow demonstrates state-of-the-art performance on general robustness and is the first method capable of effectively defending against difficult combination attacks.
arXiv Detail & Related papers (2025-04-15T23:27:52Z)
- Taming Rectified Flow for Inversion and Editing [57.3742655030493]
Rectified-flow-based diffusion transformers like FLUX and OpenSora have demonstrated outstanding performance in the field of image and video generation. Despite their robust generative capabilities, these models often struggle with inversion inaccuracies. We propose RF-Solver, a training-free sampler that effectively enhances inversion precision by mitigating the errors in the inversion process of rectified flow.
arXiv Detail & Related papers (2024-11-07T14:29:02Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video. We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing. COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually, often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z)
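As flagged in the SKeDA entry above, its shuffle-key mechanism (a single base pseudo-random binary sequence, permuted per frame into frame-level encryption sequences) is concrete enough to sketch. The per-frame seeding and the XOR step below are hypothetical stand-ins for illustration, not the SKeDA implementation.

```python
# Illustrative sketch (assumption, not the SKeDA release): derive per-frame
# encryption sequences by permuting one base pseudo-random bit sequence.
import numpy as np


def frame_keys(seed: int, seq_len: int, num_frames: int) -> np.ndarray:
    """One base binary sequence; every frame gets a permuted copy of it."""
    base = np.random.default_rng(seed).integers(0, 2, size=seq_len)
    keys = np.empty((num_frames, seq_len), dtype=base.dtype)
    for t in range(num_frames):
        # Hypothetical per-frame seeding; the paper's key schedule may differ.
        perm = np.random.default_rng((seed, t)).permutation(seq_len)
        keys[t] = base[perm]
    return keys


def encrypt(watermark_bits: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """XOR the same payload with each frame's derived key (toy encryption)."""
    return watermark_bits[None, :] ^ keys


# Example: one 64-bit payload encrypted independently for 16 frames.
bits = np.random.default_rng(0).integers(0, 2, size=64)
cipher = encrypt(bits, frame_keys(seed=42, seq_len=64, num_frames=16))
```

The point of the permutation step is that every frame carries a differently scrambled key derived from one shared secret, so a decoder holding the seed can reconstruct each frame's sequence.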
This list is automatically generated from the titles and abstracts of the papers on this site.