Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos
- URL: http://arxiv.org/abs/2511.19936v1
- Date: Tue, 25 Nov 2025 05:21:23 GMT
- Title: Image Diffusion Models Exhibit Emergent Temporal Propagation in Videos
- Authors: Youngseo Kim, Dohyun Kim, Geohee Han, Paul Hongsuck Seo
- Abstract summary: We demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos that leverages a pretrained image diffusion model with SAM-guided mask refinement.
- Score: 13.824335238443334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image diffusion models, though originally developed for image generation, implicitly capture rich semantic structures that enable various recognition and localization tasks beyond synthesis. In this work, we investigate how their self-attention maps can be reinterpreted as semantic label propagation kernels, providing robust pixel-level correspondences between relevant image regions. Extending this mechanism across frames yields a temporal propagation kernel that enables zero-shot object tracking via segmentation in videos. We further demonstrate the effectiveness of test-time optimization strategies (DDIM inversion, textual inversion, and adaptive head weighting) in adapting diffusion features for robust and consistent label propagation. Building on these findings, we introduce DRIFT, a framework for object tracking in videos leveraging a pretrained image diffusion model with SAM-guided mask refinement, achieving state-of-the-art zero-shot performance on standard video object segmentation benchmarks.
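As a rough illustration of the label-propagation view described in the abstract, the minimal sketch below (not the authors' DRIFT implementation) propagates a reference-frame mask to a new frame through an affinity kernel built from attention-style query/key features. The function name, temperature value, and tensor shapes are illustrative assumptions; in practice the features would come from the self-attention layers of a pretrained diffusion model rather than random tensors.

```python
# Minimal sketch of attention-based label propagation across frames,
# assuming per-frame query/key features have already been extracted
# from a pretrained image diffusion model's self-attention layers.
import torch
import torch.nn.functional as F

def propagate_labels(query_t, key_ref, mask_ref, temperature=0.07):
    """Propagate a reference-frame soft mask to the current frame.

    query_t:  (HW, C) attention queries of the current frame
    key_ref:  (HW, C) attention keys of the reference frame
    mask_ref: (HW, K) soft label map of the reference frame (K classes)
    Returns:  (HW, K) propagated soft labels for the current frame.
    """
    q = F.normalize(query_t, dim=-1)
    k = F.normalize(key_ref, dim=-1)
    affinity = torch.softmax(q @ k.T / temperature, dim=-1)  # (HW, HW) propagation kernel
    return affinity @ mask_ref

# Toy usage with random features standing in for diffusion attention maps.
hw, c, k_classes = 32 * 32, 64, 2
query_t = torch.randn(hw, c)
key_ref = torch.randn(hw, c)
mask_ref = F.one_hot(torch.randint(0, k_classes, (hw,)), k_classes).float()
mask_t = propagate_labels(query_t, key_ref, mask_ref)
print(mask_t.shape)  # torch.Size([1024, 2])
```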
Related papers
- Seg4Diff: Unveiling Open-Vocabulary Segmentation in Text-to-Image Diffusion Transformers [56.76198904599581]
Text-to-image diffusion models excel at translating language prompts into images, implicitly grounding concepts through their cross-modal attention mechanisms. Recent multi-modal diffusion transformers extend this by introducing joint self-attention over image and text tokens, enabling richer and more scalable cross-modal alignment. We introduce Seg4Diff, a systematic framework for analyzing the attention structures of MM-DiT, with a focus on how specific layers propagate semantic information from text to image. See the illustrative sketch below.
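As a loose illustration of how cross-modal attention can be read as open-vocabulary segmentation, the sketch below averages an image-to-text attention tensor over heads and assigns each image position to the class-word token it attends to most strongly. The shapes, token indices, and function name are hypothetical; this is not the Seg4Diff analysis code.

```python
# Hedged sketch: turning image-to-text attention into coarse class masks.
import torch

def attention_to_masks(cross_attn, class_token_ids):
    """cross_attn: (heads, HW, T) attention from image positions to T text tokens
    class_token_ids: indices of the text tokens corresponding to class words
    Returns: (HW,) per-position class assignment."""
    attn = cross_attn.mean(dim=0)             # average over heads -> (HW, T)
    class_attn = attn[:, class_token_ids]     # keep only class-word tokens
    return class_attn.argmax(dim=-1)          # per-position class index

# Toy usage with random attention standing in for a diffusion transformer layer.
masks = attention_to_masks(torch.randn(8, 1024, 77), class_token_ids=[3, 7])
print(masks.shape)  # torch.Size([1024])
```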
arXiv Detail & Related papers (2025-09-22T17:59:54Z)
- Domain Adaptive SAR Wake Detection: Leveraging Similarity Filtering and Memory Guidance [5.026771815351906]
We propose a Similarity-Guided and Memory-Guided Domain Adaptation (SimMemDA) framework for unsupervised domain-adaptive ship wake detection. We first utilize WakeGAN to perform style transfer on optical images, generating pseudo-images close to the SAR style. Then, an instance-level feature similarity filtering mechanism is designed to identify and prioritize source samples with target-like distributions.
arXiv Detail & Related papers (2025-09-14T08:35:39Z)
- Motion-Aware Concept Alignment for Consistent Video Editing [57.08108545219043]
We introduce MoCA-Video (Motion-Aware Concept Alignment in Video), a training-free framework bridging the gap between image-domain semantic mixing and video. Given a generated video and a user-provided reference image, MoCA-Video injects the semantic features of the reference image into a specific object within the video. We evaluate MoCA-Video's performance using the standard SSIM, image-level LPIPS, and temporal LPIPS, and introduce a novel metric, CASS (Conceptual Alignment Shift Score), to evaluate the consistency and effectiveness of the visual shifts between the source prompt and the modified video frames.
arXiv Detail & Related papers (2025-06-01T13:28:04Z) - EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation [26.888320234592978]
Zero-shot, training-free, image-based text-to-video generation is an emerging area that aims to generate videos using existing image-based diffusion models.<n>We provide a model-agnostic approach, using intersections in diffusion trajectories, working only with latent values.<n>An in-context trained LLM is used to generate coherent frame-wise prompts; another is used to identify differences between frames.<n>Our approach results in state-of-the-art performance while being more flexible when working with diverse image-generation models.
arXiv Detail & Related papers (2025-04-09T13:11:09Z) - Consistent Human Image and Video Generation with Spatially Conditioned Diffusion [82.4097906779699]
Consistent human-centric image and video synthesis aims to generate images with new poses while preserving appearance consistency with a given reference image.<n>We frame the task as a spatially-conditioned inpainting problem, where the target image is in-painted to maintain appearance consistency with the reference.<n>This approach enables the reference features to guide the generation of pose-compliant targets within a unified denoising network.
arXiv Detail & Related papers (2024-12-19T05:02:30Z) - DINTR: Tracking via Diffusion-based Interpolation [12.130669304428565]
This work proposes a novel diffusion-based methodology to formulate the tracking task.
Our INterpolation TrackeR (DINTR) presents a promising new paradigm and achieves a superior multiplicity on seven benchmarks across five indicator representations.
arXiv Detail & Related papers (2024-10-14T00:41:58Z) - Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models [96.97910688908956]
We introduce the first zero-shot approach for Video Semantic (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
arXiv Detail & Related papers (2024-05-27T08:39:38Z) - EmerDiff: Emerging Pixel-level Semantic Knowledge in Diffusion Models [52.3015009878545]
We develop an image segmentor capable of generating fine-grained segmentation maps without any additional training.
Our framework identifies semantic correspondences between image pixels and spatial locations of low-dimensional feature maps.
In extensive experiments, the produced segmentation maps are demonstrated to be well delineated and capture detailed parts of the images.
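One simple way to picture the pixel-to-feature correspondences mentioned in this summary is a nearest-neighbour assignment in feature space, sketched below under the assumption that per-pixel embeddings and a low-resolution feature map are already available. This is an illustrative toy, not the EmerDiff procedure.

```python
# Hedged sketch: assign each pixel to the most similar spatial location
# of a low-resolution diffusion feature map, yielding coarse segment ids.
import torch
import torch.nn.functional as F

def pixel_to_feature_assignment(pixel_feats, lowres_feats):
    """pixel_feats:  (H, W, C) per-pixel embeddings (e.g. upsampled decoder features)
    lowres_feats: (h, w, C) low-dimensional feature map
    Returns: (H, W) index of the most similar low-res location for each pixel."""
    H, W, C = pixel_feats.shape
    h, w, _ = lowres_feats.shape
    p = F.normalize(pixel_feats.reshape(-1, C), dim=-1)   # (HW, C)
    f = F.normalize(lowres_feats.reshape(-1, C), dim=-1)  # (hw, C)
    sim = p @ f.T                                         # (HW, hw) cosine similarity
    return sim.argmax(dim=-1).reshape(H, W)               # coarse segment ids

# Toy usage with random tensors standing in for diffusion features.
assign = pixel_to_feature_assignment(torch.randn(64, 64, 128), torch.randn(16, 16, 128))
print(assign.shape)  # torch.Size([64, 64])
```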
arXiv Detail & Related papers (2024-01-22T07:34:06Z)
- TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
The video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
- Semantic Image Synthesis via Diffusion Models [174.24523061460704]
Denoising Diffusion Probabilistic Models (DDPMs) have achieved remarkable success in various image generation tasks. Recent work on semantic image synthesis mainly follows the de facto GAN-based approaches. We propose a novel framework based on DDPMs for semantic image synthesis.
arXiv Detail & Related papers (2022-06-30T18:31:51Z)