Interactive Video Generation via Domain Adaptation
- URL: http://arxiv.org/abs/2505.24253v1
- Date: Fri, 30 May 2025 06:19:47 GMT
- Title: Interactive Video Generation via Domain Adaptation
- Authors: Ishaan Rawal, Suryansh Kumar
- Abstract summary: Text-conditioned diffusion models have emerged as powerful tools for high-quality video generation. Recent training-free approaches introduce attention masking to guide trajectory, but this often degrades quality. We identify two key failure modes in these methods, both of which we interpret as domain shift problems.
- Score: 7.397099215417549
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-conditioned diffusion models have emerged as powerful tools for high-quality video generation. However, enabling Interactive Video Generation (IVG), where users control motion elements such as object trajectory, remains challenging. Recent training-free approaches introduce attention masking to guide trajectory, but this often degrades perceptual quality. We identify two key failure modes in these methods, both of which we interpret as domain shift problems, and propose solutions inspired by domain adaptation. First, we attribute the perceptual degradation to internal covariate shift induced by attention masking, as pretrained models are not trained to handle masked attention. To address this, we propose mask normalization, a pre-normalization layer designed to mitigate this shift via distribution matching. Second, we address initialization gap, where the randomly sampled initial noise does not align with IVG conditioning, by introducing a temporal intrinsic diffusion prior that enforces spatio-temporal consistency at each denoising step. Extensive qualitative and quantitative evaluations demonstrate that mask normalization and temporal intrinsic denoising improve both perceptual quality and trajectory control over the existing state-of-the-art IVG techniques.
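The paper releases no code, but the mask-normalization idea admits a toy illustration. The sketch below is an assumption-laden reading, not the authors' method: the function names and the mean/variance matching rule are hypothetical. The intuition it demonstrates is that naively masking attention scores shifts their distribution away from what the pretrained model saw during training (the internal covariate shift the abstract describes), so the surviving scores are rescaled to match the statistics of the full, unmasked score vector before the softmax.

```python
import math

def softmax(xs):
    # numerically stable softmax over a single attention row
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def masked_attention_row(scores, mask):
    # naive trajectory masking: drop masked positions, renormalize the rest;
    # this changes the distribution the pretrained model expects to see
    masked = [s if keep else float("-inf") for s, keep in zip(scores, mask)]
    return softmax(masked)

def mask_normalized_row(scores, mask):
    # hypothetical "mask normalization": shift and scale the surviving scores
    # so their mean and variance match those of the full score vector,
    # approximating the statistics the model was trained on
    full_mean = sum(scores) / len(scores)
    full_var = sum((s - full_mean) ** 2 for s in scores) / len(scores)
    kept = [s for s, keep in zip(scores, mask) if keep]
    k_mean = sum(kept) / len(kept)
    k_var = sum((s - k_mean) ** 2 for s in kept) / len(kept)
    scale = math.sqrt(full_var / k_var) if k_var > 0 else 1.0
    adjusted = [(s - k_mean) * scale + full_mean if keep else float("-inf")
                for s, keep in zip(scores, mask)]
    return softmax(adjusted)
```

Masked positions still receive zero weight, but the unmasked weights are computed from scores whose first two moments match the unmasked case, which is one plausible form of the "distribution matching" the abstract names.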
Related papers
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - On Denoising Walking Videos for Gait Recognition [10.905636016507994]
We propose DenoisingGait, a novel gait denoising method. Inspired by the philosophy that "what I cannot create, I do not understand", we turn to generative diffusion models. DenoisingGait achieves a new SoTA performance in most cases for both within- and cross-domain evaluations.
arXiv Detail & Related papers (2025-05-24T08:17:34Z) - One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step. To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration. Our method achieves strong performance on both full and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z) - Dual Conditioned Motion Diffusion for Pose-Based Video Anomaly Detection [12.100563798908777]
Video Anomaly Detection (VAD) is essential for computer vision research. Existing VAD methods utilize either reconstruction-based or prediction-based frameworks. We address pose-based video anomaly detection and introduce a novel framework called Dual Conditioned Motion Diffusion.
arXiv Detail & Related papers (2024-12-23T01:31:39Z) - Frequency-Guided Diffusion Model with Perturbation Training for Skeleton-Based Video Anomaly Detection [43.49146665908238]
Video anomaly detection (VAD) is a vital yet complex open-set task in computer vision. We introduce a novel frequency-guided diffusion model with perturbation training. We employ the 2D Discrete Cosine Transform (DCT) to separate high-frequency (local) and low-frequency (global) motion components.
arXiv Detail & Related papers (2024-12-04T05:43:53Z) - UniForensics: Face Forgery Detection via General Facial Representation [60.5421627990707]
High-level semantic features are less susceptible to perturbations and not limited to forgery-specific artifacts, thus having stronger generalization.
We introduce UniForensics, a novel deepfake detection framework that leverages a transformer-based video network, with a meta-functional face classification for enriched facial representation.
arXiv Detail & Related papers (2024-07-26T20:51:54Z) - Denoising as Adaptation: Noise-Space Domain Adaptation for Image Restoration [64.84134880709625]
We show that it is possible to perform domain adaptation via the noise space using diffusion models. In particular, by leveraging the unique property of how auxiliary conditional inputs influence the multi-step denoising process, we derive a meaningful diffusion loss. We present crucial strategies such as a channel-shuffling layer and residual-swapping contrastive learning in the diffusion model.
arXiv Detail & Related papers (2024-06-26T17:40:30Z) - Denoising Diffusion Semantic Segmentation with Mask Prior Modeling [61.73352242029671]
We propose to ameliorate the semantic segmentation quality of existing discriminative approaches with a mask prior modeled by a denoising diffusion generative model.
We evaluate the proposed prior modeling with several off-the-shelf segmentors, and our experimental results on ADE20K and Cityscapes demonstrate that our approach achieves competitive quantitative performance.
arXiv Detail & Related papers (2023-06-02T17:47:01Z) - DiffusionAD: Norm-guided One-step Denoising Diffusion for Anomaly Detection [80.20339155618612]
DiffusionAD is a novel anomaly detection pipeline comprising a reconstruction sub-network and a segmentation sub-network. A rapid one-step denoising paradigm achieves hundreds of times acceleration while preserving comparable reconstruction quality. Considering the diversity in the manifestation of anomalies, we propose a norm-guided paradigm to integrate the benefits of multiple noise scales.
arXiv Detail & Related papers (2023-03-15T16:14:06Z) - Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective [37.45565756522847]
We consider the generation of cross-domain videos from two sets of latent factors.
TranSVAE framework is then developed to model such generation.
Experiments on the UCF-HMDB, Jester, and Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE.
arXiv Detail & Related papers (2022-08-15T17:59:31Z) - Generative Domain Adaptation for Face Anti-Spoofing [38.12738183385737]
Face anti-spoofing approaches based on unsupervised domain adaptation (UDA) have drawn growing attention due to promising performance in target scenarios.
Most existing UDA FAS methods typically fit the trained models to the target domain via aligning the distribution of semantic high-level features.
We propose a novel perspective of UDA FAS that directly fits the target data to the models, stylizes the target data to the source-domain style via image translation, and further feeds the stylized data into the well-trained source model for classification.
arXiv Detail & Related papers (2022-07-20T16:24:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.