Related papers: UniTransfer: Video Concept Transfer via Progressive Spatial and Timestep Decomposition

URL: http://arxiv.org/abs/2509.21086v1
Date: Thu, 25 Sep 2025 12:39:06 GMT
Title: UniTransfer: Video Concept Transfer via Progressive Spatial and Timestep Decomposition
Authors: Guojun Lei, Rong Zhang, Chi Wang, Tianhang Liu, Hong Li, Zhiyuan Ma, Weiwei Xu
Abstract summary: We propose a novel architecture UniTransfer, achieving precise and controllable video concept transfer. In terms of spatial decomposition, we decouple videos into three key components: the subject, the background, and the motion flow. We also introduce a dual-to-single-stream DiT-based architecture for supporting fine-grained control over different components in the videos.
Score: 27.259262849397913
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: We propose a novel architecture UniTransfer, which introduces both spatial and diffusion timestep decomposition in a progressive paradigm, achieving precise and controllable video concept transfer. Specifically, in terms of spatial decomposition, we decouple videos into three key components: the foreground subject, the background, and the motion flow. Building upon this decomposed formulation, we further introduce a dual-to-single-stream DiT-based architecture for supporting fine-grained control over different components in the videos. We also introduce a self-supervised pretraining strategy based on random masking to enhance the decomposed representation learning from large-scale unlabeled video data. Inspired by the Chain-of-Thought reasoning paradigm, we further revisit the denoising diffusion process and propose a Chain-of-Prompt (CoP) mechanism to achieve the timestep decomposition. We decompose the denoising process into three stages of different granularity and leverage large language models (LLMs) for stage-specific instructions to guide the generation progressively. We also curate an animal-centric video dataset called OpenAnimal to facilitate the advancement and benchmarking of research in video concept transfer. Extensive experiments demonstrate that our method achieves high-quality and controllable video concept transfer across diverse reference images and scenes, surpassing existing baselines in both visual fidelity and editability. Web Page: https://yu-shaonian.github.io/UniTransfer-Web/

Related papers

Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation [8.108805590363392]
Tora is a diffusion transformer model for motion-guided video generation. Tora2 introduces several design improvements to expand its capabilities in both appearance and motion customization. Tora2 is the first method to achieve simultaneous multi-entity customization of appearance and motion for video generation.
arXiv Detail & Related papers (2025-07-08T13:11:40Z)
MambaVideo for Discrete Video Tokenization with Channel-Split Quantization [34.23941517563312]
This work introduces a state-of-the-art discrete video tokenizer with two key contributions. First, we propose a novel Mamba-based encoder-decoder architecture that overcomes the limitations of previous sequence-based tokenizers. Second, we introduce a new quantization scheme, channel-split quantization, which significantly enhances the representational power of quantized latents.
arXiv Detail & Related papers (2025-07-06T22:23:27Z)
OmniVCus: Feedforward Subject-driven Video Customization with Multimodal Control Conditions [96.31455979495398]
We develop an Image-Video Transfer Mixed (IVTM) training with image editing data to enable instructive editing for the subject in the customized video. We also propose a diffusion Transformer framework, OmniVCus, with two embedding mechanisms, Lottery Embedding (LE) and Temporally Aligned Embedding (TAE). Our method significantly surpasses state-of-the-art methods in both quantitative and qualitative evaluations.
arXiv Detail & Related papers (2025-06-29T18:43:00Z)
UniVST: A Unified Framework for Training-free Localized Video Style Transfer [102.52552893495475]
This paper presents UniVST, a unified framework for localized video style transfer based on diffusion models. It operates without the need for training, offering a distinct advantage over existing diffusion methods that transfer style across entire videos.
arXiv Detail & Related papers (2024-10-26T05:28:02Z)
Vivid-ZOO: Multi-View Video Generation with Diffusion Model [76.96449336578286]
New challenges lie in the lack of massive captioned multi-view videos and the complexity of modeling such multi-dimensional distribution. We propose a novel diffusion-based pipeline that generates high-quality multi-view videos centered around a dynamic 3D object from text.
arXiv Detail & Related papers (2024-06-12T21:44:04Z)
Training-Free Semantic Video Composition via Pre-trained Diffusion Model [96.0168609879295]
Current approaches, predominantly trained on videos with adjusted foreground color and lighting, struggle to address deep semantic disparities beyond superficial adjustments. We propose a training-free pipeline employing a pre-trained diffusion model imbued with semantic prior knowledge. Experimental results reveal that our pipeline successfully ensures the visual harmony and inter-frame coherence of the outputs.
arXiv Detail & Related papers (2024-01-17T13:07:22Z)
Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps. We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process. Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
Unsupervised Video Domain Adaptation for Action Recognition: A Disentanglement Perspective [37.45565756522847]
We consider the generation of cross-domain videos from two sets of latent factors. The TranSVAE framework is then developed to model such generation. Experiments on the UCF-HMDB, Jester, and Epic-Kitchens datasets verify the effectiveness and superiority of TranSVAE.
arXiv Detail & Related papers (2022-08-15T17:59:31Z)
Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer [29.03463312813923]
Video denoising aims to recover high-quality frames from the noisy video. Most existing approaches adopt convolutional neural networks (CNNs) to separate the noise from the original visual content. We propose a Dual-stage Spatial-Channel Transformer (DSCT) for coarse-to-fine video denoising.
arXiv Detail & Related papers (2022-04-30T09:01:21Z)
Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency. Our method is only trained on a pair of original and processed videos directly instead of a large dataset. We show that temporal consistency can be achieved by training a convolutional neural network on a video with Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.