OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer
- URL: http://arxiv.org/abs/2601.14250v1
- Date: Tue, 20 Jan 2026 18:58:11 GMT
- Title: OmniTransfer: All-in-one Framework for Spatio-temporal Video Transfer
- Authors: Pengze Zhang, Yanze Wu, Mengtian Li, Xu Bai, Songtao Zhao, Fulong Ye, Chong Mou, Xinghui Li, Zhuowei Chen, Qian He, Mingyuan Gao
- Abstract summary: We propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across video frames to enhance appearance consistency, and exploits temporal cues to enable fine-grained temporal control.
- Score: 38.324957777123664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos convey richer information than images or text, capturing both spatial and temporal dynamics. However, most existing video customization methods rely on reference images or task-specific temporal priors, failing to fully exploit the rich spatio-temporal information inherent in videos, thereby limiting flexibility and generalization in video generation. To address these limitations, we propose OmniTransfer, a unified framework for spatio-temporal video transfer. It leverages multi-view information across frames to enhance appearance consistency and exploits temporal cues to enable fine-grained temporal control. To unify various video transfer tasks, OmniTransfer incorporates three key designs: Task-aware Positional Bias that adaptively leverages reference video information to improve temporal alignment or appearance consistency; Reference-decoupled Causal Learning separating reference and target branches to enable precise reference transfer while improving efficiency; and Task-adaptive Multimodal Alignment using multimodal semantic guidance to dynamically distinguish and tackle different tasks. Extensive experiments show that OmniTransfer outperforms existing methods in appearance (ID and style) and temporal transfer (camera movement and video effects), while matching pose-guided methods in motion transfer without using pose, establishing a new paradigm for flexible, high-fidelity video generation.
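To make the abstract's three designs more concrete, the sketch below illustrates how a reference branch can be decoupled from the target branch with an attention mask, combined with a simple task-dependent positional bias added to the attention logits. This is a minimal, illustrative PyTorch sketch under assumed shapes and names (one token per frame, a hand-picked bias scale); it is not the paper's released implementation.

```python
# Illustrative sketch only: shapes, names, and the exact masking/bias scheme
# are assumptions, not the OmniTransfer implementation.
import torch
import torch.nn.functional as F


def build_mask_and_bias(n_ref, n_tgt, task, device="cpu"):
    """Attention mask that decouples the reference branch from the target branch,
    plus a simple task-aware positional bias over frame offsets."""
    n = n_ref + n_tgt
    mask = torch.full((n, n), float("-inf"), device=device)
    mask[:n_ref, :n_ref] = 0.0          # reference tokens attend only to themselves
    mask[n_ref:, :n_ref] = 0.0          # target tokens may read the reference video
    mask[n_ref:, n_ref:] = torch.triu(  # causal attention inside the target branch
        torch.full((n_tgt, n_tgt), float("-inf"), device=device), diagonal=1)

    # Task-aware positional bias on target-to-reference attention:
    # temporal tasks (camera movement, effects) favour time-aligned reference frames,
    # appearance tasks (ID, style) treat all reference frames equally.
    bias = torch.zeros((n, n), device=device)
    if task == "temporal":
        offset = (torch.arange(n_tgt, device=device)[:, None]
                  - torch.arange(n_ref, device=device)[None, :])
        bias[n_ref:, :n_ref] = -0.1 * offset.abs().float()
    return mask, bias


def attend(q, k, v, mask, bias):
    """Scaled dot-product attention with an additive mask and positional bias."""
    logits = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(logits + mask + bias, dim=-1) @ v


# Tiny usage example with one token per frame for brevity.
d, n_ref, n_tgt = 8, 4, 6
q = torch.randn(n_ref + n_tgt, d)
k, v = torch.randn_like(q), torch.randn_like(q)
mask, bias = build_mask_and_bias(n_ref, n_tgt, task="temporal")
out = attend(q, k, v, mask, bias)  # (10, 8)
```

Blocking reference tokens from attending to the target keeps reference features independent of the generated content, which is one plausible reading of how decoupling the two branches enables precise transfer while improving efficiency.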
Related papers
- Tele-Omni: a Unified Multimodal Framework for Video Generation and Editing [93.8111348452324]
Tele-Omni is a unified framework for video generation and editing that follows multimodal instructions.
It supports text-to-video generation, image-to-video generation, first-last-frame video generation, in-context video generation, and in-context video editing.
arXiv Detail & Related papers (2026-02-10T10:01:16Z)
- Motion Marionette: Rethinking Rigid Motion Transfer via Prior Guidance [26.642143303176997]
Motion Marionette is a framework for rigid motion transfer from monocular source videos to single-view target images.
Motion trajectories are extracted from the source video to construct a spatial-temporal (SpaT) prior.
The resulting velocity field can be flexibly employed for efficient video production.
arXiv Detail & Related papers (2025-11-25T04:34:42Z)
- Versatile Transition Generation with Image-to-Video Diffusion [89.67070538399457]
We present a Versatile Transition video Generation framework that can generate smooth, high-fidelity, and semantically coherent video transitions.
We show that VTG achieves superior transition performance consistently across all four tasks.
arXiv Detail & Related papers (2025-08-03T10:03:56Z)
- TransFlow: Motion Knowledge Transfer from Video Diffusion Models to Video Salient Object Detection [14.635179908525389]
We present TransFlow, which transfers motion knowledge from pre-trained video diffusion models to generate realistic training data for video salient object detection.
Our method achieves improved performance across multiple benchmarks, demonstrating effective motion knowledge transfer.
arXiv Detail & Related papers (2025-07-26T04:30:44Z)
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts.
Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion.
We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
arXiv Detail & Related papers (2025-03-25T17:58:48Z)
- EfficientMT: Efficient Temporal Adaptation for Motion Transfer in Text-to-Video Diffusion Models [73.96414072072048]
Existing motion transfer methods explore the motion representations of reference videos to guide generation.
We propose EfficientMT, a novel and efficient end-to-end framework for video motion transfer.
Our experiments demonstrate that EfficientMT outperforms existing methods in efficiency while maintaining flexible motion controllability.
arXiv Detail & Related papers (2025-03-25T05:51:14Z)
- CustomVideoX: 3D Reference Attention Driven Dynamic Adaptation for Zero-Shot Customized Video Diffusion Transformers [15.558659099600822]
CustomVideoX capitalizes on pre-trained video networks by exclusively training the LoRA parameters to extract reference features.
We propose 3D Reference Attention, which enables direct and simultaneous engagement of reference image features.
Experimental results show that CustomVideoX significantly outperforms existing methods in terms of video consistency and quality.
arXiv Detail & Related papers (2025-02-10T14:50:32Z)
- Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces the randomly shuffled frames to have low-confidence outputs.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers (a minimal sketch of this objective appears after this list).
arXiv Detail & Related papers (2022-07-19T04:44:08Z) - End-to-end Multi-modal Video Temporal Grounding [105.36814858748285]
We propose a multi-modal framework to extract complementary information from videos.
We adopt RGB images for appearance, optical flow for motion, and depth maps for image structure.
We conduct experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-07-12T17:58:10Z)
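As noted in the "Time Is MattEr" entry above, its temporal self-supervision can be pictured as an auxiliary objective: the model additionally predicts whether a clip's frame order is natural or shuffled, and frame-shuffled clips are pushed toward low-confidence action predictions. The sketch below is a minimal, assumed formulation in PyTorch; the loss names, heads, and weights are illustrative rather than the authors' exact losses.

```python
# Assumed sketch of a "Time Is MattEr"-style objective; not the authors' exact losses.
import torch
import torch.nn.functional as F


def temporal_selfsup_losses(action_logits, action_logits_shuf, action_labels,
                            order_logits, order_labels, w_order=1.0, w_debias=1.0):
    """action_logits(_shuf): (B, K) class logits for natural / frame-shuffled clips.
    order_logits: (B, 2) natural-vs-shuffled predictions; order_labels: (B,)."""
    # Standard action-recognition loss on the naturally ordered clip.
    cls_loss = F.cross_entropy(action_logits, action_labels)
    # Extra self-supervision: predict whether the clip's frame order was shuffled.
    order_loss = F.cross_entropy(order_logits, order_labels)
    # Push frame-shuffled clips toward low-confidence (near-uniform) predictions.
    K = action_logits_shuf.shape[-1]
    uniform = torch.full_like(action_logits_shuf, 1.0 / K)
    debias_loss = F.kl_div(F.log_softmax(action_logits_shuf, dim=-1), uniform,
                           reduction="batchmean")
    return cls_loss + w_order * order_loss + w_debias * debias_loss


# Toy usage with random tensors standing in for a video backbone's two heads.
B, K = 4, 10
loss = temporal_selfsup_losses(
    action_logits=torch.randn(B, K), action_logits_shuf=torch.randn(B, K),
    action_labels=torch.randint(0, K, (B,)),
    order_logits=torch.randn(B, 2), order_labels=torch.randint(0, 2, (B,)))
```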