Tuning-free Visual Effect Transfer across Videos
- URL: http://arxiv.org/abs/2601.07833v3
- Date: Wed, 14 Jan 2026 04:31:07 GMT
- Title: Tuning-free Visual Effect Transfer across Videos
- Authors: Maxwell Jones, Rameen Abdal, Or Patashnik, Ruslan Salakhutdinov, Sergey Tulyakov, Jun-Yan Zhu, Kuan-Chieh Jackson Wang
- Abstract summary: RefVFX is a framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. We introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video. We show that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference.
- Score: 91.93897438317397
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present RefVFX, a new framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. While existing methods excel at prompt-based or keyframe-conditioned editing, they struggle with dynamic temporal effects, such as dynamic lighting changes or character transformations, which are difficult to describe via text or static conditions. Transferring a video effect is challenging, as the model must integrate the new temporal dynamics with the input video's existing motion and appearance. To address this, we introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video depicting the transferred effect. Creating this data is non-trivial, especially the video-to-video effect triplets, which do not exist naturally. To generate these, we propose a scalable automated pipeline that creates high-quality paired videos designed to preserve the input's motion and structure while transforming it according to a fixed, repeatable effect. We then augment this data with image-to-video effects derived from LoRA adapters and code-based temporal effects generated through programmatic composition. Building on our new dataset, we train our reference-conditioned model using recent text-to-video backbones. Experimental results demonstrate that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference. See our website at https://tuningfreevisualeffects-maker.github.io/Tuning-free-Visual-Effect-Transfer-across-Videos-Project-Page/
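To make the abstract's "code-based temporal effects generated through programmatic composition" concrete, here is a minimal sketch of how a deterministic, repeatable effect could be written as code and used to assemble (reference, input, output) triplets. The function and the specific effect are illustrative assumptions, not details from the paper.

```python
import numpy as np

def brightness_pulse(frames: np.ndarray, period: int = 16) -> np.ndarray:
    """Apply a deterministic, periodic brightness gain to a clip.

    frames: uint8 array of shape (T, H, W, 3).
    Because the gain depends only on the frame index, the effect is
    repeatable: the output preserves the input's motion and structure
    while adding a fixed temporal transformation.
    """
    t = np.arange(frames.shape[0], dtype=np.float32)
    gain = 1.0 + 0.5 * np.sin(2.0 * np.pi * t / period)   # per-frame gain
    out = frames.astype(np.float32) * gain[:, None, None, None]
    return np.clip(out, 0.0, 255.0).astype(np.uint8)

# A hypothetical triplet pairs a reference clip showing the effect with a
# new input and the same effect applied to that input:
# triplet = {"reference": brightness_pulse(reference_clip),
#            "input": input_clip,
#            "output": brightness_pulse(input_clip)}
```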
Related papers
- IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning [13.89445714667069]
IC-Effect is an instruction-guided, in-context framework for few-shot video VFX editing. It synthesizes complex effects while preserving spatial and temporal consistency. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning, ensures strong instruction following and robust effect modeling.
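As a rough illustration of such a two-stage schedule, the sketch below assumes a model whose forward pass returns a scalar loss, plus two data loaders (a broad editing corpus and a few-shot effect set); all names and step counts are hypothetical, not taken from the paper.

```python
import torch
from itertools import cycle, islice

def train_two_stage(model, general_loader, effect_loader,
                    steps=(10_000, 2_000), lr=1e-4):
    """Stage 1: general editing adaptation; stage 2: a short
    effect-specific pass over few-shot examples of one effect."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for loader, n_steps in ((general_loader, steps[0]),
                            (effect_loader, steps[1])):
        for batch in islice(cycle(loader), n_steps):
            loss = model(**batch)  # assumed to return a scalar training loss
            opt.zero_grad()
            loss.backward()
            opt.step()
```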
arXiv Detail & Related papers (2025-12-17T17:47:18Z)
- Vectorized Video Representation with Easy Editing via Hierarchical Spatio-Temporally Consistent Proxy Embedding [45.593989778240655]
The proposed representation achieves high video reconstruction accuracy with fewer parameters. It supports complex video processing tasks, including video inpainting and temporally consistent video editing.
arXiv Detail & Related papers (2025-10-14T08:05:30Z)
- GenCompositor: Generative Video Compositing with Diffusion Transformer [68.00271033575736]
Traditional pipelines require intensive labor and expert collaboration, resulting in lengthy production cycles and high manpower costs. This new task strives to adaptively inject the identity and motion of a foreground video into a target video in an interactive manner. Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing solutions in fidelity and consistency.
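For contrast, the classical compositing operation that such a generative approach replaces is plain per-frame alpha blending; the snippet below shows that textbook baseline, not the paper's method.

```python
import numpy as np

def alpha_composite(fg: np.ndarray, alpha: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """Classical per-frame blending: out = alpha * fg + (1 - alpha) * bg.

    fg, bg: (T, H, W, 3) floats in [0, 1]; alpha: (T, H, W, 1) in [0, 1].
    Unlike generative compositing, this cannot harmonize lighting,
    shadows, or interactions between the layers.
    """
    return alpha * fg + (1.0 - alpha) * bg
```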
arXiv Detail & Related papers (2025-09-02T16:10:13Z)
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
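The underlying mechanism, attention routed along point tracks so that corresponding points across frames exchange information, can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's actual layer.

```python
import torch
import torch.nn.functional as F

def track_attention(feats: torch.Tensor, tracks: torch.Tensor) -> torch.Tensor:
    """Attend along point tracks through time.

    feats:  (T, C, H, W) per-frame feature maps.
    tracks: (T, N, 2) point coordinates in [-1, 1] (grid_sample convention).
    Returns (T, N, C) temporally mixed features per track.
    """
    T, C, H, W = feats.shape
    # Sample one feature vector at each track location in each frame.
    grid = tracks.view(T, 1, -1, 2)                            # (T, 1, N, 2)
    sampled = F.grid_sample(feats, grid, align_corners=True)   # (T, C, 1, N)
    tokens = sampled.squeeze(2).permute(2, 0, 1)               # (N, T, C)
    # Plain self-attention over time, separately for each track, so that
    # features of the same physical point are mixed across frames.
    attn = torch.softmax(tokens @ tokens.transpose(1, 2) / C ** 0.5, dim=-1)
    out = attn @ tokens                                        # (N, T, C)
    return out.permute(1, 0, 2)                                # (T, N, C)
```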
arXiv Detail & Related papers (2025-03-25T17:58:48Z)
- VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer [56.81599836980222]
We propose a novel paradigm for animated VFX generation as image animation, where dynamic effects are generated from user-friendly textual descriptions and static reference images. Our work makes two primary contributions: (i) Open-VFX, the first high-quality VFX video dataset spanning 15 diverse effect categories, annotated with textual descriptions and start-end timestamps for temporal control, and (ii) VFX Creator, a controllable VFX generation framework based on a Video Diffusion Transformer.
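A record in such a dataset could look like the hypothetical structure below, reflecting the annotations the summary mentions (textual description plus start-end timestamps); the field names are assumptions, not the actual Open-VFX schema.

```python
from dataclasses import dataclass

@dataclass
class VFXClip:
    """Hypothetical Open-VFX-style record (field names are illustrative)."""
    video_path: str        # rendered effect video
    reference_image: str   # static image the animation starts from
    caption: str           # textual description of the effect
    effect_start: float    # effect onset, in seconds
    effect_end: float      # effect end, in seconds
```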
arXiv Detail & Related papers (2025-02-09T18:12:25Z)
- Hallo3: Highly Dynamic and Realistic Portrait Image Animation with Video Diffusion Transformer [25.39030226963548]
We introduce the first application of a pretrained transformer-based video generative model for portrait animation. Our method is validated through experiments on benchmark and newly proposed wild datasets.
arXiv Detail & Related papers (2024-12-01T08:54:30Z)
- MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing. First, we introduce the 3D Inverted Vector-Quantization Variational Autoencoder. Second, we present MotionAura, a text-to-video generation framework. Third, we propose a spectral transformer-based denoising network. Fourth, we introduce the downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z)
- Data Collection-free Masked Video Modeling [6.641717260925999]
We introduce an effective self-supervised learning framework for videos that leverages less costly static images, from which pseudo-motion videos are synthesized.
These pseudo-motion videos are then leveraged in masked video modeling.
Our approach is applicable to synthetic images as well, thus entirely freeing video training from data collection costs and other concerns with real data.
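One minimal way to realize "pseudo-motion" from a single image is to synthesize camera-like motion, e.g. by sliding a crop window across it; the sketch below is a simple stand-in for whatever pipeline the paper actually uses.

```python
import numpy as np

def pseudo_motion_clip(image: np.ndarray, n_frames: int = 16,
                       crop: int = 224, step: int = 4) -> np.ndarray:
    """Build a clip with apparent motion from one static image.

    image: (H, W, 3) uint8 with H, W >= crop + step * (n_frames - 1),
    so every shifted crop stays inside the image.
    Returns (n_frames, crop, crop, 3): a diagonal pan across the image.
    """
    frames = []
    for t in range(n_frames):
        y = x = t * step                        # shift the window each frame
        frames.append(image[y:y + crop, x:x + crop])
    return np.stack(frames)
```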
arXiv Detail & Related papers (2024-09-10T17:34:07Z)
- GenDeF: Learning Generative Deformation Field for Video Generation [89.49567113452396]
We propose to render a video by warping one static image with a generative deformation field (GenDeF).
Such a pipeline enjoys three appealing advantages.
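The rendering step, warping one static image with a per-frame deformation field, can be sketched with a standard grid-sampling warp. In GenDeF the field comes from a generative model; the sketch below takes it as a given input.

```python
import torch
import torch.nn.functional as F

def warp_video(image: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Render a clip by warping one image with per-frame deformations.

    image: (C, H, W) float tensor; flow: (T, H, W, 2) offsets in pixels,
    with flow[..., 0] = x-offset and flow[..., 1] = y-offset.
    Returns (T, C, H, W).
    """
    T, H, W, _ = flow.shape
    # Identity sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H),
                            torch.linspace(-1, 1, W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1)                  # (H, W, 2)
    # Convert pixel offsets to normalized units and displace the grid.
    scale = torch.tensor([(W - 1) / 2.0, (H - 1) / 2.0])
    grid = base.unsqueeze(0) + flow / scale               # (T, H, W, 2)
    frames = image.unsqueeze(0).expand(T, -1, -1, -1)     # (T, C, H, W)
    return F.grid_sample(frames, grid, align_corners=True)
```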
arXiv Detail & Related papers (2023-12-07T18:59:41Z)
- You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos [56.676761067861236]
Given an untrimmed video, temporal sentence grounding aims to locate a target moment semantically according to a sentence query.
Previous works have achieved decent success, but they focus only on high-level visual features extracted from decoded frames.
We propose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input.
arXiv Detail & Related papers (2023-03-14T12:53:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.