VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer
- URL: http://arxiv.org/abs/2502.05979v4
- Date: Tue, 01 Apr 2025 07:54:57 GMT
- Title: VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer
- Authors: Xinyu Liu, Ailing Zeng, Wei Xue, Harry Yang, Wenhan Luo, Qifeng Liu, Yike Guo
- Abstract summary: We propose a novel paradigm for animated VFX generation as image animation, where dynamic effects are generated from user-friendly textual descriptions and static reference images. Our work makes two primary contributions: (i) Open-VFX, the first high-quality VFX video dataset spanning 15 diverse effect categories, annotated with textual descriptions and start-end timestamps for temporal control, and (ii) VFX Creator, a controllable VFX generation framework based on a Video Diffusion Transformer.
- Score: 56.81599836980222
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Crafting magic and illusions is one of the most thrilling aspects of filmmaking, with visual effects (VFX) serving as the powerhouse behind unforgettable cinematic experiences. While recent advances in generative artificial intelligence have driven progress in generic image and video synthesis, the domain of controllable VFX generation remains relatively underexplored. In this work, we propose a novel paradigm for animated VFX generation as image animation, where dynamic effects are generated from user-friendly textual descriptions and static reference images. Our work makes two primary contributions: (i) Open-VFX, the first high-quality VFX video dataset spanning 15 diverse effect categories, annotated with textual descriptions, instance segmentation masks for spatial conditioning, and start-end timestamps for temporal control. (ii) VFX Creator, a simple yet effective controllable VFX generation framework based on a Video Diffusion Transformer. The model incorporates a spatial and temporal controllable LoRA adapter, requiring minimal training videos. Specifically, a plug-and-play mask control module enables instance-level spatial manipulation, while tokenized start-end motion timestamps embedded in the diffusion process, alongside the text encoder, allow precise temporal control over effect timing and pace. Extensive experiments on the Open-VFX test set demonstrate the superiority of the proposed system in generating realistic and dynamic effects, achieving state-of-the-art performance and generalization ability in both spatial and temporal controllability. Furthermore, we introduce a specialized metric to evaluate the precision of temporal control. By bridging traditional VFX techniques with generative approaches, VFX Creator unlocks new possibilities for efficient and high-quality video effect generation, making advanced VFX accessible to a broader audience.
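As a rough illustration of the conditioning described in the abstract, the sketch below is hypothetical and not the authors' released code: the module names, the simple MLP timestamp tokenizer, and the dimensions are all assumptions. It shows start-end timestamps projected into the text-token space and concatenated with the text-encoder tokens, plus a LoRA adapter that adds a trainable low-rank update to a frozen linear layer of the video diffusion transformer.

```python
import torch
import torch.nn as nn

class TimestampTokenizer(nn.Module):
    """Map normalized start/end timestamps to a conditioning token that can
    sit alongside text-encoder tokens (illustrative, not the paper's code)."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
        )

    def forward(self, start: torch.Tensor, end: torch.Tensor) -> torch.Tensor:
        # start, end: (batch,) values in [0, 1], relative to clip length
        ts = torch.stack([start, end], dim=-1)   # (batch, 2)
        return self.proj(ts).unsqueeze(1)        # (batch, 1, hidden_dim)

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank (LoRA) update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # update starts at zero
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.up(self.down(x))

# Usage: append the timestamp token to the text tokens used for cross-attention.
text_tokens = torch.randn(2, 77, 768)
ts_token = TimestampTokenizer()(torch.tensor([0.1, 0.0]), torch.tensor([0.6, 1.0]))
cond = torch.cat([text_tokens, ts_token], dim=1)  # (2, 78, 768)
```

In the actual system the plug-and-play mask control module and the timestamp tokens are integrated into the Video Diffusion Transformer itself; this sketch only covers token construction and the LoRA update.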
Related papers
- Tuning-free Visual Effect Transfer across Videos [91.93897438317397]
RefVFX is a framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. We introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video. We show that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference.
arXiv Detail & Related papers (2026-01-12T18:59:32Z) - IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning [13.89445714667069]
IC-Effect is an instruction-guided framework for few-shot video VFX editing. It synthesizes complex effects while preserving spatial and temporal consistency. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning, ensures strong instruction following and robust effect modeling.
arXiv Detail & Related papers (2025-12-17T17:47:18Z) - VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning [67.44716618860544]
We introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, we propose an efficient one-shot effect adaptation mechanism that rapidly improves generalization to challenging unseen effects from a single user-provided video.
arXiv Detail & Related papers (2025-10-29T17:59:53Z) - Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation [19.620765157987012]
We propose Omni-Effects, a framework capable of generating prompt-guided effects and spatially controllable composite effects. A LoRA-based Mixture of Experts (LoRA-MoE) employs a group of expert LoRAs, integrating diverse effects within a unified model. A Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control.
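As a minimal sketch of what a LoRA-based mixture of experts over a frozen linear layer could look like (the gating scheme, layer names, and expert count are assumptions, not the Omni-Effects implementation):

```python
import torch
import torch.nn as nn

class LoRAMoE(nn.Module):
    """Several LoRA experts share one frozen base layer; a learned gate
    mixes their low-rank updates per token (illustrative sketch)."""
    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.downs = nn.ModuleList(
            nn.Linear(base.in_features, rank, bias=False) for _ in range(num_experts))
        self.ups = nn.ModuleList(
            nn.Linear(rank, base.out_features, bias=False) for _ in range(num_experts))
        for up in self.ups:
            nn.init.zeros_(up.weight)            # each expert starts as a no-op
        self.gate = nn.Linear(base.in_features, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)                       # (..., E)
        expert_out = torch.stack(
            [up(down(x)) for up, down in zip(self.ups, self.downs)], dim=-1)  # (..., out, E)
        mixed = (expert_out * weights.unsqueeze(-2)).sum(dim=-1)            # (..., out)
        return self.base(x) + mixed
```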
arXiv Detail & Related papers (2025-08-11T13:41:24Z) - Automated Video Segmentation Machine Learning Pipeline [1.3198143828338367]
This paper presents an automated video segmentation pipeline that creates temporally consistent instance masks. It employs machine learning for: (1) flexible object detection via text prompts, (2) refined per-frame image segmentation, and (3) robust video tracking to ensure temporal stability.
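A schematic of such a three-stage pipeline is sketched below, with placeholder callables standing in for the text-prompted detector, the per-frame segmenter, and the tracker; the function names and signatures are illustrative, not taken from the paper.

```python
from typing import Callable, List
import numpy as np

def segment_video(
    frames: List[np.ndarray],
    prompt: str,
    detect: Callable[[np.ndarray, str], List[np.ndarray]],    # (1) text-prompted boxes
    segment: Callable[[np.ndarray, np.ndarray], np.ndarray],  # (2) box -> per-frame mask
    track: Callable[[List[List[np.ndarray]]], List[List[np.ndarray]]],  # (3) temporal smoothing
) -> List[List[np.ndarray]]:
    """Sketch of the three stages: detect objects matching the prompt,
    refine each detection into a mask, then associate masks across frames
    for temporal stability."""
    per_frame_masks = []
    for frame in frames:
        boxes = detect(frame, prompt)
        per_frame_masks.append([segment(frame, box) for box in boxes])
    return track(per_frame_masks)
```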
arXiv Detail & Related papers (2025-07-09T19:27:06Z) - PromptVFX: Text-Driven Fields for Open-World 3D Gaussian Animation [49.91188543847175]
We reformulate 3D animation as a field prediction task and introduce a text-driven framework that infers a time-varying 4D flow field acting on 3D Gaussians. By leveraging large language models (LLMs) and vision-language models (VLMs) for function generation, our approach interprets arbitrary prompts and instantly updates the color, opacity, and positions of 3D Gaussians in real time.
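To make the field-prediction idea concrete, here is a toy, hand-written stand-in for an LLM-generated field that updates Gaussian positions, opacity, and color over time; the swirl-and-fade behaviour and attribute names are purely illustrative.

```python
import numpy as np

def flow_field(xyz: np.ndarray, t: float) -> np.ndarray:
    """Toy stand-in for a generated flow field: swirl points about the
    vertical axis while lifting them upward."""
    x, y, z = xyz[:, 0], xyz[:, 1], xyz[:, 2]
    angle = 0.5 * t
    cos_a, sin_a = np.cos(angle), np.sin(angle)
    return np.stack([cos_a * x - sin_a * y,
                     sin_a * x + cos_a * y,
                     z + 0.1 * t], axis=1)

def animate(gaussians: dict, t: float) -> dict:
    """Apply the field to positions and fade opacity / brighten color over time."""
    out = dict(gaussians)
    out["xyz"] = flow_field(gaussians["xyz"], t)
    out["opacity"] = gaussians["opacity"] * np.exp(-0.2 * t)
    out["rgb"] = np.clip(gaussians["rgb"] * (1.0 + 0.1 * t), 0.0, 1.0)
    return out

# Usage with a random point cloud standing in for 3D Gaussians.
g = {"xyz": np.random.randn(1000, 3),
     "opacity": np.ones(1000),
     "rgb": np.random.rand(1000, 3)}
frame = animate(g, t=0.5)
```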
arXiv Detail & Related papers (2025-06-01T17:22:59Z) - Temporal Regularization Makes Your Video Generator Stronger [34.33572297364156]
Temporal quality is a critical aspect of video generation, as it ensures consistent motion and realistic dynamics across frames.
We explore temporal augmentation in video generation for the first time and introduce FluxFlow as an initial investigation.
Experiments on UCF-101 and VBench benchmarks demonstrate that FluxFlow significantly improves temporal coherence and diversity across various video generation models.
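FluxFlow's exact augmentations are defined in the paper; purely as an illustration of frame-level temporal perturbation, a training-time augmentation might swap a few neighbouring frames, for example:

```python
import random
import torch

def temporal_perturb(video: torch.Tensor, max_shift: int = 2, p: float = 0.5) -> torch.Tensor:
    """Randomly swap a handful of nearby frames as a simple temporal
    augmentation; video is (batch, frames, channels, height, width).
    Illustrative only, not FluxFlow's implementation."""
    if random.random() > p:
        return video
    t = video.shape[1]
    order = list(range(t))
    for pos in random.sample(range(t), max(1, t // 8)):
        j = min(pos + random.randint(1, max_shift), t - 1)
        order[pos], order[j] = order[j], order[pos]
    return video[:, order]
```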
arXiv Detail & Related papers (2025-03-19T16:59:32Z) - DiffuEraser: A Diffusion Model for Video Inpainting [13.292164408616257]
We introduce DiffuEraser, a video inpainting model based on stable diffusion, to fill masked regions with greater detail and more coherent structure. We also expand the temporal receptive fields of both the prior model and DiffuEraser, and further enhance consistency by leveraging the temporal smoothing property of video diffusion models.
arXiv Detail & Related papers (2025-01-17T08:03:02Z) - MotionAura: Generating High-Quality and Motion Consistent Videos using Discrete Diffusion [3.7270979204213446]
We present four key contributions to address the challenges of video processing.
First, we introduce a 3D Inverted Vector-Quantization Variational Autoencoder.
Second, we present MotionAura, a text-to-video generation framework.
Third, we propose a spectral transformer-based denoising network.
Fourth, we introduce the downstream task of Sketch Guided Video Inpainting.
arXiv Detail & Related papers (2024-10-10T07:07:56Z) - VEnhancer: Generative Space-Time Enhancement for Video Generation [123.37212575364327]
VEnhancer improves existing text-to-video results by adding more detail in the spatial domain and synthesizing detailed motion in the temporal domain.
We train a video ControlNet and inject it into the diffusion model as a condition on low frame-rate, low-resolution videos.
VEnhancer surpasses existing state-of-the-art video super-resolution and space-time super-resolution methods in enhancing AI-generated videos.
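A toy sketch of ControlNet-style conditioning on a low-resolution, low-frame-rate video is shown below; the zero-initialized output convolution (so the condition starts with no influence) follows the general ControlNet recipe, but the module layout and channel choices are assumptions rather than VEnhancer's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ControlResidual(nn.Module):
    """Encode the low-res, low-frame-rate video, resize it to the target
    space-time resolution, and add it as a residual to a feature map of the
    diffusion backbone (illustrative sketch)."""
    def __init__(self, cond_channels: int = 3, feat_channels: int = 320):
        super().__init__()
        self.encode = nn.Conv3d(cond_channels, feat_channels, kernel_size=3, padding=1)
        self.zero_out = nn.Conv3d(feat_channels, feat_channels, kernel_size=1)
        nn.init.zeros_(self.zero_out.weight)   # zero-init: no influence at start
        nn.init.zeros_(self.zero_out.bias)

    def forward(self, feat: torch.Tensor, low_res_video: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, T, H, W); low_res_video: (B, 3, t, h, w) with t < T, h < H, w < W
        cond = F.interpolate(low_res_video, size=feat.shape[2:], mode="trilinear",
                             align_corners=False)
        return feat + self.zero_out(self.encode(cond))
```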
arXiv Detail & Related papers (2024-07-10T13:46:08Z) - LASER: Tuning-Free LLM-Driven Attention Control for Efficient Text-conditioned Image-to-Animation [52.16008431411513]
LASER is a tuning-free LLM-driven attention control framework.
We propose a Text-conditioned Image-to-Animation Benchmark to validate the effectiveness and efficiency of LASER.
arXiv Detail & Related papers (2024-04-21T07:13:56Z) - MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation [74.32046206403177]
MagicProp disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation.
In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame.
In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach.
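The two-stage structure can be sketched as follows, with placeholder callables standing in for the image editor and the autoregressive renderer; the names and signatures are illustrative, and this sketch only propagates forward from the keyframe.

```python
from typing import Callable, List
import torch

def two_stage_edit(
    frames: List[torch.Tensor],
    edit_frame: Callable[[torch.Tensor], torch.Tensor],
    render_next: Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor],
    key_index: int = 0,
) -> List[torch.Tensor]:
    """Stage 1: edit a single keyframe with any image-editing model.
    Stage 2: autoregressively render the remaining frames, conditioning each
    step on the edited appearance reference, the previous output, and the
    corresponding source frame (for motion)."""
    reference = edit_frame(frames[key_index])        # stage 1: appearance editing
    outputs = [reference]
    prev = reference
    for src in frames[key_index + 1:]:               # stage 2: propagation
        prev = render_next(reference, prev, src)
        outputs.append(prev)
    return outputs
```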
arXiv Detail & Related papers (2023-09-02T11:13:29Z) - MGMAE: Motion Guided Masking for Video Masked Autoencoding [34.80832206608387]
Temporal redundancy has led to a high masking ratio and a customized masking strategy in VideoMAE.
Our motion guided masking explicitly incorporates motion information to build a temporally consistent masking volume.
We perform experiments on the datasets of Something-Something V2 and Kinetics-400, demonstrating the superior performance of our MGMAE to the original VideoMAE.
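As a simplified illustration of the idea (not the MGMAE implementation), an initial mask can be propagated along coarse per-frame motion vectors so the masked region follows the content over time; a real implementation would warp per-pixel with dense optical flow.

```python
import torch

def motion_guided_masks(init_mask: torch.Tensor, flows: torch.Tensor) -> torch.Tensor:
    """Build a temporally consistent masking volume by shifting the frame-0
    mask along per-frame motion vectors.
    init_mask: (H, W) boolean mask; flows: (T-1, 2) integer (dy, dx) per frame."""
    masks = [init_mask]
    for dy, dx in flows.long():
        masks.append(torch.roll(masks[-1], shifts=(int(dy), int(dx)), dims=(0, 1)))
    return torch.stack(masks)   # (T, H, W)
```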
arXiv Detail & Related papers (2023-08-21T15:39:41Z) - VDT: General-purpose Video Diffusion Transformers via Mask Modeling [62.71878864360634]
Video Diffusion Transformer (VDT) pioneers the use of transformers in diffusion-based video generation.
We propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
arXiv Detail & Related papers (2023-05-22T17:59:45Z) - Latent-Shift: Latent Diffusion with Temporal Shift for Efficient Text-to-Video Generation [115.09597127418452]
Latent-Shift is an efficient text-to-video generation method based on a pretrained text-to-image generation model.
We show that Latent-Shift achieves comparable or better results while being significantly more efficient.
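The temporal-shift idea can be illustrated with a parameter-free operation that moves one slice of channels a frame forward and another slice a frame backward, letting each frame's features mix with its neighbours; this is a generic temporal-shift sketch, not the paper's exact module.

```python
import torch

def temporal_shift(x: torch.Tensor, fold_div: int = 8) -> torch.Tensor:
    """Parameter-free temporal shift over (batch, frames, channels, H, W):
    the first 1/fold_div of channels is shifted forward in time, the next
    1/fold_div backward, and the rest is left unchanged."""
    b, t, c, h, w = x.shape
    fold = c // fold_div
    out = torch.zeros_like(x)
    out[:, 1:, :fold] = x[:, :-1, :fold]                   # shift forward in time
    out[:, :-1, fold:2 * fold] = x[:, 1:, fold:2 * fold]   # shift backward in time
    out[:, :, 2 * fold:] = x[:, :, 2 * fold:]              # untouched channels
    return out
```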
arXiv Detail & Related papers (2023-04-17T17:57:06Z)