Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
- URL: http://arxiv.org/abs/2508.07981v2
- Date: Tue, 12 Aug 2025 03:46:18 GMT
- Title: Omni-Effects: Unified and Spatially-Controllable Visual Effects Generation
- Authors: Fangyuan Mao, Aiming Hao, Jintao Chen, Dongxia Liu, Xiaokun Feng, Jiashu Zhu, Meiqi Wu, Chubin Chen, Jiahong Wu, Xiangxiang Chu
- Abstract summary: We propose Omni-Effects, a framework capable of generating prompt-guided effects and spatially controllable composite effects. LoRA-based Mixture of Experts (LoRA-MoE) employs a group of expert LoRAs, integrating diverse effects within a unified model. Spatial-Aware Prompt (SAP) incorporates spatial mask information into the text token, enabling precise spatial control.
- Score: 11.41864836442447
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual effects (VFX) are essential visual enhancements fundamental to modern cinematic production. Although video generation models offer cost-efficient solutions for VFX production, current methods are constrained by per-effect LoRA training, which limits generation to single effects. This fundamental limitation impedes applications that require spatially controllable composite effects, i.e., the concurrent generation of multiple effects at designated locations. However, integrating diverse effects into a unified framework faces major challenges: interference from effect variations and spatial uncontrollability during multi-VFX joint training. To tackle these challenges, we propose Omni-Effects, the first unified framework capable of generating prompt-guided effects and spatially controllable composite effects. The core of our framework comprises two key innovations: (1) LoRA-based Mixture of Experts (LoRA-MoE), which employs a group of expert LoRAs, integrating diverse effects within a unified model while effectively mitigating cross-task interference; (2) Spatial-Aware Prompt (SAP), which incorporates spatial mask information into the text token, enabling precise spatial control. Furthermore, we introduce an Independent-Information Flow (IIF) module integrated within the SAP, isolating the control signals corresponding to individual effects to prevent unwanted blending. To facilitate this research, we construct a comprehensive VFX dataset, Omni-VFX, via a novel data collection pipeline combining image editing and First-Last Frame-to-Video (FLF2V) synthesis, and introduce a dedicated VFX evaluation framework for validating model performance. Extensive experiments demonstrate that Omni-Effects achieves precise spatial control and diverse effect generation, enabling users to specify both the category and location of desired effects.
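The LoRA-MoE mechanism lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch rendering: a frozen base linear layer augmented with a group of expert LoRAs whose updates are mixed by a learned gate. The expert count, rank, and token-wise softmax gating are illustrative assumptions, not the paper's reported configuration; the SAP/IIF mask-token machinery is omitted.

```python
import torch
import torch.nn as nn

class LoRAMoELinear(nn.Module):
    """Hypothetical LoRA-MoE layer: a frozen base projection plus a
    group of expert LoRAs whose low-rank updates are mixed per token
    by a learned gate. Expert count, rank, and gating are assumptions."""

    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # only the experts and gate are trained
        d_in, d_out = base.in_features, base.out_features
        # One low-rank (A, B) pair per effect expert.
        self.A = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        self.gate = nn.Linear(d_in, num_experts)  # routes tokens to experts

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, d_in)
        weights = torch.softmax(self.gate(x), dim=-1)               # (b, t, E)
        lora = torch.einsum("btd,edr,erk->btek", x, self.A, self.B)
        update = (weights.unsqueeze(-1) * lora).sum(dim=2)          # mix experts
        return self.base(x) + update

layer = LoRAMoELinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 16, 64))  # (batch, tokens, dim) -> same shape
```

Initializing B to zero makes each expert a no-op at the start of training, so the frozen backbone's behavior is preserved until the experts specialize.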
Related papers
- Tuning-free Visual Effect Transfer across Videos [91.93897438317397]
RefVFX is a framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. We introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video. We show that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference.
arXiv Detail & Related papers (2026-01-12T18:59:32Z)
- EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition [26.91723676903844]
We introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. We finetune a video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact. During sampling, the Effect Expert is used for denoising at early, high-noise steps, while the Quality Expert takes over at later, low-noise steps; a sketch of this hand-off follows this entry.
arXiv Detail & Related papers (2025-12-26T04:57:59Z)
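A minimal sketch of the expert hand-off described above, assuming a diffusers-style scheduler exposing `timesteps` and `step(...).prev_sample`; the expert interfaces and the hand-off threshold are illustrative assumptions:

```python
import torch

def sample_with_handoff(latent, effect_expert, quality_expert,
                        scheduler, handoff_step: int = 25):
    """Hypothetical sampler: the Effect Expert denoises the early,
    high-noise steps and the Quality Expert takes over for the later,
    low-noise steps. Interfaces and threshold are assumptions."""
    for i, t in enumerate(scheduler.timesteps):
        expert = effect_expert if i < handoff_step else quality_expert
        with torch.no_grad():
            noise_pred = expert(latent, t)  # predict noise at this step
        latent = scheduler.step(noise_pred, t, latent).prev_sample
    return latent
```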
- IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning [13.89445714667069]
IC-Effect is an instruction-guided, in-context framework for few-shot video VFX editing. It synthesizes complex effects while preserving spatial and temporal consistency. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning, ensures strong instruction following and robust effect modeling; a skeleton of this schedule follows this entry.
arXiv Detail & Related papers (2025-12-17T17:47:18Z)
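The two-stage schedule might look like the following skeleton: broad editing adaptation first, then effect-specific fine-tuning at a lower learning rate. Loaders, loss, step counts, and learning rates are placeholders, not the paper's recipe:

```python
import torch

def two_stage_finetune(model, general_loader, effect_loader, loss_fn):
    """Hypothetical two-stage schedule: general editing adaptation,
    then effect-specific learning at a lower rate. All hyperparameters
    are placeholders."""
    stages = [(general_loader, 1e-4, 10_000),  # stage 1: general editing
              (effect_loader, 1e-5, 2_000)]    # stage 2: effect-specific
    for loader, lr, steps in stages:
        opt = torch.optim.AdamW(
            [p for p in model.parameters() if p.requires_grad], lr=lr)
        for _, batch in zip(range(steps), loader):
            opt.zero_grad()
            loss_fn(model, batch).backward()
            opt.step()
    return model
```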
- Generative Photographic Control for Scene-Consistent Video Cinematic Editing [75.45726688666083]
We propose CineCtrl, the first video cinematic editing framework that provides fine control over professional camera parameters. We introduce a decoupled cross-attention mechanism to disentangle camera motion from photographic inputs; a sketch of this idea follows this entry. Our model generates high-fidelity videos with precisely controlled, user-specified photographic camera effects.
arXiv Detail & Related papers (2025-11-17T03:17:23Z)
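A minimal sketch of a decoupled cross-attention block in the spirit of the summary above: video tokens attend to camera-motion tokens and photographic-parameter tokens through separate attention paths so the two control signals stay disentangled. Shapes, head count, and the additive fusion are assumptions:

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Hypothetical block: video tokens attend to camera-motion tokens
    and photographic-parameter tokens via separate attention paths so
    the two control signals stay disentangled until the residual sum."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn_cam = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_photo = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_tokens, cam_tokens, photo_tokens):
        cam_out, _ = self.attn_cam(video_tokens, cam_tokens, cam_tokens)
        photo_out, _ = self.attn_photo(video_tokens, photo_tokens, photo_tokens)
        return video_tokens + cam_out + photo_out  # additive fusion (assumed)
```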
- VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning [67.44716618860544]
We introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, we propose an efficient one-shot effect adaptation mechanism that rapidly boosts generalization to difficult unseen effects from a single user-provided video.
arXiv Detail & Related papers (2025-10-29T17:59:53Z)
- FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [92.4205087439928]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose the Parameter-efficient Self-supervised Transfer (PST) and the Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models, effectively mitigating data scarcity. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches; a toy frequency split follows this entry. This combined approach enables FUSE to construct a universal image-event representation that only requires lightweight decoder adaptation for target datasets.
arXiv Detail & Related papers (2025-03-25T15:04:53Z)
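The frequency decoupling that FreDF performs can be illustrated with a toy FFT-based split of a feature map into low-frequency structure and high-frequency edges; the circular cutoff and the split rule are assumptions, not the paper's module:

```python
import torch

def frequency_decouple(feat: torch.Tensor, cutoff: float = 0.25):
    """Toy split of a (b, c, h, w) feature map into low-frequency
    structure and high-frequency edges via a circular mask in the
    2D Fourier domain. The cutoff radius is an assumption."""
    f = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    h, w = feat.shape[-2:]
    yy, xx = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    mask = ((xx**2 + yy**2).sqrt() <= cutoff).float().to(feat.device)
    low = torch.fft.ifft2(torch.fft.ifftshift(f * mask, dim=(-2, -1))).real
    return low, feat - low  # (structure, edges)
```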
- VFX Creator: Animated Visual Effect Generation with Controllable Diffusion Transformer [56.81599836980222]
We propose a novel paradigm for animated VFX generation as image animation, where dynamic effects are generated from user-friendly textual descriptions and static reference images. Our work makes two primary contributions: (i) Open-VFX, the first high-quality VFX video dataset spanning 15 diverse effect categories, annotated with textual descriptions and start-end timestamps for temporal control, and (ii) VFX Creator, a controllable VFX generation framework based on a Video Diffusion Transformer.
arXiv Detail & Related papers (2025-02-09T18:12:25Z)
- Free-Form Motion Control: Controlling the 6D Poses of Camera and Objects in Video Generation [78.65431951506152]
We introduce a Synthetic dataset for Free-Form Motion Control (SynFMC). The proposed SynFMC dataset includes diverse object and environment categories. It covers various motion patterns according to specific rules, simulating common and complex real-world scenarios. The complete 6D pose information helps models learn to disentangle the motion effects of objects and the camera in a video.
arXiv Detail & Related papers (2025-01-02T18:59:45Z)
- CONMOD: Controllable Neural Frame-based Modulation Effects [6.132272910797383]
We introduce Controllable Neural Frame-based Modulation Effects (CONMOD), a single black-box model which emulates various LFO-driven effects in a frame-wise manner.
The model is capable of learning the continuous embedding space of two distinct phaser effects, enabling us to steer between effects and achieve creative outputs; a toy interpolation sketch follows this entry.
arXiv Detail & Related papers (2024-06-20T02:02:54Z)
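Steering between the two learned phaser effects amounts to moving through the effect embedding space; here is a toy linear interpolation under an assumed model interface:

```python
import torch

def steer_between_effects(model, emb_a, emb_b, audio_frames, alpha=0.5):
    """Hypothetical steering: linearly interpolate between two learned
    effect embeddings and condition the frame-wise model on the result.
    alpha=0 reproduces effect A, alpha=1 effect B."""
    emb = (1.0 - alpha) * emb_a + alpha * emb_b
    return model(audio_frames, condition=emb)
```

Sweeping `alpha` over time would morph the output continuously between the two phasers, which is the kind of creative output the summary alludes to.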
- TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control.
A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects.
Video sequences generated by TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
- Interactive Character Control with Auto-Regressive Motion Diffusion Models [18.727066177880708]
We propose A-MDM (Auto-regressive Motion Diffusion Model) for real-time motion synthesis.
Our conditional diffusion model takes an initial pose as input and auto-regressively generates successive motion frames conditioned on the previous frame.
We introduce a suite of techniques for incorporating interactive controls into A-MDM, such as task-oriented sampling, in-painting, and hierarchical reinforcement learning; a rollout sketch follows this entry.
arXiv Detail & Related papers (2023-06-01T07:48:34Z)
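The auto-regressive generation described above reduces to a simple rollout loop: each frame is sampled by the diffusion model conditioned on the previously generated frame. The `sample` interface is an assumption:

```python
import torch

def rollout(diffusion_model, init_pose: torch.Tensor, num_frames: int):
    """Hypothetical A-MDM-style rollout: frames are generated one at a
    time, each by a full reverse-diffusion pass conditioned on the
    previously generated frame. The sample() interface is assumed."""
    frames = [init_pose]
    for _ in range(num_frames - 1):
        with torch.no_grad():
            frames.append(diffusion_model.sample(cond=frames[-1]))
    return torch.stack(frames)  # (num_frames, pose_dim)
```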