Related papers: V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties

URL: http://arxiv.org/abs/2512.11799v1
Date: Fri, 12 Dec 2025 18:59:54 GMT
Title: V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
Authors: Ye Fang, Tong Wu, Valentin Deschaintre, Duygu Ceylan, Iliyan Georgiev, Chun-Hao Paul Huang, Yiwei Hu, Xuelin Chen, Tuanfeng Yang Wang
Abstract summary: We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: video inverse rendering into intrinsic channels, video synthesis from these intrinsic representations, and keyframe-based video editing conditioned on intrinsic channels. We show that V-RGBX produces temporally consistent, photorealistic videos while propagating intrinsic appearance edits across sequences in a physically plausible manner.
Score: 31.579053991884845
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.

Related papers

VIRGi: View-dependent Instant Recoloring of 3D Gaussians Splats [53.602701067430075]
We introduce VIRGi, a novel approach for rapidly editing the color of scenes modeled by 3DGS. By fine-tuning the weights of a single user, the color edits are seamlessly propagated to the entire scene in just two seconds. An exhaustive validation on diverse datasets demonstrates significant quantitative and qualitative advancements over competitors.
arXiv Detail & Related papers (2026-03-03T13:41:17Z)
Tuning-free Visual Effect Transfer across Videos [91.93897438317397]
RefVFX is a framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. We introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video. We show that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference.
arXiv Detail & Related papers (2026-01-12T18:59:32Z)
X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering [25.939894201559426]
X2Video is the first diffusion model for video rendering guided by intrinsic channels including albedo, normal, roughness, metallicity, and irradiance. It supports intuitive multi-modal controls with reference images and text prompts for both global and local regions. X2Video can produce long, temporally consistent, and photorealistic videos guided by intrinsic conditions.
arXiv Detail & Related papers (2025-10-09T17:50:31Z)
Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer [41.82610275115671]
We present ColorCtrl, a training-free color editing method. By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing. Our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency.
arXiv Detail & Related papers (2025-08-12T17:57:04Z)
IntrinsicEdit: Precise generative image manipulation in intrinsic space [53.404235331886255]
We introduce a versatile, generative workflow that operates in an intrinsic-image latent space. We address key challenges of identity preservation and intrinsic-channel entanglement. We enable precise, efficient editing with automatic resolution of global illumination effects.
arXiv Detail & Related papers (2025-05-13T18:24:15Z)
SketchVideo: Sketch-based Video Generation and Editing [51.99066098393491]
We aim to achieve sketch-based spatial and motion control for video generation and support fine-grained editing of real or synthetic videos. Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks. For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial feature and dynamic motion.
arXiv Detail & Related papers (2025-03-30T02:44:09Z)
DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models [83.28670336340608]
We introduce DiffusionRenderer, a neural approach that addresses the dual problem of inverse and forward rendering. Our model enables practical applications from a single video input, including relighting, material editing, and realistic object insertion.
arXiv Detail & Related papers (2025-01-30T18:59:11Z)
MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation [55.101611012677616]
Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks. We present MAKIMA, a tuning-free multi-attribute editing (MAE) framework built upon pretrained T2I models for open-domain video editing.
arXiv Detail & Related papers (2024-12-28T02:36:51Z)
I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model. Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z)
MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation [74.32046206403177]
MagicProp disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation. In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame. In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach.
arXiv Detail & Related papers (2023-09-02T11:13:29Z)
RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images. Existing methods invert video frames individually, often leading to undesired inconsistent results over time. We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID). Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.