V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
- URL: http://arxiv.org/abs/2512.11799v1
- Date: Fri, 12 Dec 2025 18:59:54 GMT
- Title: V-RGBX: Video Editing with Accurate Controls over Intrinsic Properties
- Authors: Ye Fang, Tong Wu, Valentin Deschaintre, Duygu Ceylan, Iliyan Georgiev, Chun-Hao Paul Huang, Yiwei Hu, Xuelin Chen, Tuanfeng Yang Wang
- Abstract summary: We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: video inverse rendering into intrinsic channels, video synthesis from these intrinsic representations, and keyframe-based video editing conditioned on intrinsic channels. We show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner.
- Score: 31.579053991884845
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale video generation models have shown remarkable potential in modeling photorealistic appearance and lighting interactions in real-world scenes. However, a closed-loop framework that jointly understands intrinsic scene properties (e.g., albedo, normal, material, and irradiance), leverages them for video synthesis, and supports editable intrinsic representations remains unexplored. We present V-RGBX, the first end-to-end framework for intrinsic-aware video editing. V-RGBX unifies three key capabilities: (1) video inverse rendering into intrinsic channels, (2) photorealistic video synthesis from these intrinsic representations, and (3) keyframe-based video editing conditioned on intrinsic channels. At the core of V-RGBX is an interleaved conditioning mechanism that enables intuitive, physically grounded video editing through user-selected keyframes, supporting flexible manipulation of any intrinsic modality. Extensive qualitative and quantitative results show that V-RGBX produces temporally consistent, photorealistic videos while propagating keyframe edits across sequences in a physically plausible manner. We demonstrate its effectiveness in diverse applications, including object appearance editing and scene-level relighting, surpassing the performance of prior methods.
Related papers
- VIRGi: View-dependent Instant Recoloring of 3D Gaussians Splats [53.602701067430075]
We introduce VIRGi, a novel approach for rapidly editing the color of scenes modeled by 3DGS.
By fine-tuning the weights of a single user, the color edits are seamlessly propagated to the entire scene in just two seconds.
An exhaustive validation on diverse datasets demonstrates significant quantitative and qualitative advancements over competitors.
arXiv Detail & Related papers (2026-03-03T13:41:17Z)
- Tuning-free Visual Effect Transfer across Videos [91.93897438317397]
RefVFX is a framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner.
We introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video.
We show that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference.
arXiv Detail & Related papers (2026-01-12T18:59:32Z)
- X2Video: Adapting Diffusion Models for Multimodal Controllable Neural Video Rendering [25.939894201559426]
X2Video is the first diffusion model for video rendering guided by intrinsic channels, including albedo, normal, roughness, metallicity, and irradiance.
It supports intuitive multi-modal controls with reference images and text prompts for both global and local regions.
X2Video can produce long, temporally consistent, and photorealistic videos guided by intrinsic conditions.
arXiv Detail & Related papers (2025-10-09T17:50:31Z)
- Training-Free Text-Guided Color Editing with Multi-Modal Diffusion Transformer [41.82610275115671]
We present ColorCtrl, a training-free color editing method.
By disentangling structure and color through targeted manipulation of attention maps and value tokens, our method enables accurate and consistent color editing.
Our method surpasses strong commercial models such as FLUX.1 Kontext Max and GPT-4o Image Generation in terms of consistency.
arXiv Detail & Related papers (2025-08-12T17:57:04Z)
- IntrinsicEdit: Precise generative image manipulation in intrinsic space [53.404235331886255]
We introduce a versatile, generative workflow that operates in an intrinsic-image latent space.
We address key challenges of identity preservation and intrinsic-channel entanglement.
We enable precise, efficient editing with automatic resolution of global illumination effects.
arXiv Detail & Related papers (2025-05-13T18:24:15Z)
- SketchVideo: Sketch-based Video Generation and Editing [51.99066098393491]
We aim to achieve sketch-based spatial and motion control for video generation and support fine-grained editing of real or synthetic videos.
Based on the DiT video generation model, we propose a memory-efficient control structure with sketch control blocks that predict residual features of skipped DiT blocks.
For sketch-based video editing, we design an additional video insertion module that maintains consistency between the newly edited content and the original video's spatial feature and dynamic motion.
arXiv Detail & Related papers (2025-03-30T02:44:09Z)
- DiffusionRenderer: Neural Inverse and Forward Rendering with Video Diffusion Models [83.28670336340608]
We introduce DiffusionRenderer, a neural approach that addresses the dual problem of inverse and forward rendering.
Our model enables practical applications from a single video input, including relighting, material editing, and realistic object insertion.
arXiv Detail & Related papers (2025-01-30T18:59:11Z)
- MAKIMA: Tuning-free Multi-Attribute Open-domain Video Editing via Mask-Guided Attention Modulation [55.101611012677616]
Diffusion-based text-to-image (T2I) models have demonstrated remarkable results in global video editing tasks.
We present MAKIMA, a tuning-free multi-attribute editing (MAE) framework built upon pretrained T2I models for open-domain video editing.
arXiv Detail & Related papers (2024-12-28T02:36:51Z)
- I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model.
Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z)
- MagicProp: Diffusion-based Video Editing via Motion-aware Appearance Propagation [74.32046206403177]
MagicProp disentangles the video editing process into two stages: appearance editing and motion-aware appearance propagation.
In the first stage, MagicProp selects a single frame from the input video and applies image-editing techniques to modify the content and/or style of the frame.
In the second stage, MagicProp employs the edited frame as an appearance reference and generates the remaining frames using an autoregressive rendering approach.
arXiv Detail & Related papers (2023-09-02T11:13:29Z)
- RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually, often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.