IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control
- URL: http://arxiv.org/abs/2508.04147v1
- Date: Wed, 06 Aug 2025 07:19:16 GMT
- Title: IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control
- Authors: Lijuan Liu, Wenfa Li, Dongbo Zhang, Shuo Wang, Shaohui Jiao
- Abstract summary: IDC-Net is a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. We show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences.
- Score: 11.830304371371968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be directly fed to downstream 3D scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at https://idcnet-scene.github.io.
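The abstract names a geometry-aware transformer block for fine-grained camera control but does not describe its internals. A common design in recent camera-controlled video diffusion work is to encode each frame's camera as a per-pixel Plücker-ray map and inject it additively into the transformer tokens; the PyTorch sketch below illustrates that pattern under this assumption, and the names (plucker_rays, GeometryAwareBlock) and dimensions are ours, not the paper's.

```python
import torch
import torch.nn as nn

def plucker_rays(K, c2w, h, w):
    """Per-pixel Plucker-ray embedding (direction, moment) for one camera.

    K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose.
    Returns (h*w, 6) ray embeddings.
    """
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    dirs = pix.reshape(-1, 3) @ torch.linalg.inv(K).T   # camera-space rays
    dirs = dirs @ c2w[:3, :3].T                         # rotate to world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)
    moment = torch.cross(origin, dirs, dim=-1)          # Plucker moment o x d
    return torch.cat([dirs, moment], dim=-1)

class GeometryAwareBlock(nn.Module):
    """Hypothetical transformer block with additive per-pixel camera conditioning."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.ray_proj = nn.Linear(6, dim)   # lift 6-D rays into token space
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens, rays):
        # tokens: (B, h*w, dim) latent tokens of one frame
        # rays:   (B, h*w, 6) Plucker embeddings of that frame's camera
        x = tokens + self.ray_proj(rays)    # inject camera before attention
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))
```

Because the Plücker moment encodes the camera origin as well as the ray direction, each token receives the full pose, which is what makes per-pixel conditioning finer-grained than a single global pose embedding.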
Related papers
- Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context [33.99324999592141]
Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency. We introduce "geometry-as-context" to overcome these limitations.
arXiv Detail & Related papers (2026-02-25T14:09:03Z) - CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization [32.42754288735215]
CETCAM is a camera-controllable video generation framework. It eliminates the need for camera annotations through a consistent and extensible tokenization scheme. It learns robust camera controllability from diverse raw video data and refines fine-grained visual quality using high-fidelity datasets.
arXiv Detail & Related papers (2025-12-22T04:21:39Z) - Generative Point Cloud Registration [39.19949818461193]
We propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks. Our key idea is to generate cross-view consistent image pairs that are well-aligned with the source and target point clouds. To achieve this, we introduce Match-ControlNet, a matching-specific, controllable 2D generative model.
arXiv Detail & Related papers (2025-12-10T08:01:20Z) - DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation [51.66285725139235]
We present DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. We propose a dual-branch framework whose two branches jointly generate camera-consistent RGB and depth sequences. Experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation.
arXiv Detail & Related papers (2025-11-28T12:19:57Z) - AutoScape: Geometry-Consistent Long-Horizon Scene Generation [69.2451355181344]
AutoScape is a long-horizon driving scene generation framework. It generates realistic and geometrically consistent driving videos of over 20 seconds. It improves the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively.
arXiv Detail & Related papers (2025-10-23T16:44:34Z) - CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image [44.8172828045897]
Current methods often struggle with domain-specific limitations or low-quality object generation. We propose CAST, a novel method for 3D scene reconstruction and recovery.
arXiv Detail & Related papers (2025-02-18T14:29:52Z) - Discovering an Image-Adaptive Coordinate System for Photography Processing [51.164345878060956]
We propose a novel algorithm, IAC, to learn an image-adaptive coordinate system in the RGB color space before performing curve operations. This end-to-end trainable approach enables us to efficiently adjust images with a jointly learned image-adaptive coordinate system and curves.
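The summary gives the idea (an image-adaptive coordinate system, then curve operations) but not the operators. One minimal reading, sketched below in PyTorch, predicts a near-identity 3x3 basis from global image statistics, applies a learnable per-axis curve in those coordinates, and maps back to RGB; every name and design choice here is an illustrative assumption, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ImageAdaptiveCoordCurve(nn.Module):
    """Toy sketch: per-image RGB coordinate system plus per-axis tone curves."""

    def __init__(self, n_knots=16):
        super().__init__()
        # Tiny predictor: per-image 3x3 basis from global channel statistics.
        self.encoder = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 9))
        # Learnable residual curve values, one row of knots per axis.
        self.curve = nn.Parameter(torch.zeros(3, n_knots))

    def forward(self, img):                                # (B, 3, H, W) in [0, 1]
        b = img.shape[0]
        stats = torch.cat([img.mean(dim=(2, 3)), img.std(dim=(2, 3))], dim=1)
        basis = self.encoder(stats).view(b, 3, 3) + torch.eye(3, device=img.device)
        flat = img.flatten(2).transpose(1, 2)              # (B, HW, 3)
        coords = flat @ basis.transpose(1, 2)              # into adaptive coords
        curved = coords + self._curve_residual(coords)     # curve adjustment
        out = curved @ torch.linalg.inv(basis).transpose(1, 2)  # back to RGB
        return out.transpose(1, 2).reshape_as(img).clamp(0, 1)

    def _curve_residual(self, x):
        # Nearest-knot lookup (interpolation omitted for brevity).
        n = self.curve.shape[1]
        idx = (x.clamp(0, 1) * (n - 1)).round().long()     # (B, HW, 3)
        offs = self.curve.gather(1, idx.reshape(-1, 3).T)  # (3, B*HW)
        return offs.T.reshape(x.shape)
```

At initialization the curve residual is zero and the basis round trip cancels exactly, so the module starts as an identity mapping, a common stabilizing choice for image enhancement networks.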
arXiv Detail & Related papers (2025-01-11T06:20:07Z) - GenRC: Generative 3D Room Completion from Sparse Image Collections [17.222652213723485]
GenRC is an automated training-free pipeline to complete a room-scale 3D mesh with high-fidelity textures.
E-Diffusion generates a view-consistent panoramic RGBD image, which ensures global geometry and appearance consistency.
GenRC outperforms state-of-the-art methods under most appearance and geometric metrics on ScanNet and ARKitScenes datasets.
arXiv Detail & Related papers (2024-07-17T18:10:40Z) - VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We show how to tame video diffusion transformers for 3D camera control using a ControlNet-like conditioning mechanism. Our work is the first to enable camera control for transformer-based video diffusion models.
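ControlNet-style conditioning is well documented elsewhere: a trainable copy of a backbone block processes the control signal and feeds back through zero-initialized layers, so conditioning starts as a no-op and cannot destabilize the pretrained model. A minimal sketch of that pattern (not VD3D's actual architecture), assuming a generic token-to-token backbone_block module:

```python
import copy
import torch.nn as nn

class ControlNetLikeAdapter(nn.Module):
    """Minimal sketch of ControlNet-style conditioning for one backbone block."""

    def __init__(self, backbone_block, dim, control_dim):
        super().__init__()
        self.frozen = backbone_block.requires_grad_(False)  # pretrained, frozen
        self.copy = copy.deepcopy(backbone_block)           # trainable copy
        self.copy.requires_grad_(True)
        self.control_in = nn.Linear(control_dim, dim)       # embed control signal
        self.zero_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_out.weight)                # branch contributes
        nn.init.zeros_(self.zero_out.bias)                  # nothing at init

    def forward(self, tokens, control):
        # tokens: (B, N, dim) latents; control: (B, N, control_dim), e.g. camera rays
        base = self.frozen(tokens)
        ctrl = self.copy(tokens + self.control_in(control))
        return base + self.zero_out(ctrl)                   # residual injection
```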
arXiv Detail & Related papers (2024-07-17T17:59:05Z) - PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models [55.080748327139176]
We introduce PerLDiff, a novel method for effective street view image generation that fully leverages perspective 3D geometric information. PerLDiff employs 3D geometric priors to guide the generation of street view images with precise object-level control within the network learning process. Empirical results show that PerLDiff markedly enhances the precision of controllable generation on the NuScenes and KITTI datasets.
arXiv Detail & Related papers (2024-07-08T16:46:47Z) - SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing [58.22339174221563]
We propose SyncNoise, a novel geometry-guided multi-view consistent noise editing approach for high-fidelity 3D scene editing.
SyncNoise synchronously edits multiple views with 2D diffusion models while enforcing multi-view noise predictions to be geometrically consistent.
Our method achieves high-quality 3D editing results respecting the textual instructions, especially in scenes with complex textures.
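The summary does not state how noise predictions are made geometrically consistent. One plausible mechanism, sketched below as a toy, is to average the diffusion noise over geometrically corresponding pixels and write the shared value back to every view; the correspondence inputs (corr_idx, corr_mask) stand in for real geometry-derived correspondences and are assumptions here.

```python
import torch

def synchronize_noise(noise, corr_idx, corr_mask):
    """Toy sketch: share noise across geometrically corresponding pixels.

    noise:     (V, C, N) per-view noise predictions, pixels flattened to N.
    corr_idx:  (V, N) long tensor; each pixel's corresponding index in a
               reference view (placeholder for real multi-view geometry).
    corr_mask: (V, N) bool tensor; True where the correspondence is valid.
    """
    V, C, N = noise.shape
    accum = noise.new_zeros(C, N)                  # sums in the reference view
    count = noise.new_zeros(N)
    for v in range(V):
        idx = corr_idx[v][corr_mask[v]]
        accum.index_add_(1, idx, noise[v][:, corr_mask[v]])
        count.index_add_(0, idx, torch.ones_like(idx, dtype=noise.dtype))
    mean = accum / count.clamp(min=1)              # averaged noise per 3D point
    synced = noise.clone()
    for v in range(V):
        m = corr_mask[v]
        synced[v][:, m] = mean[:, corr_idx[v][m]]  # write shared noise back
    return synced
```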
arXiv Detail & Related papers (2024-06-25T09:17:35Z) - ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection [51.16181295385818]
We first collect an annotated RGB-D video salient object detection dataset, ViDSOD-100, which contains 100 videos with a total of 9,362 frames.
All frames in each video are manually annotated with high-quality saliency annotations.
We propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D video salient object detection.
arXiv Detail & Related papers (2024-06-18T12:09:43Z) - FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anything [1.5728609542259502]
This paper introduces FusionVision, an exhaustive pipeline adapted for the robust 3D segmentation of objects in RGB-D imagery.
The proposed FusionVision pipeline employs YOLO for identifying objects within the RGB image domain.
FastSAM then segments the detected objects, and the resulting masks are fused with depth for 3D scene understanding, yielding a cohesive combination of object detection and segmentation.
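As a rough illustration of such a pipeline (not the authors' code), the sketch below uses the ultralytics package for YOLO detection and box-prompted FastSAM segmentation, then back-projects each mask through a pinhole model. The weight files and intrinsics (FX, FY, CX, CY) are placeholder assumptions, and the box-prompted FastSAM call assumes a recent ultralytics release.

```python
import numpy as np
from ultralytics import YOLO, FastSAM

# Placeholder pinhole intrinsics; replace with the RGB-D sensor's calibration.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

def detect_and_lift(rgb, depth):
    """Detect objects in RGB, segment them, and back-project each mask to 3D.

    rgb: (H, W, 3) color image; depth: (H, W) metric depth in meters.
    Returns a list of (N_i, 3) point clouds, one per detected object.
    """
    det = YOLO("yolov8n.pt")(rgb)[0]                              # 2D boxes
    seg = FastSAM("FastSAM-s.pt")(rgb, bboxes=det.boxes.xyxy)[0]  # prompted masks
    clouds = []
    for mask in seg.masks.data.cpu().numpy():   # per-object mask (assumed at image size)
        v, u = np.nonzero(mask > 0.5)
        z = depth[v, u]
        u, v, z = u[z > 0], v[z > 0], z[z > 0]  # drop invalid depth readings
        x = (u - CX) * z / FX                   # pinhole back-projection
        y = (v - CY) * z / FY
        clouds.append(np.stack([x, y, z], axis=-1))
    return clouds
```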
arXiv Detail & Related papers (2024-02-29T22:59:27Z) - EvPlug: Learn a Plug-and-Play Module for Event and Image Fusion [55.367269556557645]
EvPlug learns a plug-and-play event and image fusion module from the supervision of the existing RGB-based model.
We demonstrate the superiority of EvPlug in several vision tasks such as object detection, semantic segmentation, and 3D hand pose estimation.
arXiv Detail & Related papers (2023-12-28T10:05:13Z) - Anyview: Generalizable Indoor 3D Object Detection with Variable Frames [60.48134767838629]
We present a novel 3D detection framework named AnyView for practical applications. Our method achieves both great generalizability and high detection accuracy with a simple and clean architecture.
arXiv Detail & Related papers (2023-10-09T02:15:45Z) - ODAM: Object Detection, Association, and Mapping using Posed RGB Video [36.16010611723447]
We present ODAM, a system for 3D Object Detection, Association, and Mapping using posed RGB videos.
The proposed system relies on a deep-learning front-end to detect 3D objects from a given RGB frame and associate them with a global object-based map using a graph neural network (GNN).
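The summary names a GNN for detection-to-map association without further detail. The toy PyTorch sketch below shows one generic way such an association network can be built, with a single round of message passing over the complete bipartite graph between per-frame detections and map objects; it is illustrative only and not ODAM's network.

```python
import torch
import torch.nn as nn

class AssociationGNN(nn.Module):
    """Toy sketch of GNN-based detection-to-map association (not ODAM's exact net)."""

    def __init__(self, dim=32):
        super().__init__()
        self.det_enc = nn.Linear(7, dim)   # e.g., 3D box center + size + score
        self.map_enc = nn.Linear(7, dim)
        self.msg = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, dets, map_objs):
        # dets: (N, 7) per-frame detections; map_objs: (M, 7) global map objects.
        d, m = self.det_enc(dets), self.map_enc(map_objs)
        # Edge features over the complete bipartite detection-map graph.
        pair = torch.cat([d[:, None].expand(-1, m.size(0), -1),
                          m[None].expand(d.size(0), -1, -1)], dim=-1)
        e = self.msg(pair)                 # (N, M, dim)
        d = d + e.mean(dim=1)              # aggregate messages to detections
        m = m + e.mean(dim=0)              # aggregate messages to map objects
        logits = d @ m.T                   # (N, M) affinity scores
        return logits.softmax(dim=1)       # soft assignment per detection
```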
arXiv Detail & Related papers (2021-08-23T13:28:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.