IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control
- URL: http://arxiv.org/abs/2508.04147v1
- Date: Wed, 06 Aug 2025 07:19:16 GMT
- Title: IDCNet: Guided Video Diffusion for Metric-Consistent RGBD Scene Generation with Precise Camera Control
- Authors: Lijuan Liu, Wenfa Li, Dongbo Zhang, Shuo Wang, Shaohui Jiao
- Abstract summary: IDC-Net is a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. We show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences.
- Score: 11.830304371371968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present IDC-Net (Image-Depth Consistency Network), a novel framework designed to generate RGB-D video sequences under explicit camera trajectory control. Unlike approaches that treat RGB and depth generation separately, IDC-Net jointly synthesizes both RGB images and corresponding depth maps within a unified geometry-aware diffusion model. The joint learning framework strengthens spatial and geometric alignment across frames, enabling more precise camera control in the generated sequences. To support the training of this camera-conditioned model and ensure high geometric fidelity, we construct a camera-image-depth consistent dataset with metric-aligned RGB videos, depth maps, and accurate camera poses, which provides precise geometric supervision with notably improved inter-frame geometric consistency. Moreover, we introduce a geometry-aware transformer block that enables fine-grained camera control over the generated sequences. Extensive experiments show that IDC-Net achieves improvements over state-of-the-art approaches in both visual quality and geometric consistency of generated scene sequences. Notably, the generated RGB-D sequences can be directly fed to downstream 3D scene reconstruction tasks without extra post-processing steps, showcasing the practical benefits of our joint learning framework. See more at https://idcnet-scene.github.io.
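The abstract names a geometry-aware transformer block for fine-grained camera control but does not describe its internals. A common design in recent camera-controlled video diffusion work is to encode each frame's camera as a per-pixel Plücker-ray map and inject it additively into the transformer tokens; the PyTorch sketch below illustrates that pattern under this assumption, and the names (plucker_rays, GeometryAwareBlock) and dimensions are ours, not the paper's.

```python
import torch
import torch.nn as nn

def plucker_rays(K, c2w, h, w):
    """Per-pixel Plucker-ray embedding (direction, moment) for one camera.

    K: (3, 3) intrinsics; c2w: (4, 4) camera-to-world pose.
    Returns (h*w, 6) ray embeddings.
    """
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=torch.float32),
        torch.arange(w, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)
    dirs = pix.reshape(-1, 3) @ torch.linalg.inv(K).T   # camera-space rays
    dirs = dirs @ c2w[:3, :3].T                         # rotate to world space
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)
    origin = c2w[:3, 3].expand_as(dirs)
    moment = torch.cross(origin, dirs, dim=-1)          # Plucker moment o x d
    return torch.cat([dirs, moment], dim=-1)

class GeometryAwareBlock(nn.Module):
    """Hypothetical transformer block with additive per-pixel camera conditioning."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.ray_proj = nn.Linear(6, dim)   # lift 6-D rays into token space
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens, rays):
        # tokens: (B, h*w, dim) latent tokens of one frame
        # rays:   (B, h*w, 6) Plucker embeddings of that frame's camera
        x = tokens + self.ray_proj(rays)    # inject camera before attention
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]
        return x + self.mlp(self.norm2(x))
```

Because the Plücker moment encodes the camera origin as well as the ray direction, each token receives the full pose, which is what makes per-pixel conditioning finer-grained than a single global pose embedding.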
Related papers
- Geometry-as-context: Modulating Explicit 3D in Scene-consistent Video Generation to Geometry Context [33.99324999592141]
Scene-consistent video generation aims to create videos that explore 3D scenes based on a camera trajectory. Previous methods rely on video generation models with external memory for consistency. We introduce "geometry-as-context" to overcome these limitations.
arXiv Detail & Related papers (2026-02-25T14:09:03Z) - CETCAM: Camera-Controllable Video Generation via Consistent and Extensible Tokenization [32.42754288735215]
CETCAM is a camera-controllable video generation framework. It eliminates the need for camera annotations through a consistent and extensible tokenization scheme. It learns robust camera controllability from diverse raw video data and refines fine-grained visual quality using high-fidelity datasets.
arXiv Detail & Related papers (2025-12-22T04:21:39Z) - Generative Point Cloud Registration [39.19949818461193]
We propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks. Our key idea is to generate cross-view consistent image pairs that are well-aligned with the source and target point clouds. To achieve this, we introduce Match-ControlNet, a matching-specific, controllable 2D generative model.
arXiv Detail & Related papers (2025-12-10T08:01:20Z) - DualCamCtrl: Dual-Branch Diffusion Model for Geometry-Aware Camera-Controlled Video Generation [51.66285725139235]
We present DualCamCtrl, a novel end-to-end diffusion model for camera-controlled video generation. We propose a dual-branch framework whose two branches jointly generate camera-consistent RGB and depth sequences. Experiments demonstrate that DualCamCtrl achieves more consistent camera-controlled video generation.
arXiv Detail & Related papers (2025-11-28T12:19:57Z) - AutoScape: Geometry-Consistent Long-Horizon Scene Generation [69.2451355181344]
AutoScape is a long-horizon driving scene generation framework. It generates realistic and geometrically consistent driving videos of over 20 seconds. It improves the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6% and 43.0%, respectively.
arXiv Detail & Related papers (2025-10-23T16:44:34Z) - CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image [44.8172828045897]
Current methods often struggle with domain-specific limitations or low-quality object generation. We propose CAST, a novel method for 3D scene reconstruction and recovery.
arXiv Detail & Related papers (2025-02-18T14:29:52Z) - Discovering an Image-Adaptive Coordinate System for Photography Processing [51.164345878060956]
We propose a novel algorithm, IAC, to learn an image-adaptive coordinate system in the RGB color space before performing curve operations. This end-to-end trainable approach enables us to efficiently adjust images with a jointly learned image-adaptive coordinate system and curves.
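The summary gives the idea (an image-adaptive coordinate system, then curve operations) but not the operators. One minimal reading, sketched below in PyTorch, predicts a near-identity 3x3 basis from global image statistics, applies a learnable per-axis curve in those coordinates, and maps back to RGB; every name and design choice here is an illustrative assumption, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ImageAdaptiveCoordCurve(nn.Module):
    """Toy sketch: per-image RGB coordinate system plus per-axis tone curves."""

    def __init__(self, n_knots=16):
        super().__init__()
        # Tiny predictor: per-image 3x3 basis from global channel statistics.
        self.encoder = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 9))
        # Learnable residual curve values, one row of knots per axis.
        self.curve = nn.Parameter(torch.zeros(3, n_knots))

    def forward(self, img):                                # (B, 3, H, W) in [0, 1]
        b = img.shape[0]
        stats = torch.cat([img.mean(dim=(2, 3)), img.std(dim=(2, 3))], dim=1)
        basis = self.encoder(stats).view(b, 3, 3) + torch.eye(3, device=img.device)
        flat = img.flatten(2).transpose(1, 2)              # (B, HW, 3)
        coords = flat @ basis.transpose(1, 2)              # into adaptive coords
        curved = coords + self._curve_residual(coords)     # curve adjustment
        out = curved @ torch.linalg.inv(basis).transpose(1, 2)  # back to RGB
        return out.transpose(1, 2).reshape_as(img).clamp(0, 1)

    def _curve_residual(self, x):
        # Nearest-knot lookup (interpolation omitted for brevity).
        n = self.curve.shape[1]
        idx = (x.clamp(0, 1) * (n - 1)).round().long()     # (B, HW, 3)
        offs = self.curve.gather(1, idx.reshape(-1, 3).T)  # (3, B*HW)
        return offs.T.reshape(x.shape)
```

At initialization the curve residual is zero and the basis round trip cancels exactly, so the module starts as an identity mapping, a common stabilizing choice for image enhancement networks.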
arXiv Detail & Related papers (2025-01-11T06:20:07Z) - GenRC: Generative 3D Room Completion from Sparse Image Collections [17.222652213723485]
GenRC is an automated training-free pipeline to complete a room-scale 3D mesh with high-fidelity textures.
E-Diffusion generates a view-consistent panoramic RGBD image, which ensures global geometry and appearance consistency.
GenRC outperforms state-of-the-art methods under most appearance and geometric metrics on ScanNet and ARKitScenes datasets.
arXiv Detail & Related papers (2024-07-17T18:10:40Z) - VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control [74.5434726968562]
We show how to tame video diffusion transformers for 3D camera control using a ControlNet-like conditioning mechanism. Our work is the first to enable camera control for transformer-based video diffusion models.
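ControlNet-style conditioning is well documented elsewhere: a trainable copy of a backbone block processes the control signal and feeds back through zero-initialized layers, so conditioning starts as a no-op and cannot destabilize the pretrained model. A minimal sketch of that pattern (not VD3D's actual architecture), assuming a generic token-to-token backbone_block module:

```python
import copy
import torch.nn as nn

class ControlNetLikeAdapter(nn.Module):
    """Minimal sketch of ControlNet-style conditioning for one backbone block."""

    def __init__(self, backbone_block, dim, control_dim):
        super().__init__()
        self.frozen = backbone_block.requires_grad_(False)  # pretrained, frozen
        self.copy = copy.deepcopy(backbone_block)           # trainable copy
        self.copy.requires_grad_(True)
        self.control_in = nn.Linear(control_dim, dim)       # embed control signal
        self.zero_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.zero_out.weight)                # branch contributes
        nn.init.zeros_(self.zero_out.bias)                  # nothing at init

    def forward(self, tokens, control):
        # tokens: (B, N, dim) latents; control: (B, N, control_dim), e.g. camera rays
        base = self.frozen(tokens)
        ctrl = self.copy(tokens + self.control_in(control))
        return base + self.zero_out(ctrl)                   # residual injection
```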
arXiv Detail & Related papers (2024-07-17T17:59:05Z) - PerLDiff: Controllable Street View Synthesis Using Perspective-Layout Diffusion Models [55.080748327139176]
We introduce PerLDiff, a novel method for effective street view image generation that fully leverages perspective 3D geometric information. PerLDiff employs 3D geometric priors to guide the generation of street view images with precise object-level control within the network learning process. Empirical results show that PerLDiff markedly enhances the precision of controllable generation on the NuScenes and KITTI datasets.
arXiv Detail & Related papers (2024-07-08T16:46:47Z) - SyncNoise: Geometrically Consistent Noise Prediction for Text-based 3D Scene Editing [58.22339174221563]
We propose SyncNoise, a novel geometry-guided multi-view consistent noise editing approach for high-fidelity 3D scene editing.
SyncNoise synchronously edits multiple views with 2D diffusion models while enforcing multi-view noise predictions to be geometrically consistent.
Our method achieves high-quality 3D editing results respecting the textual instructions, especially in scenes with complex textures.
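The summary does not state how noise predictions are made geometrically consistent. One plausible mechanism, sketched below as a toy, is to average the diffusion noise over geometrically corresponding pixels and write the shared value back to every view; the correspondence inputs (corr_idx, corr_mask) stand in for real geometry-derived correspondences and are assumptions here.

```python
import torch

def synchronize_noise(noise, corr_idx, corr_mask):
    """Toy sketch: share noise across geometrically corresponding pixels.

    noise:     (V, C, N) per-view noise predictions, pixels flattened to N.
    corr_idx:  (V, N) long tensor; each pixel's corresponding index in a
               reference view (placeholder for real multi-view geometry).
    corr_mask: (V, N) bool tensor; True where the correspondence is valid.
    """
    V, C, N = noise.shape
    accum = noise.new_zeros(C, N)                  # sums in the reference view
    count = noise.new_zeros(N)
    for v in range(V):
        idx = corr_idx[v][corr_mask[v]]
        accum.index_add_(1, idx, noise[v][:, corr_mask[v]])
        count.index_add_(0, idx, torch.ones_like(idx, dtype=noise.dtype))
    mean = accum / count.clamp(min=1)              # averaged noise per 3D point
    synced = noise.clone()
    for v in range(V):
        m = corr_mask[v]
        synced[v][:, m] = mean[:, corr_idx[v][m]]  # write shared noise back
    return synced
```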
arXiv Detail & Related papers (2024-06-25T09:17:35Z) - ViDSOD-100: A New Dataset and a Baseline Model for RGB-D Video Salient Object Detection [51.16181295385818]
We first collect an annotated RGB-D video salient object detection dataset, ViDSOD-100, which contains 100 videos with a total of 9,362 frames.
All frames in each video are manually annotated with high-quality saliency annotations.
We propose a new baseline model, named attentive triple-fusion network (ATF-Net), for RGB-D video salient object detection.
arXiv Detail & Related papers (2024-06-18T12:09:43Z) - FusionVision: A comprehensive approach of 3D object reconstruction and segmentation from RGB-D cameras using YOLO and fast segment anything [1.5728609542259502]
This paper introduces FusionVision, an exhaustive pipeline adapted for the robust 3D segmentation of objects in RGB-D imagery.
The proposed FusionVision pipeline employs YOLO for identifying objects within the RGB image domain.
FastSAM then segments the detected objects, and the resulting masks are fused with depth for 3D scene understanding, yielding a cohesive combination of object detection and segmentation.
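As a rough illustration of such a pipeline (not the authors' code), the sketch below uses the ultralytics package for YOLO detection and box-prompted FastSAM segmentation, then back-projects each mask through a pinhole model. The weight files and intrinsics (FX, FY, CX, CY) are placeholder assumptions, and the box-prompted FastSAM call assumes a recent ultralytics release.

```python
import numpy as np
from ultralytics import YOLO, FastSAM

# Placeholder pinhole intrinsics; replace with the RGB-D sensor's calibration.
FX, FY, CX, CY = 600.0, 600.0, 320.0, 240.0

def detect_and_lift(rgb, depth):
    """Detect objects in RGB, segment them, and back-project each mask to 3D.

    rgb: (H, W, 3) color image; depth: (H, W) metric depth in meters.
    Returns a list of (N_i, 3) point clouds, one per detected object.
    """
    det = YOLO("yolov8n.pt")(rgb)[0]                              # 2D boxes
    seg = FastSAM("FastSAM-s.pt")(rgb, bboxes=det.boxes.xyxy)[0]  # prompted masks
    clouds = []
    for mask in seg.masks.data.cpu().numpy():   # per-object mask (assumed at image size)
        v, u = np.nonzero(mask > 0.5)
        z = depth[v, u]
        u, v, z = u[z > 0], v[z > 0], z[z > 0]  # drop invalid depth readings
        x = (u - CX) * z / FX                   # pinhole back-projection
        y = (v - CY) * z / FY
        clouds.append(np.stack([x, y, z], axis=-1))
    return clouds
```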
arXiv Detail & Related papers (2024-02-29T22:59:27Z) - EvPlug: Learn a Plug-and-Play Module for Event and Image Fusion [55.367269556557645]
EvPlug learns a plug-and-play event and image fusion module from the supervision of the existing RGB-based model.
We demonstrate the superiority of EvPlug in several vision tasks such as object detection, semantic segmentation, and 3D hand pose estimation.
arXiv Detail & Related papers (2023-12-28T10:05:13Z) - Anyview: Generalizable Indoor 3D Object Detection with Variable Frames [60.48134767838629]
We present a novel 3D detection framework named AnyView for practical applications. Our method achieves both great generalizability and high detection accuracy with a simple and clean architecture.
arXiv Detail & Related papers (2023-10-09T02:15:45Z) - ODAM: Object Detection, Association, and Mapping using Posed RGB Video [36.16010611723447]
We present ODAM, a system for 3D Object Detection, Association, and Mapping using posed RGB videos.
The proposed system relies on a deep-learning front-end to detect 3D objects from a given RGB frame and associate them with a global object-based map using a graph neural network (GNN).
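The summary names a GNN for detection-to-map association without further detail. The toy PyTorch sketch below shows one generic way such an association network can be built, with a single round of message passing over the complete bipartite graph between per-frame detections and map objects; it is illustrative only and not ODAM's network.

```python
import torch
import torch.nn as nn

class AssociationGNN(nn.Module):
    """Toy sketch of GNN-based detection-to-map association (not ODAM's exact net)."""

    def __init__(self, dim=32):
        super().__init__()
        self.det_enc = nn.Linear(7, dim)   # e.g., 3D box center + size + score
        self.map_enc = nn.Linear(7, dim)
        self.msg = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, dets, map_objs):
        # dets: (N, 7) per-frame detections; map_objs: (M, 7) global map objects.
        d, m = self.det_enc(dets), self.map_enc(map_objs)
        # Edge features over the complete bipartite detection-map graph.
        pair = torch.cat([d[:, None].expand(-1, m.size(0), -1),
                          m[None].expand(d.size(0), -1, -1)], dim=-1)
        e = self.msg(pair)                 # (N, M, dim)
        d = d + e.mean(dim=1)              # aggregate messages to detections
        m = m + e.mean(dim=0)              # aggregate messages to map objects
        logits = d @ m.T                   # (N, M) affinity scores
        return logits.softmax(dim=1)       # soft assignment per detection
```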
arXiv Detail & Related papers (2021-08-23T13:28:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.