Related papers: FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion

FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion

URL: http://arxiv.org/abs/2601.15250v1
Date: Wed, 21 Jan 2026 18:32:27 GMT
Title: FlowSSC: Universal Generative Monocular Semantic Scene Completion via One-Step Latent Diffusion
Authors: Zichen Xi, Hao-Xiang Chen, Nan Xue, Hongyu Yan, Qi-Yuan Feng, Levent Burak Kara, Joaquim Jorge, Qun-Ce Xu,
Abstract summary: FlowSSC is the first generative framework applied directly to semantic scene completion.<n>To achieve real-time inference without compromising quality, we introduce Shortcut Flow-matching.<n>Our method utilizes a shortcut mechanism to achieve high-fidelity generation in a single step, enabling practical deployment in autonomous systems.
Score: 7.222522567077674
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Semantic Scene Completion (SSC) from monocular RGB images is a fundamental yet challenging task due to the inherent ambiguity of inferring occluded 3D geometry from a single view. While feed-forward methods have made progress, they often struggle to generate plausible details in occluded regions and preserve the fundamental spatial relationships of objects. Such accurate generative reasoning capability for the entire 3D space is critical in real-world applications. In this paper, we present FlowSSC, the first generative framework applied directly to monocular semantic scene completion. FlowSSC treats the SSC task as a conditional generation problem and can seamlessly integrate with existing feed-forward SSC methods to significantly boost their performance. To achieve real-time inference without compromising quality, we introduce Shortcut Flow-matching that operates in a compact triplane latent space. Unlike standard diffusion models that require hundreds of steps, our method utilizes a shortcut mechanism to achieve high-fidelity generation in a single step, enabling practical deployment in autonomous systems. Extensive experiments on SemanticKITTI demonstrate that FlowSSC achieves state-of-the-art performance, significantly outperforming existing baselines.

Related papers

StdGEN++: A Comprehensive System for Semantic-Decomposed 3D Character Generation [57.06461272772509]
StdGEN++ is a novel and comprehensive system for generating high-fidelity, semantically decomposed 3D characters from diverse inputs.<n>It achieves state-of-the-art performance, significantly outperforming existing methods in geometric accuracy and semantic disentanglement.<n>The resulting structural independence unlocks advanced downstream capabilities, including non-destructive editing, physics-compliant animation, and gaze tracking.
arXiv Detail & Related papers (2026-01-12T15:41:27Z)
SPHERE: Semantic-PHysical Engaged REpresentation for 3D Semantic Scene Completion [52.959716866316604]
Camera-based 3D Semantic Scene Completion (SSC) is a critical task in autonomous driving systems.<n>We propose the Semantic-PHysical Engaged REpresentation (SPHERE) for camera-based SSC.<n>SPHERE integrates voxel and Gaussian representations for joint exploitation of semantic and physical information.
arXiv Detail & Related papers (2025-09-14T09:07:41Z)
One Step Closer: Creating the Future to Boost Monocular Semantic Scene Completion [3.664655957801223]
In real-world traffic scenarios, a significant portion of a visual 3D scene remains occluded or outside the camera's field of view.<n>We propose Creating the Future SSC, a novel temporal SSC framework that leverages pseudo-future frame prediction to expand the model's effective perceptual range.<n>Our approach combines poses and depths to establish accurate 3D correspondences, enabling geometrically-consistent fusion of past, present, and predicted future frames in 3D space.
arXiv Detail & Related papers (2025-07-18T10:24:58Z)
CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting [53.15827818829865]
Methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies.<n>We propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues.<n>Our framework explicitly resolves semantic conflicts while preserving category discriminability.
arXiv Detail & Related papers (2025-05-26T19:09:33Z)
Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance [42.815766778681144]
3D Semantic Scene Completion (SSC) provides comprehensive scene geometry and semantics for autonomous driving perception.<n>Existing SSC methods are limited to capturing sparse information from the current frame or naively stacking multi-frame temporal features.<n>We propose a novel temporal SSC method FlowScene: Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance.
arXiv Detail & Related papers (2025-02-20T12:52:36Z)
MetaSSC: Enhancing 3D Semantic Scene Completion for Autonomous Driving through Meta-Learning and Long-sequence Modeling [3.139165705827712]
We introduce MetaSSC, a novel meta-learning-based framework for semantic scene completion (SSC)<n>Our approach begins with a voxel-based semantic segmentation (SS) pretraining task, aimed at exploring the semantics and geometry of incomplete regions.<n>Using simulated cooperative perception datasets, we supervise the perception training of a single vehicle using aggregated sensor data.<n>This meta-knowledge is then adapted to the target domain through a dual-phase training strategy, enabling efficient deployment.
arXiv Detail & Related papers (2024-11-06T05:11:25Z)
Hierarchical Temporal Context Learning for Camera-based Semantic Scene Completion [57.232688209606515]
We present HTCL, a novel Temporal Temporal Context Learning paradigm for improving camera-based semantic scene completion. Our method ranks $1st$ on the Semantic KITTI benchmark and even surpasses LiDAR-based methods in terms of mIoU.
arXiv Detail & Related papers (2024-07-02T09:11:17Z)
Camera-based 3D Semantic Scene Completion with Sparse Guidance Network [18.415854443539786]
We propose a camera-based semantic scene completion framework called SGN. SGN propagates semantics from semantic-aware seed voxels to the whole scene based on spatial geometry cues. Our experimental results demonstrate the superiority of our SGN over existing state-of-the-art methods.
arXiv Detail & Related papers (2023-12-10T04:17:27Z)
D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction [74.49121940466675]
We introduce centroid-fixed dual-stream conditional diffusion for monocular hand-held object reconstruction. First, to avoid the object centroid from deviating, we utilize a novel hand-constrained centroid fixing paradigm. Second, we introduce a dual-stream denoiser to semantically and geometrically model hand-object interactions.
arXiv Detail & Related papers (2023-11-23T20:14:50Z)
EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity [13.02735046166494]
Self-supervised monocular scene flow estimation has received increasing attention for its simple and economical sensor setup. We propose a superior model named EMR-MSF by borrowing the advantages of network architecture design under the scope of supervised learning. On the KITTI scene flow benchmark, our approach improves the SF-all metric of the state-of-the-art self-supervised monocular method by 44%.
arXiv Detail & Related papers (2023-09-04T00:30:06Z)
SSC-RS: Elevate LiDAR Semantic Scene Completion with Representation Separation and BEV Fusion [17.459062337718677]
We propose to solve outdoor SSC from the perspective of representation separation and BEV fusion. We present the network, named SSC-RS, which uses separate branches with deep supervision to explicitly disentangle the learning procedure of the semantic and geometric representations. A BEV fusion network equipped with the proposed Adaptive Representation Fusion (ARF) module is presented to aggregate the multi-scale features effectively and efficiently.
arXiv Detail & Related papers (2023-06-27T10:02:45Z)
Self-Supervised Multi-Frame Monocular Scene Flow [61.588808225321735]
We introduce a multi-frame monocular scene flow network based on self-supervised learning. We observe state-of-the-art accuracy among monocular scene flow methods based on self-supervised learning.
arXiv Detail & Related papers (2021-05-05T17:49:55Z)

This list is automatically generated from the titles and abstracts of the papers in this site.