Fake It To Make It: Virtual Multiviews to Enhance Monocular Indoor Semantic Scene Completion
- URL: http://arxiv.org/abs/2503.05086v1
- Date: Fri, 07 Mar 2025 02:09:38 GMT
- Title: Fake It To Make It: Virtual Multiviews to Enhance Monocular Indoor Semantic Scene Completion
- Authors: Anith Selvakumar, Manasa Bharadwaj
- Abstract summary: Monocular Indoor Semantic Scene Completion aims to reconstruct a 3D semantic occupancy map from a single RGB image of an indoor scene. We introduce an innovative approach that leverages novel view synthesis and multiview fusion. We demonstrate IoU score improvements of up to 2.8% for Scene Completion and 4.9% for Semantic Scene Completion when integrated with existing SSC networks on the NYUv2 dataset.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Monocular Indoor Semantic Scene Completion (SSC) aims to reconstruct a 3D semantic occupancy map from a single RGB image of an indoor scene, inferring spatial layout and object categories from 2D image cues. The challenge of this task arises from the depth, scale, and shape ambiguities that emerge when transforming a 2D image into 3D space, particularly within the complex and often heavily occluded environments of indoor scenes. Current SSC methods often struggle with these ambiguities, resulting in distorted or missing object representations. To overcome these limitations, we introduce an innovative approach that leverages novel view synthesis and multiview fusion. Specifically, we demonstrate how virtual cameras can be placed around the scene to emulate multiview inputs that enhance contextual scene information. We also introduce a Multiview Fusion Adaptor (MVFA) to effectively combine the multiview 3D scene predictions into a unified 3D semantic occupancy map. Finally, we identify and study the inherent limitation of generative techniques when applied to SSC, specifically the Novelty-Consistency tradeoff. Our system, GenFuSE, demonstrates IoU score improvements of up to 2.8% for Scene Completion and 4.9% for Semantic Scene Completion when integrated with existing SSC networks on the NYUv2 dataset. This work introduces GenFuSE as a standard framework for advancing monocular SSC with synthesized inputs.
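To make the pipeline concrete, here is a minimal Python sketch of the idea the abstract describes: place virtual cameras around the input view, synthesize a view per camera, predict per-view semantic occupancy, and fuse. The names `nvs_model` and `ssc_model` are placeholders rather than the paper's API, and the confidence-weighted vote below merely stands in for the learned Multiview Fusion Adaptor (MVFA).

```python
# Hypothetical sketch of the GenFuSE idea: emulate a multiview rig around a single
# input view, then fuse per-view semantic occupancy predictions. `nvs_model` and
# `ssc_model` are assumed callables, not the paper's actual interfaces.
import numpy as np

def virtual_camera_poses(base_pose: np.ndarray, n_views: int = 4,
                         max_angle_deg: float = 15.0) -> list:
    """Yaw the base camera left/right by small angles to emulate extra viewpoints."""
    poses = []
    for a in np.linspace(-max_angle_deg, max_angle_deg, n_views):
        t = np.radians(a)
        yaw = np.array([[np.cos(t), 0, np.sin(t), 0],
                        [0,         1, 0,         0],
                        [-np.sin(t), 0, np.cos(t), 0],
                        [0,         0, 0,         1]])
        poses.append(base_pose @ yaw)
    return poses

def genfuse(rgb, base_pose, nvs_model, ssc_model, grid=(60, 36, 60), n_cls=12):
    """Fuse per-view semantic occupancy logits (assumed already in a shared scene
    frame) into one map; a crude weighted vote stands in for the learned MVFA."""
    fused = np.zeros((*grid, n_cls))
    for pose in virtual_camera_poses(base_pose):
        view = nvs_model(rgb, pose)                 # synthesize a virtual view
        logits = ssc_model(view, pose)              # (X, Y, Z, n_cls) logits
        conf = logits.max(axis=-1, keepdims=True)   # per-voxel confidence proxy
        fused += conf * logits                      # confidence-weighted vote
    return fused.argmax(axis=-1)                    # final semantic labels
```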
Related papers
- Visibility-Uncertainty-guided 3D Gaussian Inpainting via Scene Conceptional Learning [63.94919846010485]
The key challenge of 3D Gaussian inpainting (3DGI) is effectively leveraging complementary visual and semantic cues from multiple input views.
We propose a method that measures the visibility uncertainties of 3D points across different input views and uses them to guide 3DGI.
We build a novel 3DGI framework, VISTA, by integrating VISibility-uncerTainty-guided 3DGI with scene conceptuAl learning.
arXiv Detail & Related papers (2025-04-23T06:21:11Z)
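As a rough illustration of the visibility-uncertainty idea in the VISTA summary above (the depth-agreement test and the uncertainty score below are assumptions, not the paper's actual formulation):

```python
# Rough sketch (not VISTA's formulation): a 3D point's visibility is uncertain when
# some views see it and others do not; such points are natural inpainting targets.
import numpy as np

def visibility_uncertainty(points, cams, depth_maps, K, tol=0.05):
    """points: (N, 3) world coords; cams: list of (R, t) world-to-camera poses;
    depth_maps: per-view (H, W) depth; K: 3x3 intrinsics."""
    visible = np.zeros((len(points), len(cams)))
    for j, ((R, t), depth) in enumerate(zip(cams, depth_maps)):
        p_cam = points @ R.T + t                          # world -> camera frame
        z = p_cam[:, 2]
        uv = (p_cam @ K.T)[:, :2] / np.maximum(z[:, None], 1e-6)
        u = np.clip(uv[:, 0].astype(int), 0, depth.shape[1] - 1)
        v = np.clip(uv[:, 1].astype(int), 0, depth.shape[0] - 1)
        # visible iff the point's depth matches the view's depth at its pixel
        visible[:, j] = (z > 0) & (np.abs(depth[v, u] - z) < tol * z)
    p = visible.mean(axis=1)         # fraction of views that see the point
    return 4 * p * (1 - p)           # peaks at p=0.5: seen by only some views
```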
- HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation [50.206100327643284]
HiScene is a novel hierarchical framework that bridges the gap between 2D image generation and 3D object generation.
We generate 3D content that aligns with 2D representations while maintaining compositional structure.
arXiv Detail & Related papers (2025-04-17T16:33:39Z)
- 3D Feature Distillation with Object-Centric Priors [9.626027459292926]
2D vision-language models such as CLIP have been widely popularized, due to their impressive capabilities for open-vocabulary grounding in 2D images.
Recent works aim to elevate 2D CLIP features to 3D via feature distillation, but either learn neural fields that are scene-specific or focus on indoor room scan data.
We show that our method reconstructs 3D CLIP features with improved grounding capacity and spatial consistency.
arXiv Detail & Related papers (2024-06-26T20:16:49Z)
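For context, here is a generic sketch of the feature-distillation recipe the summary above alludes to: lifting 2D feature maps (e.g., CLIP features) onto 3D points by projection and averaging. It does not reproduce the paper's object-centric priors.

```python
# Generic multi-view feature lifting: project each 3D point into every view,
# sample the 2D feature map there, and average across the views that see it.
import numpy as np

def lift_features(points, feat_maps, cams, K):
    """points: (N, 3); feat_maps: list of (H, W, C); cams: list of (R, t)."""
    N, C = len(points), feat_maps[0].shape[-1]
    acc = np.zeros((N, C))
    cnt = np.zeros((N, 1))
    for (R, t), fmap in zip(cams, feat_maps):
        p = points @ R.T + t                              # world -> camera frame
        z = p[:, 2]
        uv = (p @ K.T)[:, :2] / np.maximum(z[:, None], 1e-6)
        H, W, _ = fmap.shape
        inb = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) \
                      & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        u, v = uv[inb].astype(int).T
        acc[inb] += fmap[v, u]                            # sample 2D features
        cnt[inb] += 1
    return acc / np.maximum(cnt, 1)                       # per-point average
```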
- Zero-Shot Multi-Object Scene Completion [59.325611678171974]
We present a 3D scene completion method that recovers the complete geometry of multiple unseen objects in complex scenes from a single RGB-D image.
Our method outperforms the current state-of-the-art on both synthetic and real-world datasets.
arXiv Detail & Related papers (2024-03-21T17:59:59Z)
- NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space [77.6067460464962]
Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs.
We identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Imbalance in the 3D convolution across different depth levels.
We devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2D feature map to the normalized device coordinates space rather than to the world space.
arXiv Detail & Related papers (2023-09-26T02:09:52Z)
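A minimal sketch of the NDC idea under a simple pinhole-camera assumption: features live in a frustum-aligned volume indexed like the image, and each voxel maps back to camera space. Plain repetition along depth stands in for the paper's learned deconvolutions.

```python
# Lifting a 2D feature map into a frustum-aligned (NDC) volume instead of
# scattering features into a world-space grid. NDC-Scene restores the depth
# dimension with learned deconvolutions; np.repeat here is only a stand-in.
import numpy as np

def lift_to_ndc(feat2d: np.ndarray, depth_bins: int) -> np.ndarray:
    """feat2d: (H, W, C) image features -> (H, W, D, C) frustum-aligned volume."""
    return np.repeat(feat2d[:, :, None, :], depth_bins, axis=2)

def ndc_to_camera(h, w, d, D, K, z_near=0.5, z_far=10.0):
    """Map an NDC voxel index (h, w, d) back to a 3D point in the camera frame."""
    z = z_near + (z_far - z_near) * (d + 0.5) / D      # depth of this bin
    x = (w + 0.5 - K[0, 2]) * z / K[0, 0]              # back-project pixel u
    y = (h + 0.5 - K[1, 2]) * z / K[1, 1]              # back-project pixel v
    return np.array([x, y, z])
```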
- Bridging Stereo Geometry and BEV Representation with Reliable Mutual Interaction for Semantic Scene Completion [45.171150395915056]
3D semantic scene completion (SSC) is an ill-posed perception task that requires inferring a dense 3D scene from limited observations.
Previous camera-based methods struggle to predict accurate semantic scenes due to inherent geometric ambiguity and incomplete observations.
We resort to stereo matching techniques and bird's-eye-view (BEV) representation learning to address these issues in SSC.
arXiv Detail & Related papers (2023-03-24T12:33:44Z)
- CompNVS: Novel View Synthesis with Scene Completion [83.19663671794596]
We propose a generative pipeline performing on a sparse grid-based neural scene representation to complete unobserved scene parts.
We process encoded image features in 3D space with a geometry completion network and a subsequent texture inpainting network to extrapolate the missing area.
Photorealistic image sequences can then be obtained via consistency-relevant differentiable rendering.
arXiv Detail & Related papers (2022-07-23T09:03:13Z)
- Recognizing Scenes from Novel Viewpoints [99.90914180489456]
Humans can perceive scenes in 3D from a handful of 2D views. For AI agents, the ability to recognize a scene from any viewpoint given only a few images enables them to efficiently interact with the scene and its objects.
We propose a model which takes as input a few RGB images of a new scene and recognizes the scene from novel viewpoints by segmenting it into semantic categories.
arXiv Detail & Related papers (2021-12-02T18:59:40Z)
- MonoScene: Monocular 3D Semantic Scene Completion [9.92186106077902]
MonoScene proposes a 3D Semantic Scene Completion (SSC) framework, where the dense geometry and semantics of a scene are inferred from a single monocular image.
Our framework relies on successive 2D and 3D UNets bridged by a novel 2D-3D features projection inspired by optics.
arXiv Detail & Related papers (2021-12-01T18:59:57Z)
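A toy version of an optics-inspired 2D-3D projection in the spirit of the MonoScene summary above: every voxel center samples the 2D feature at its projected pixel, so all voxels along one camera ray share the same 2D evidence. Nearest-neighbor sampling is an assumption standing in for the paper's exact multi-scale scheme.

```python
# Project 3D voxel centers onto the image plane and sample 2D UNet features there;
# a simplified stand-in for MonoScene's 2D-3D projection, not its implementation.
import numpy as np

def project_features(feat2d, voxel_centers, K):
    """feat2d: (H, W, C); voxel_centers: (N, 3) in the camera frame -> (N, C)."""
    H, W, C = feat2d.shape
    z = np.maximum(voxel_centers[:, 2], 1e-6)
    u = (voxel_centers[:, 0] * K[0, 0] / z + K[0, 2]).astype(int)
    v = (voxel_centers[:, 1] * K[1, 1] / z + K[1, 2]).astype(int)
    out = np.zeros((len(voxel_centers), C))
    inb = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (voxel_centers[:, 2] > 0)
    out[inb] = feat2d[v[inb], u[inb]]   # voxels outside the frustum keep zeros
    return out
```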