Monocular Semantic Scene Completion via Masked Recurrent Networks
- URL: http://arxiv.org/abs/2507.17661v1
- Date: Wed, 23 Jul 2025 16:29:45 GMT
- Title: Monocular Semantic Scene Completion via Masked Recurrent Networks
- Authors: Xuzhi Wang, Xinran Wu, Song Wang, Lingdong Kong, Ziping Zhao,
- Abstract summary: Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination.<n>We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network.<n> Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes.
- Score: 11.783890904850828
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Monocular Semantic Scene Completion (MSSC) aims to predict the voxel-wise occupancy and semantic category from a single-view RGB image. Existing methods adopt a single-stage framework that aims to simultaneously achieve visible region segmentation and occluded region hallucination, while also being affected by inaccurate depth estimation. Such methods often achieve suboptimal performance, especially in complex scenes. We propose a novel two-stage framework that decomposes MSSC into coarse MSSC followed by the Masked Recurrent Network. Specifically, we propose the Masked Sparse Gated Recurrent Unit (MS-GRU) which concentrates on the occupied regions by the proposed mask updating mechanism, and a sparse GRU design is proposed to reduce the computation cost. Additionally, we propose the distance attention projection to reduce projection errors by assigning different attention scores according to the distance to the observed surface. Experimental results demonstrate that our proposed unified framework, MonoMRN, effectively supports both indoor and outdoor scenes and achieves state-of-the-art performance on the NYUv2 and SemanticKITTI datasets. Furthermore, we conduct robustness analysis under various disturbances, highlighting the role of the Masked Recurrent Network in enhancing the model's resilience to such challenges. The source code is publicly available.
Related papers
- MaskAttn-UNet: A Mask Attention-Driven Framework for Universal Low-Resolution Image Segmentation [5.130440339897479]
MaskAttn-UNet is a novel segmentation framework that enhances the traditional U-Net architecture via a mask attention mechanism.<n>Our model selectively emphasizes important regions while suppressing irrelevant backgrounds, thereby improving segmentation accuracy in cluttered and complex scenes.<n>Our results show that MaskAttn-UNet achieves accuracy comparable to state-of-the-art methods at significantly lower computational cost than transformer-based models.
arXiv Detail & Related papers (2025-03-11T22:43:26Z) - RUN: Reversible Unfolding Network for Concealed Object Segmentation [61.13528324971598]
reversible strategies across both mask and RGB domains.<n>We propose the Reversible Unfolding Network (RUN), which applies reversible strategies across both mask and RGB domains.
arXiv Detail & Related papers (2025-01-30T22:19:15Z) - Exact: Exploring Space-Time Perceptive Clues for Weakly Supervised Satellite Image Time Series Semantic Segmentation [11.193770734116981]
This paper embraces the weakly supervised paradigm (i.e., only image-level categories available) to liberate the crop mapping task from the exhaustive annotation burden.<n>We propose a novel method, termed exploring space-time perceptive clues (Exact)<n>Our method demonstrates impressive performance on various SITS benchmarks.
arXiv Detail & Related papers (2024-12-05T08:37:56Z) - Bridge the Points: Graph-based Few-shot Segment Anything Semantically [79.1519244940518]
Recent advancements in pre-training techniques have enhanced the capabilities of vision foundation models.
Recent studies extend the SAM to Few-shot Semantic segmentation (FSS)
We propose a simple yet effective approach based on graph analysis.
arXiv Detail & Related papers (2024-10-09T15:02:28Z) - Reprojection Errors as Prompts for Efficient Scene Coordinate Regression [9.039259735902625]
Scene coordinate regression (SCR) methods have emerged as a promising area of research due to their potential for accurate visual localization.
Many existing SCR approaches train on samples from all image regions, including dynamic objects and texture-less areas.
We introduce an error-guided feature selection mechanism, in tandem with the use of the Segment Anything Model (SAM)
This mechanism seeds low reprojection areas as prompts and expands them into error-guided masks, and then utilizes these masks to sample points and filter out problematic areas in an iterative manner.
arXiv Detail & Related papers (2024-09-06T10:43:34Z) - Feature Attenuation of Defective Representation Can Resolve Incomplete Masking on Anomaly Detection [1.0358639819750703]
In unsupervised anomaly detection (UAD) research, it is necessary to develop a computationally efficient and scalable solution.
We revisit the reconstruction-by-inpainting approach and rethink to improve it by analyzing strengths and weaknesses.
We propose Feature Attenuation of Defective Representation (FADeR) that only employs two layers which attenuates feature information of anomaly reconstruction.
arXiv Detail & Related papers (2024-07-05T15:44:53Z) - Region-aware Distribution Contrast: A Novel Approach to Multi-Task Partially Supervised Learning [50.88504784466931]
Multi-task dense prediction involves semantic segmentation, depth estimation, and surface normal estimation.
Existing solutions typically rely on learning global image representations for global cross-task image matching.
Our proposal involves modeling region-wise representations using Gaussian Distributions.
arXiv Detail & Related papers (2024-03-15T12:41:30Z) - SODAR: Segmenting Objects by DynamicallyAggregating Neighboring Mask
Representations [90.8752454643737]
Recent state-of-the-art one-stage instance segmentation model SOLO divides the input image into a grid and directly predicts per grid cell object masks with fully-convolutional networks.
We observe SOLO generates similar masks for an object at nearby grid cells, and these neighboring predictions can complement each other as some may better segment certain object part.
Motivated by the observed gap, we develop a novel learning-based aggregation method that improves upon SOLO by leveraging the rich neighboring information.
arXiv Detail & Related papers (2022-02-15T13:53:03Z) - Calibrated Hyperspectral Image Reconstruction via Graph-based
Self-Tuning Network [40.71031760929464]
Hyperspectral imaging (HSI) has attracted increasing research attention, especially for the ones based on a coded snapshot spectral imaging (CASSI) system.
Existing deep HSI reconstruction models are generally trained on paired data to retrieve original signals upon 2D compressed measurements given by a particular optical hardware mask in CASSI.
This mask-specific training style will lead to a hardware miscalibration issue, which sets up barriers to deploying deep HSI models among different hardware and noisy environments.
We propose a novel Graph-based Self-Tuning ( GST) network to reason uncertainties adapting to varying spatial structures of masks among
arXiv Detail & Related papers (2021-12-31T09:39:13Z) - Semantic Attention and Scale Complementary Network for Instance
Segmentation in Remote Sensing Images [54.08240004593062]
We propose an end-to-end multi-category instance segmentation model, which consists of a Semantic Attention (SEA) module and a Scale Complementary Mask Branch (SCMB)
SEA module contains a simple fully convolutional semantic segmentation branch with extra supervision to strengthen the activation of interest instances on the feature map.
SCMB extends the original single mask branch to trident mask branches and introduces complementary mask supervision at different scales.
arXiv Detail & Related papers (2021-07-25T08:53:59Z) - Spatial and spectral deep attention fusion for multi-channel speech
separation using deep embedding features [60.20150317299749]
Multi-channel deep clustering (MDC) has acquired a good performance for speech separation.
We propose a deep attention fusion method to dynamically control the weights of the spectral and spatial features and combine them deeply.
Experimental results show that the proposed method outperforms MDC baseline and even better than the ideal binary mask (IBM)
arXiv Detail & Related papers (2020-02-05T03:49:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.