Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images
- URL: http://arxiv.org/abs/2508.06546v1
- Date: Tue, 05 Aug 2025 21:25:50 GMT
- Title: Statistical Confidence Rescoring for Robust 3D Scene Graph Generation from Multi-View Images
- Authors: Qi Xun Yeo, Yanyan Li, Gim Hee Lee
- Abstract summary: Modern 3D semantic scene graph estimation methods utilize ground truth 3D annotations to accurately predict target objects, predicates, and relationships. We overcome the noisy pseudo point-based geometry reconstructed from predicted depth maps and reduce the amount of background noise present in multi-view image features. Our method outperforms current methods that use only multi-view images as the initial input.
- Score: 56.134885746889026
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Modern 3D semantic scene graph estimation methods utilize ground truth 3D annotations to accurately predict target objects, predicates, and relationships. In the absence of given 3D ground truth representations, we explore leveraging only multi-view RGB images to tackle this task. To attain robust features for accurate scene graph estimation, we must overcome the noisy pseudo point-based geometry reconstructed from predicted depth maps and reduce the amount of background noise present in multi-view image features. The key is to enrich node and edge features with accurate semantic and spatial information, both directly and through neighboring relations. We obtain semantic masks to guide feature aggregation in filtering out background features, and we design a novel method to incorporate neighboring node information to improve the robustness of our scene graph estimates. Furthermore, we leverage explicit statistical priors calculated from the training summary statistics to refine node and edge predictions based on their one-hop neighborhood. Our experiments show that our method outperforms current methods that use only multi-view images as the initial input. Our project page is available at https://qixun1.github.io/projects/SCRSSG.
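The abstract only sketches the rescoring step, so here is a minimal illustration of how one-hop statistical rescoring could work: a Laplace-smoothed class co-occurrence prior is built from training graphs, then blended with each node's predicted class distribution. The prior construction, the blending weight `alpha`, and all function names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def build_cooccurrence_prior(train_graphs, num_classes, eps=1.0):
    """Count how often class pairs occur as 1-hop neighbours in training graphs.

    `train_graphs` is a list of (node_labels, edges) tuples, where `edges`
    holds (src, dst) node-index pairs. Returns P(class_i | neighbour class_j).
    """
    counts = np.full((num_classes, num_classes), eps)  # Laplace smoothing
    for node_labels, edges in train_graphs:
        for src, dst in edges:
            counts[node_labels[src], node_labels[dst]] += 1
            counts[node_labels[dst], node_labels[src]] += 1
    return counts / counts.sum(axis=0, keepdims=True)  # columns sum to 1

def rescore_nodes(node_probs, edges, prior, alpha=0.7):
    """Blend each node's predicted distribution (N, C) with the prior
    implied by its 1-hop neighbours' current predictions."""
    rescored = node_probs.copy()
    for i in range(len(node_probs)):
        nbrs = [d for s, d in edges if s == i] + [s for s, d in edges if d == i]
        if not nbrs:
            continue
        # Expected prior over node i's classes, marginalising neighbour beliefs.
        prior_i = np.mean([prior @ node_probs[j] for j in nbrs], axis=0)
        rescored[i] = alpha * node_probs[i] + (1 - alpha) * prior_i
        rescored[i] /= rescored[i].sum()
    return rescored
```

Edge predicates could be rescored analogously with a predicate-given-subject-and-object statistics table.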
Related papers
- ViPOcc: Leveraging Visual Priors from Vision Foundation Models for Single-View 3D Occupancy Prediction [11.312780421161204]
In this paper, we propose ViPOcc, which leverages visual priors from vision foundation models for fine-grained 3D occupancy prediction. We also propose a semantic-guided non-overlapping Gaussian mixture sampler for efficient, instance-aware ray sampling. Our experiments demonstrate the superior performance of ViPOcc in both 3D occupancy prediction and depth estimation tasks.
arXiv Detail & Related papers (2024-12-15T15:04:27Z)
- Multiview Scene Graph [7.460438046915524]
A proper scene representation is central to the pursuit of spatial intelligence.
We propose to build Multiview Scene Graphs (MSG) from unposed images.
MSG represents a scene topologically with interconnected place and object nodes.
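As a rough sketch of the kind of topological representation MSG describes (interconnected place and object nodes), assuming simple dictionary-backed storage and an illustrative `observes` edge type:

```python
from dataclasses import dataclass, field

@dataclass
class MultiviewSceneGraph:
    place_nodes: dict = field(default_factory=dict)   # image_id -> per-view info
    object_nodes: dict = field(default_factory=dict)  # object_id -> class label
    edges: set = field(default_factory=set)           # (node_a, node_b, relation)

    def add_observation(self, image_id, object_id, label):
        """Link a place node to an object node it observes."""
        self.place_nodes.setdefault(image_id, {})
        self.object_nodes[object_id] = label
        self.edges.add((image_id, object_id, "observes"))

msg = MultiviewSceneGraph()
msg.add_observation("img_003", "chair_1", "chair")
```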
arXiv Detail & Related papers (2024-10-15T02:04:05Z)
- DoubleTake: Geometry Guided Depth Estimation [17.464549832122714]
Estimating depth from a sequence of posed RGB images is a fundamental computer vision task.
We introduce a reconstruction approach that combines volume features with a hint of the prior geometry, rendered as a depth map from the current camera location.
We demonstrate that our method runs at interactive speeds and produces state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.
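A minimal sketch of the "geometry hint" idea, assuming fusion by channel concatenation; the paper's actual architecture and hint encoding differ in detail:

```python
import numpy as np

def fuse_hint(volume_feats, prior_depth, confidence):
    """volume_feats: (C, H, W) multi-view matching features.
    prior_depth:  (H, W) depth rendered from previously built geometry.
    confidence:   (H, W) validity mask for the rendered hint."""
    hint = np.stack([prior_depth, confidence])           # (2, H, W)
    # Concatenated tensor would be fed to the depth decoder.
    return np.concatenate([volume_feats, hint], axis=0)  # (C+2, H, W)

feats = fuse_hint(np.zeros((32, 48, 64)), np.ones((48, 64)), np.ones((48, 64)))
```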
arXiv Detail & Related papers (2024-06-26T14:29:05Z)
- Inverse Neural Rendering for Explainable Multi-Object Tracking [35.072142773300655]
We recast 3D multi-object tracking from RGB cameras as an Inverse Rendering (IR) problem.
We optimize an image loss over generative latent spaces that inherently disentangle shape and appearance properties.
We validate the generalization and scaling capabilities of our method by learning the generative prior exclusively from synthetic data.
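A minimal sketch of the underlying pattern, test-time optimization of disentangled latent codes against an image loss, with the differentiable renderer stubbed out; the loss and optimizer settings are illustrative assumptions:

```python
import torch

def render(shape_code, appearance_code):
    # Stand-in for a differentiable generative renderer G(z_shape, z_app) -> image.
    return (shape_code.sum() + appearance_code.sum()).expand(3, 8, 8)

observed = torch.zeros(3, 8, 8)                    # observed object crop
z_shape = torch.randn(64, requires_grad=True)      # disentangled shape latent
z_app = torch.randn(64, requires_grad=True)        # disentangled appearance latent
opt = torch.optim.Adam([z_shape, z_app], lr=1e-2)

for _ in range(100):                               # fit latents to the observation
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(render(z_shape, z_app), observed)
    loss.backward()
    opt.step()
```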
arXiv Detail & Related papers (2024-04-18T17:37:53Z)
- 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features [70.50665869806188]
3DiffTection is a state-of-the-art method for 3D object detection from single images.
We fine-tune a diffusion model to perform novel view synthesis conditioned on a single image.
We further train the model on target data with detection supervision.
arXiv Detail & Related papers (2023-11-07T23:46:41Z)
- Incremental 3D Semantic Scene Graph Prediction from RGB Sequences [86.77318031029404]
We propose a real-time framework that incrementally builds a consistent 3D semantic scene graph of a scene given an RGB image sequence.
Our method consists of a novel incremental entity estimation pipeline and a scene graph prediction network.
The proposed network estimates 3D semantic scene graphs with iterative message passing using multi-view and geometric features extracted from the scene entities.
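A minimal sketch of one message-passing round over scene-graph nodes, using plain mean aggregation and linear maps; the paper's network operates on learned multi-view and geometric features with a more elaborate update:

```python
import torch

def message_passing_step(node_feats, edge_index, w_msg, w_upd):
    """node_feats: (N, D); edge_index: (2, E) src/dst node indices."""
    src, dst = edge_index
    messages = node_feats[src] @ w_msg                   # (E, D) per-edge messages
    agg = torch.zeros_like(node_feats)
    agg.index_add_(0, dst, messages)                     # sum messages per node
    deg = torch.bincount(dst, minlength=len(node_feats)).clamp(min=1)
    agg = agg / deg.unsqueeze(1)                         # mean aggregation
    return torch.relu(node_feats @ w_upd + agg)          # updated node features

N, D = 5, 16
edge_index = torch.tensor([[0, 1, 2], [1, 2, 0]])
out = message_passing_step(torch.randn(N, D), edge_index,
                           torch.randn(D, D), torch.randn(D, D))
```

Iterating this step several times lets information propagate beyond the one-hop neighborhood.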
arXiv Detail & Related papers (2023-05-04T11:32:16Z)
- Soft Expectation and Deep Maximization for Image Feature Detection [68.8204255655161]
We propose SEDM, an iterative semi-supervised learning process that flips the question and first looks for repeatable 3D points, then trains a detector to localize them in image space.
Our results show that this new model trained using SEDM is able to better localize the underlying 3D points in a scene.
arXiv Detail & Related papers (2021-04-21T00:35:32Z)
- SceneGraphFusion: Incremental 3D Scene Graph Prediction from RGB-D Sequences [76.28527350263012]
We propose a method to incrementally build up semantic scene graphs from a 3D environment given a sequence of RGB-D frames.
We aggregate PointNet features from primitive scene components by means of a graph neural network.
Our approach outperforms 3D scene graph prediction methods by a large margin and its accuracy is on par with other 3D semantic and panoptic segmentation methods while running at 35 Hz.
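A minimal sketch of extracting a permutation-invariant PointNet-style descriptor per geometric segment, which would then feed the graph neural network; the shared MLP and max pooling below are generic stand-ins, not the paper's exact design:

```python
import torch

# Shared point-wise MLP applied to every 3D point independently.
point_mlp = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 128))

def segment_feature(points):
    """points: (P, 3) points of one geometric segment -> (128,) descriptor."""
    return point_mlp(points).max(dim=0).values  # permutation-invariant max pool

segments = [torch.randn(200, 3), torch.randn(50, 3)]   # two primitive segments
node_feats = torch.stack([segment_feature(p) for p in segments])
```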
arXiv Detail & Related papers (2021-03-27T13:00:36Z)
- Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images [69.5662419067878]
Grounding referring expressions in RGBD images is an emerging field.
We present a novel task of 3D visual grounding in single-view RGBD images, where the referred objects are often only partially scanned due to occlusion.
Our approach first fuses the language and the visual features at the bottom level to generate a heatmap that localizes the relevant regions in the RGBD image.
Then our approach conducts an adaptive feature learning based on the heatmap and performs the object-level matching with another visio-linguistic fusion to finally ground the referred object.
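A minimal sketch of the bottom-level fusion idea, a relevance heatmap from language-visual similarity used to gate features; cosine similarity and min-max normalization here are illustrative stand-ins for the paper's learned fusion:

```python
import numpy as np

def relevance_heatmap(vis_feats, lang_feat):
    """vis_feats: (H, W, D) per-pixel features; lang_feat: (D,) sentence embedding."""
    v = vis_feats / (np.linalg.norm(vis_feats, axis=-1, keepdims=True) + 1e-8)
    l = lang_feat / (np.linalg.norm(lang_feat) + 1e-8)
    heat = v @ l                                        # (H, W) cosine similarity
    return (heat - heat.min()) / (np.ptp(heat) + 1e-8)  # normalise to [0, 1]

def gate_features(vis_feats, heat):
    return vis_feats * heat[..., None]                  # suppress irrelevant regions

heat = relevance_heatmap(np.random.rand(48, 64, 32), np.random.rand(32))
```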
arXiv Detail & Related papers (2021-03-14T11:18:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.