ROODI: Reconstructing Occluded Objects with Denoising Inpainters
- URL: http://arxiv.org/abs/2503.10256v2
- Date: Sat, 09 Aug 2025 14:45:49 GMT
- Title: ROODI: Reconstructing Occluded Objects with Denoising Inpainters
- Authors: Yeonjin Chang, Erqun Dong, Seunghyeon Seo, Nojun Kwak, Kwang Moo Yi
- Abstract summary: A novel object extraction method based on two key principles. We employ an off-the-shelf diffusion-based inpainter combined with occlusion reasoning, utilizing the 3D representation of the entire scene. Our approach outperforms the state-of-the-art, demonstrating its effectiveness in object extraction from complex scenes.
- Score: 34.37743884589211
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While the quality of novel-view images has improved dramatically with 3D Gaussian Splatting, extracting specific objects from scenes remains challenging. Isolating individual 3D Gaussian primitives for each object and handling occlusions in scenes remains far from being solved. We propose a novel object extraction method based on two key principles: (1) object-centric reconstruction through removal of irrelevant primitives; and (2) leveraging generative inpainting to compensate for missing observations caused by occlusions. For pruning, we propose to remove irrelevant Gaussians by examining how close each one is to its K-nearest neighbors and discarding those that are statistical outliers. Importantly, these distances must take into account the actual spatial extent the Gaussians cover -- we thus propose to use Wasserstein distances. For inpainting, we employ an off-the-shelf diffusion-based inpainter combined with occlusion reasoning, utilizing the 3D representation of the entire scene. Our findings highlight the crucial synergy between proper pruning and inpainting, both of which significantly enhance extraction performance. We evaluate our method on a standard real-world dataset and introduce a synthetic dataset for quantitative analysis. Our approach outperforms the state-of-the-art, demonstrating its effectiveness in object extraction from complex scenes.
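The pruning step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes isotropic 3D Gaussians (a mean plus a single scale), for which the 2-Wasserstein distance has a simple closed form, and the function names, `k`, and the `std_factor` threshold are all illustrative choices.

```python
import math

def w2_isotropic(g1, g2):
    """2-Wasserstein distance between two isotropic 3D Gaussians.

    For N(mu1, s1^2 I) and N(mu2, s2^2 I) in 3D the closed form is
        W2^2 = ||mu1 - mu2||^2 + 3 * (s1 - s2)^2,
    so the distance accounts for both position and spatial extent,
    which is the point the abstract makes about using Wasserstein
    distances instead of plain center-to-center distances.
    """
    (m1, s1), (m2, s2) = g1, g2
    pos = sum((a - b) ** 2 for a, b in zip(m1, m2))
    return math.sqrt(pos + 3.0 * (s1 - s2) ** 2)

def prune_outliers(gaussians, k=4, std_factor=2.0):
    """Statistical outlier removal over mean K-nearest-neighbor W2 distances.

    A Gaussian is dropped when its mean distance to its k nearest
    neighbors exceeds (population mean + std_factor * population std).
    """
    n = len(gaussians)
    mean_knn = []
    for i in range(n):
        dists = sorted(w2_isotropic(gaussians[i], gaussians[j])
                       for j in range(n) if j != i)
        mean_knn.append(sum(dists[:k]) / k)
    mu = sum(mean_knn) / n
    var = sum((d - mu) ** 2 for d in mean_knn) / n
    thresh = mu + std_factor * math.sqrt(var)
    return [g for g, d in zip(gaussians, mean_knn) if d <= thresh]

# A tight cluster plus one stray Gaussian: the stray is pruned.
cloud = [((0.0, 0.0, 0.0), 0.1), ((0.1, 0.0, 0.0), 0.1),
         ((0.0, 0.1, 0.0), 0.1), ((0.0, 0.0, 0.1), 0.1),
         ((0.05, 0.05, 0.0), 0.1), ((10.0, 10.0, 10.0), 0.1)]
kept = prune_outliers(cloud, k=3, std_factor=1.0)
```

In practice a 3DGS scene has anisotropic covariances, so the full Bures term of the 2-Wasserstein distance (and a spatial index instead of the quadratic neighbor search) would be needed; the sketch only conveys why extent-aware distances separate genuine object primitives from floaters better than positional distance alone.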
Related papers
- Beyond Averages: Open-Vocabulary 3D Scene Understanding with Gaussian Splatting and Bag of Embeddings [17.855913571198013]
We propose a paradigm-shifting alternative that bypasses differentiable rendering for semantics entirely. Our key insight is to leverage predecomposed object-level Gaussians and represent each object through multiview CLIP feature aggregation. This allows: (1) accurate open-vocabulary object retrieval by comparing text queries to object-level (not Gaussian-level) embeddings, and (2) seamless task adaptation: propagating object IDs to pixels for 2D segmentation or to Gaussians for 3D extraction.
arXiv Detail & Related papers (2025-09-16T10:39:37Z) - ObjectGS: Object-aware Scene Reconstruction and Scene Understanding via Gaussian Splatting [54.92763171355442]
ObjectGS is an object-aware framework that unifies 3D scene reconstruction with semantic understanding. We show through experiments that ObjectGS outperforms state-of-the-art methods on open-vocabulary and panoptic segmentation tasks.
arXiv Detail & Related papers (2025-07-21T10:06:23Z) - TRAN-D: 2D Gaussian Splatting-based Sparse-view Transparent Object Depth Reconstruction via Physics Simulation for Scene Update [14.360210515795904]
TRAN-D is a novel 2D Gaussian Splatting-based depth reconstruction method for transparent objects. We mitigate artifacts with an object-aware loss that places Gaussians in obscured regions. We incorporate a physics-based simulation that refines the reconstruction in just a few seconds.
arXiv Detail & Related papers (2025-07-15T08:02:37Z) - RobustSplat: Decoupling Densification and Dynamics for Transient-Free 3DGS [79.15416002879239]
3D Gaussian Splatting has gained significant attention for its real-time, photo-realistic rendering in novel-view synthesis and 3D modeling. Existing methods struggle with accurately modeling scenes affected by transient objects, leading to artifacts in the rendered images. We propose RobustSplat, a robust solution based on two critical designs.
arXiv Detail & Related papers (2025-06-03T11:13:48Z) - IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.
We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.
We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z) - DeclutterNeRF: Generative-Free 3D Scene Recovery for Occlusion Removal [12.381139489267495]
We introduce DeclutterNeRF, an occlusion removal method free from generative priors.
DeclutterNeRF significantly outperforms state-of-the-art methods on our proposed DeclutterSet.
arXiv Detail & Related papers (2025-04-07T02:22:08Z) - Robust 3D Gaussian Splatting for Novel View Synthesis in Presence of Distractors [44.55317154371679]
3D Gaussian Splatting has shown impressive novel view synthesis results.
It is vulnerable to dynamic objects polluting the input data of an otherwise static scene, so-called distractors.
We show that our approach is robust to various distractors and strongly improves rendering quality on distractor-polluted scenes.
arXiv Detail & Related papers (2024-08-21T15:21:27Z) - SkelFormer: Markerless 3D Pose and Shape Estimation using Skeletal Transformers [57.46911575980854]
We introduce SkelFormer, a novel markerless motion capture pipeline for multi-view human pose and shape estimation.
Our method first uses off-the-shelf 2D keypoint estimators, pre-trained on large-scale in-the-wild data, to obtain 3D joint positions.
Next, we design a regression-based inverse-kinematic skeletal transformer that maps the joint positions to pose and shape representations from heavily noisy observations.
arXiv Detail & Related papers (2024-04-19T04:51:18Z) - DPA-Net: Structured 3D Abstraction from Sparse Views via Differentiable Primitive Assembly [18.655229356566785]
We present a differentiable rendering framework to learn structured 3D abstractions from sparse RGB images.
By leveraging differentiable volume rendering, our method does not require 3D supervision.
Our method demonstrates superior performance over state-of-the-art alternatives for 3D primitive abstraction from sparse views.
arXiv Detail & Related papers (2024-04-01T03:10:36Z) - DVMNet++: Rethinking Relative Pose Estimation for Unseen Objects [59.51874686414509]
Existing approaches typically predict 3D translation utilizing the ground-truth object bounding box and approximate 3D rotation with a large number of discrete hypotheses. We present a Deep Voxel Matching Network (DVMNet++) that computes the relative object pose in a single pass. Our approach delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods.
arXiv Detail & Related papers (2024-03-20T15:41:32Z) - Robust Shape Fitting for 3D Scene Abstraction [33.84212609361491]
In particular, we can describe man-made environments using volumetric primitives such as cuboids or cylinders.
We propose a robust estimator for primitive fitting, which meaningfully abstracts complex real-world environments using cuboids.
Results on the NYU Depth v2 dataset demonstrate that the proposed algorithm successfully abstracts cluttered real-world 3D scene layouts.
arXiv Detail & Related papers (2024-03-15T16:37:43Z) - Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding [58.924180772480504]
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query.
We propose to leverage weakly supervised annotations to learn the 3D visual grounding model.
We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
arXiv Detail & Related papers (2023-07-18T13:49:49Z) - Occupancy Planes for Single-view RGB-D Human Reconstruction [120.5818162569105]
Single-view RGB-D human reconstruction with implicit functions is often formulated as per-point classification.
We propose the occupancy planes (OPlanes) representation, which makes it possible to formulate single-view RGB-D human reconstruction as occupancy prediction on planes that slice through the camera's view frustum.
arXiv Detail & Related papers (2022-08-04T17:59:56Z) - High-resolution Iterative Feedback Network for Camouflaged Object Detection [128.893782016078]
Spotting camouflaged objects that are visually assimilated into the background is tricky for object detection algorithms.
We aim to extract the high-resolution texture details to avoid the detail degradation that causes blurred vision in edges and boundaries.
We introduce a novel HitNet to refine the low-resolution representations by high-resolution features in an iterative feedback manner.
arXiv Detail & Related papers (2022-03-22T11:20:21Z) - Zoom In and Out: A Mixed-scale Triplet Network for Camouflaged Object Detection [0.0]
We propose a mixed-scale triplet network, ZoomNet, which mimics the behavior of humans when observing vague images.
Specifically, our ZoomNet employs the zoom strategy to learn the discriminative mixed-scale semantics by the designed scale integration unit and hierarchical mixed-scale unit.
Our proposed highly task-friendly model consistently surpasses the existing 23 state-of-the-art methods on four public datasets.
arXiv Detail & Related papers (2022-03-05T09:13:52Z) - Occlusion-Robust Object Pose Estimation with Holistic Representation [42.27081423489484]
State-of-the-art (SOTA) object pose estimators take a two-stage approach.
We develop a novel occlude-and-blackout batch augmentation technique.
We also develop a multi-precision supervision architecture to encourage holistic pose representation learning.
arXiv Detail & Related papers (2021-10-22T08:00:26Z) - Object Wake-up: 3-D Object Reconstruction, Animation, and in-situ Rendering from a Single Image [58.69732754597448]
Given a picture of a chair, could we extract the 3-D shape of the chair, animate its plausible articulations and motions, and render in-situ in its original image space?
We devise an automated approach to extract and manipulate articulated objects in single images.
arXiv Detail & Related papers (2021-08-05T16:20:12Z) - Cuboids Revisited: Learning Robust 3D Shape Fitting to Single RGB Images [44.223070672713455]
In particular, man-made environments commonly consist of volumetric primitives such as cuboids or cylinders.
Previous approaches directly estimate shape parameters from a 2D or 3D input, and are only able to reproduce simple objects.
We propose a robust estimator for primitive fitting, which can meaningfully abstract real-world environments using cuboids.
arXiv Detail & Related papers (2021-05-05T13:36:00Z) - Secrets of 3D Implicit Object Shape Reconstruction in the Wild [92.5554695397653]
Reconstructing high-fidelity 3D objects from sparse, partial observation is crucial for various applications in computer vision, robotics, and graphics.
Recent neural implicit modeling methods show promising results on synthetic or dense datasets.
But, they perform poorly on real-world data that is sparse and noisy.
This paper analyzes the root cause of such deficient performance of a popular neural implicit model.
arXiv Detail & Related papers (2021-01-18T03:24:48Z) - Convolutional Occupancy Networks [88.48287716452002]
We propose Convolutional Occupancy Networks, a more flexible implicit representation for detailed reconstruction of objects and 3D scenes.
By combining convolutional encoders with implicit occupancy decoders, our model incorporates inductive biases, enabling structured reasoning in 3D space.
We empirically find that our method enables the fine-grained implicit 3D reconstruction of single objects, scales to large indoor scenes, and generalizes well from synthetic to real data.
arXiv Detail & Related papers (2020-03-10T10:17:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.