Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images
- URL: http://arxiv.org/abs/2103.07894v3
- Date: Wed, 17 Mar 2021 06:35:20 GMT
- Title: Refer-it-in-RGBD: A Bottom-up Approach for 3D Visual Grounding in RGBD Images
- Authors: Haolin Liu, Anran Lin, Xiaoguang Han, Lei Yang, Yizhou Yu, Shuguang Cui
- Abstract summary: Grounding referring expressions in RGBD images is an emerging field.
We present a novel task of 3D visual grounding in a single-view RGBD image, where the referred objects are often only partially scanned due to occlusion.
Our approach first fuses the language and visual features at the bottom level to generate a heatmap that localizes the relevant regions in the RGBD image.
It then conducts adaptive feature learning based on the heatmap and performs object-level matching with another visio-linguistic fusion to finally ground the referred object.
- Score: 69.5662419067878
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Grounding referring expressions in RGBD images is an emerging field. We
present a novel task of 3D visual grounding in a single-view RGBD image, where the
referred objects are often only partially scanned due to occlusion. In contrast
to previous works that directly generate object proposals for grounding in the
3D scene, we propose a bottom-up approach that gradually aggregates context-aware
information, effectively addressing the challenge posed by the partial
geometry. Our approach first fuses the language and visual features at the
bottom level to generate a heatmap that coarsely localizes the relevant regions
in the RGBD image. It then conducts adaptive feature learning based on the
heatmap and performs object-level matching with another visio-linguistic fusion
to finally ground the referred object. We evaluate the proposed method against
state-of-the-art methods on both the RGBD images extracted from the ScanRefer
dataset and our newly collected SUNRefer dataset. Experiments show that our
method outperforms previous methods by a large margin (by 11.2% and 15.6% in
Acc@0.5, respectively) on the two datasets.
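The abstract describes a two-stage, bottom-up pipeline: a bottom-level visio-linguistic fusion predicts a relevance heatmap over the RGBD input, then heatmap-guided feature aggregation and an object-level fusion score candidate objects against the expression. The sketch below is a minimal PyTorch illustration of that flow, assuming per-point visual features, a sentence-level language embedding, and soft object masks as inputs; all module names, dimensions, and the pooling scheme are assumptions and do not reproduce the authors' architecture.

```python
# Minimal, illustrative sketch of the bottom-up grounding flow described above.
# Module names, dimensions, and the heatmap-guided pooling are assumptions.
import torch
import torch.nn as nn


class BottomUpGrounding(nn.Module):
    def __init__(self, vis_dim=128, lang_dim=256, hidden=128):
        super().__init__()
        # Stage 1: fuse per-point visual features with the sentence embedding
        # and predict a relevance heatmap over the scene points.
        self.word_fuse = nn.Sequential(
            nn.Linear(vis_dim + lang_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )
        # Stage 2: fuse heatmap-weighted object features with language again
        # and score each candidate object against the expression.
        self.object_fuse = nn.Sequential(
            nn.Linear(vis_dim + lang_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, point_feats, lang_feat, object_masks):
        # point_feats:  (N, vis_dim)   per-point features from the RGBD input
        # lang_feat:    (lang_dim,)    sentence-level language embedding
        # object_masks: (K, N)         soft membership of points in K candidates
        n_points = point_feats.shape[0]
        lang_tiled = lang_feat.unsqueeze(0).expand(n_points, -1)

        # Stage 1: bottom-level visio-linguistic fusion -> relevance heatmap.
        heatmap = torch.sigmoid(
            self.word_fuse(torch.cat([point_feats, lang_tiled], dim=-1))
        ).squeeze(-1)                                    # (N,)

        # Heatmap-guided ("adaptive") aggregation into object-level features.
        weights = object_masks * heatmap.unsqueeze(0)    # (K, N)
        weights = weights / (weights.sum(dim=1, keepdim=True) + 1e-6)
        obj_feats = weights @ point_feats                # (K, vis_dim)

        # Stage 2: object-level visio-linguistic fusion -> matching scores.
        lang_obj = lang_feat.unsqueeze(0).expand(obj_feats.shape[0], -1)
        scores = self.object_fuse(torch.cat([obj_feats, lang_obj], dim=-1)).squeeze(-1)
        return heatmap, scores                           # argmax(scores) = referred object
```

Given features from a point-cloud backbone and a language encoder, `scores.argmax()` would index the grounded candidate; the reported Acc@0.5 metric implies the full method also predicts a 3D bounding box for the referred object, which this sketch omits.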
Related papers
- Towards Human-Level 3D Relative Pose Estimation: Generalizable, Training-Free, with Single Reference [62.99706119370521]
Humans can easily deduce the relative pose of an unseen object, without labels or training, given only a single query-reference image pair.
We propose a novel, generalizable 3D relative pose estimation method that combines (i) a 2.5D shape from an RGB-D reference, (ii) an off-the-shelf differentiable renderer, and (iii) semantic cues from a pretrained model such as DINOv2.
arXiv Detail & Related papers (2024-06-26T16:01:10Z)
- RDPN6D: Residual-based Dense Point-wise Network for 6Dof Object Pose Estimation Based on RGB-D Images [13.051302134031808]
We introduce a novel method for calculating the 6DoF pose of an object using a single RGB-D image.
Unlike existing methods that either directly predict objects' poses or rely on sparse keypoints for pose recovery, our approach addresses this challenging task using dense correspondence.
arXiv Detail & Related papers (2024-05-14T10:10:45Z)
- MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images [57.71600854525037]
We propose a Fuse-Describe-Match strategy for 6D pose estimation from RGB-D images.
MatchU is a generic approach that fuses 2D texture and 3D geometric cues for 6D pose prediction of unseen objects.
arXiv Detail & Related papers (2024-03-03T14:01:03Z)
- Point Cloud Scene Completion with Joint Color and Semantic Estimation from Single RGB-D Image [45.640943637433416]
We present a deep reinforcement learning method of progressive view inpainting for colored semantic point cloud scene completion under volume guidance.
Our approach is end-to-end, consisting of three modules: 3D scene volume reconstruction, 2D RGB-D and segmentation image inpainting, and multi-view selection for completion.
arXiv Detail & Related papers (2022-10-12T03:08:24Z)
- Unsupervised Multi-View Object Segmentation Using Radiance Field Propagation [55.9577535403381]
We present a novel approach to segmenting objects in 3D during reconstruction given only unlabeled multi-view images of a scene.
The core of our method is a novel propagation strategy for individual objects' radiance fields with a bidirectional photometric loss.
To the best of our knowledge, radiance field propagation (RFP) is the first unsupervised approach to tackle 3D scene object segmentation for neural radiance fields (NeRF).
arXiv Detail & Related papers (2022-10-02T11:14:23Z)
- Towards Two-view 6D Object Pose Estimation: A Comparative Study on Fusion Strategy [16.65699606802237]
Current RGB-based 6D object pose estimation methods have achieved noticeable performance on benchmark datasets and in real-world applications.
This paper proposes a framework for 6D object pose estimation that learns implicit 3D information from two RGB images.
arXiv Detail & Related papers (2022-07-01T08:22:34Z)
- Memory-Augmented Reinforcement Learning for Image-Goal Navigation [67.3963444878746]
We present a novel method that leverages a cross-episode memory to learn to navigate.
In order to avoid overfitting, we propose to use data augmentation on the RGB input during training.
We obtain this competitive performance from RGB input only, without access to additional sensors such as position or depth.
arXiv Detail & Related papers (2021-01-13T16:30:20Z)
- Geometric Correspondence Fields: Learned Differentiable Rendering for 3D Pose Refinement in the Wild [96.09941542587865]
We present a novel 3D pose refinement approach based on differentiable rendering for objects of arbitrary categories in the wild.
In this way, we precisely align 3D models to objects in RGB images, which results in significantly improved 3D pose estimates.
We evaluate our approach on the challenging Pix3D dataset and achieve up to 55% relative improvement compared to state-of-the-art refinement methods in multiple metrics.
arXiv Detail & Related papers (2020-07-17T12:34:38Z)
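The last entry above refines 3D object poses by back-propagating an image alignment error through differentiable rendering. Below is a minimal, generic sketch of that render-and-refine idea, assuming a toy pinhole point projection in place of a real differentiable renderer and 2D point observations in place of image evidence; none of the function names or parameters come from the paper.

```python
# Generic, illustrative sketch of pose refinement through differentiable
# rendering: a toy point-projection model stands in for a full differentiable
# renderer, and the pose is refined by gradient descent on a 2D alignment loss.
# All functions are assumptions, not the Geometric Correspondence Fields method.
import torch


def axis_angle_to_matrix(r: torch.Tensor) -> torch.Tensor:
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = torch.sqrt((r * r).sum() + 1e-12)
    k = r / theta
    zero = torch.zeros((), dtype=r.dtype)
    K = torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    eye = torch.eye(3, dtype=r.dtype)
    return eye + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)


def project(points, rot_vec, trans, focal: float = 500.0):
    """Rotate, translate, and pinhole-project 3D model points to 2D pixels."""
    cam = points @ axis_angle_to_matrix(rot_vec).T + trans   # (N, 3)
    return focal * cam[:, :2] / cam[:, 2:3]                  # (N, 2)


def refine_pose(model_points, observed_2d, rot_init, trans_init,
                steps: int = 200, lr: float = 1e-2):
    """Gradient-descend the pose so the projection matches the observation."""
    rot = rot_init.clone().requires_grad_(True)
    trans = trans_init.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([rot, trans], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = (project(model_points, rot, trans) - observed_2d).abs().mean()
        loss.backward()
        optimizer.step()
    return rot.detach(), trans.detach()


if __name__ == "__main__":
    torch.manual_seed(0)
    pts = torch.randn(200, 3)                        # toy 3D model points
    rot_gt = torch.tensor([0.10, -0.20, 0.05])
    trans_gt = torch.tensor([0.0, 0.0, 5.0])
    observed = project(pts, rot_gt, trans_gt)        # "observed" projections
    rot0 = torch.tensor([0.0, 0.0, 0.0])             # coarse initial pose
    trans0 = torch.tensor([0.1, -0.1, 4.5])
    rot_est, trans_est = refine_pose(pts, observed, rot0, trans0)
    print(rot_est, trans_est)
```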
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.