Bidirectional Projection Network for Cross Dimension Scene Understanding
- URL: http://arxiv.org/abs/2103.14326v1
- Date: Fri, 26 Mar 2021 08:31:39 GMT
- Title: Bidirectional Projection Network for Cross Dimension Scene Understanding
- Authors: Wenbo Hu, Hengshuang Zhao, Li Jiang, Jiaya Jia, Tien-Tsin Wong
- Abstract summary: We present a bidirectional projection network (BPNet) for joint 2D and 3D reasoning in an end-to-end manner.
Via the BPM, complementary 2D and 3D information can interact with each other at multiple architectural levels.
Our BPNet achieves top performance on the ScanNetV2 benchmark for both 2D and 3D semantic segmentation.
- Score: 69.29443390126805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 2D image representations are in regular grids and can be processed
efficiently, whereas 3D point clouds are unordered and scattered in 3D space.
The information inside these two visual domains is well complementary, e.g., 2D
images have fine-grained texture while 3D point clouds contain plentiful
geometry information. However, most current visual recognition systems process
them individually. In this paper, we present a \emph{bidirectional projection
network (BPNet)} for joint 2D and 3D reasoning in an end-to-end manner. It
contains 2D and 3D sub-networks with symmetric architectures that are
connected by our proposed \emph{bidirectional projection module (BPM)}. Via the
\emph{BPM}, complementary 2D and 3D information can interact with each other at
multiple architectural levels, such that advantages in these two visual domains
can be combined for better scene recognition. Extensive quantitative and
qualitative experimental evaluations show that joint reasoning over 2D and 3D
visual domains can benefit both 2D and 3D scene understanding simultaneously.
Our \emph{BPNet} achieves top performance on the ScanNetV2 benchmark for both
2D and 3D semantic segmentation. Code is available at
\url{https://github.com/wbhu/BPNet}.
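The core mechanism described in the abstract is a per-level exchange of features between the regular 2D grid and the unordered 3D points. Below is a minimal sketch of such a bidirectional exchange, assuming a precomputed point-to-pixel correspondence; the function names, tensor shapes, and simple additive fusion are illustrative assumptions rather than the authors' BPM implementation.

```python
# Minimal sketch of a bidirectional 2D<->3D feature exchange, assuming a
# precomputed point-to-pixel mapping (the paper derives such links from camera
# matrices); names, shapes, and additive fusion are illustrative only.
import torch


def project_3d_to_2d(point_feats, pix_uv, feat_2d_shape):
    """Scatter per-point features onto a 2D feature grid (mean over hits).

    point_feats   : (N, C) features of N 3D points
    pix_uv        : (N, 2) integer pixel coordinates (u, v) of each point
    feat_2d_shape : (C, H, W) target 2D feature-map shape
    """
    C, H, W = feat_2d_shape
    flat_idx = pix_uv[:, 1] * W + pix_uv[:, 0]               # (N,) linear pixel index
    grid = torch.zeros(C, H * W, dtype=point_feats.dtype)
    count = torch.zeros(H * W, dtype=point_feats.dtype)
    grid.index_add_(1, flat_idx, point_feats.t())             # sum features per pixel
    count.index_add_(0, flat_idx, torch.ones(len(flat_idx)))
    return (grid / count.clamp(min=1)).view(C, H, W)          # average over hits


def project_2d_to_3d(feat_2d, pix_uv):
    """Gather the 2D feature under each point's pixel back onto the 3D points."""
    return feat_2d[:, pix_uv[:, 1], pix_uv[:, 0]].t()         # (N, C)


# Toy usage: fuse the two streams by simple addition at one architectural level.
N, C, H, W = 1024, 32, 60, 80
point_feats = torch.randn(N, C)
feat_2d = torch.randn(C, H, W)
pix_uv = torch.stack([torch.randint(0, W, (N,)), torch.randint(0, H, (N,))], dim=1)

feat_2d_fused = feat_2d + project_3d_to_2d(point_feats, pix_uv, (C, H, W))
point_feats_fused = point_feats + project_2d_to_3d(feat_2d, pix_uv)
```

In BPNet this kind of exchange is applied at multiple architectural levels of the symmetric 2D and 3D sub-networks, so that texture cues flow into the point stream and geometric cues flow into the image stream.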
Related papers
- ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images [47.682942867405224]
ConDense is a framework for 3D pre-training utilizing existing 2D networks and large-scale multi-view datasets.
We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline.
arXiv Detail & Related papers (2024-08-30T05:57:01Z)
- ODIN: A Single Model for 2D and 3D Segmentation [34.612953668151036]
ODIN is a model that segments and labels both 2D RGB images and 3D point clouds.
It achieves state-of-the-art performance on ScanNet200, Matterport3D and AI2THOR 3D segmentation benchmarks.
arXiv Detail & Related papers (2024-01-04T18:59:25Z)
- SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves state-of-the-art performance on semantic scene completion on two large-scale benchmark datasets, MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
- Towards Deeper and Better Multi-view Feature Fusion for 3D Semantic Segmentation [17.557697146752652]
2D & 3D semantic segmentation has become mainstream in 3D scene understanding.
However, how to fuse and process the cross-dimensional features from these two distinct spaces remains unclear.
In this paper, we argue that, despite its simplicity, unidirectionally projecting multi-view 2D deep semantic features into the 3D space and aligning them with 3D deep semantic features can lead to better feature fusion.
arXiv Detail & Related papers (2022-12-13T15:58:25Z)
- MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views and set up a dense correspondence learning task within a contrastive learning framework (a toy sketch of such a correspondence loss follows this list).
As a result, the learned 2D representations are view-invariant and geometrically consistent.
arXiv Detail & Related papers (2022-08-18T00:48:15Z)
- 3D-Aware Indoor Scene Synthesis with Depth Priors [62.82867334012399]
Existing methods fail to model indoor scenes due to the large diversity of room layouts and the objects inside.
We argue that indoor scenes do not have a shared intrinsic structure, and hence 2D images alone cannot adequately guide the model with 3D geometry.
arXiv Detail & Related papers (2022-02-17T09:54:29Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that leverages 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network, so that it learns simulated 3D features from 2D features during training (a toy distillation-loss sketch follows this list).
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
- Semantic Correspondence via 2D-3D-2D Cycle [58.023058561837686]
We propose a new method for predicting semantic correspondences by lifting the problem to the 3D domain.
We show that our method gives comparable and even superior results on standard semantic benchmarks.
arXiv Detail & Related papers (2020-04-20T05:27:45Z)
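The MvDeCor entry above learns 2D representations by tying together pixels in different renderings that come from the same 3D surface point. A toy version of that dense-correspondence contrastive loss is sketched below, assuming the corresponding pixel features have already been extracted and paired; this simplified InfoNCE form is an assumption, not the paper's exact formulation.

```python
# Toy dense-correspondence contrastive loss: row i of feats_a and row i of
# feats_b are features of the same 3D surface point rendered in two views
# (positives); all other rows act as negatives. Simplified InfoNCE, for
# illustration only.
import torch
import torch.nn.functional as F


def correspondence_info_nce(feats_a, feats_b, temperature=0.07):
    """feats_a, feats_b: (M, C) features of M corresponding pixel pairs."""
    a = F.normalize(feats_a, dim=1)
    b = F.normalize(feats_b, dim=1)
    logits = a @ b.t() / temperature           # (M, M) cosine-similarity logits
    targets = torch.arange(len(a))             # matching pairs lie on the diagonal
    return F.cross_entropy(logits, targets)


# Usage with random stand-in features for 256 corresponding pixel pairs.
loss = correspondence_info_nce(torch.randn(256, 128), torch.randn(256, 128))
```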
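Similarly, the 3D-to-2D distillation entry supervises a 2D network with features from a pretrained 3D network. The snippet below is a toy feature-distillation loss under the assumption that point and pixel features are already paired via camera projection; the projection head and plain L2 objective are illustrative choices, not the paper's two-stage normalization or adversarial components.

```python
# Toy 3D-to-2D feature distillation: a 2D "student" feature is projected into
# the 3D "teacher" feature space and pulled toward it with an L2 loss.
# Illustrative only; assumes point/pixel features are already aligned.
import torch
import torch.nn as nn


class FeatureDistillLoss(nn.Module):
    def __init__(self, dim_2d: int, dim_3d: int):
        super().__init__()
        # Small head so the 2D student can match the 3D teacher's feature space.
        self.head = nn.Linear(dim_2d, dim_3d)

    def forward(self, feats_2d: torch.Tensor, feats_3d: torch.Tensor) -> torch.Tensor:
        # feats_2d: (N, dim_2d) student features sampled at projected pixels
        # feats_3d: (N, dim_3d) frozen teacher features from a pretrained 3D network
        pred = self.head(feats_2d)
        return ((pred - feats_3d.detach()) ** 2).mean()


# Usage with random stand-in features for 512 point/pixel pairs.
loss_fn = FeatureDistillLoss(dim_2d=64, dim_3d=96)
loss = loss_fn(torch.randn(512, 64), torch.randn(512, 96))
loss.backward()
```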