Multi-Modality Task Cascade for 3D Object Detection
- URL: http://arxiv.org/abs/2107.04013v1
- Date: Thu, 8 Jul 2021 17:55:01 GMT
- Title: Multi-Modality Task Cascade for 3D Object Detection
- Authors: Jinhyung Park, Xinshuo Weng, Yunze Man, Kris Kitani
- Abstract summary: Many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data.
We propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions.
We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance.
- Score: 22.131228757850373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Point clouds and RGB images are naturally complementary modalities for 3D
visual understanding - the former provides sparse but accurate locations of
points on objects, while the latter contains dense color and texture
information. Despite this potential for close sensor fusion, many methods train
two models in isolation and use simple feature concatenation to represent 3D
sensor data. This separated training scheme results in potentially sub-optimal
performance and prevents 3D tasks from being used to benefit 2D tasks that are
often useful on their own. To provide a more integrated approach, we propose a
novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box
proposals to improve 2D segmentation predictions, which are then used to
further refine the 3D boxes. We show that including a 2D network between two
stages of 3D modules significantly improves both 2D and 3D task performance.
Moreover, to prevent the 3D module from over-relying on the overfitted 2D
predictions, we propose a dual-head 2D segmentation training and inference
scheme, allowing the second 3D module to learn to interpret imperfect 2D
segmentation predictions. Evaluating our model on the challenging SUN RGB-D
dataset, we improve upon state-of-the-art results of both single-modality and
fusion networks by a large margin (+3.8 mAP@0.5). Code will be released at
https://github.com/Divadi/MTC_RCNN.
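For intuition, the cascade described in the abstract can be sketched as follows: a first-stage 3D module proposes coarse boxes, a 2D segmentation network is conditioned on the image together with those proposals, and a second-stage 3D module refines the boxes using the segmentation output. This is a minimal, hedged sketch: all module names, tensor shapes, and the pooling used to lift masks back to points are illustrative assumptions, not the released MTC-RCNN implementation (see the repository above for the reference code).

```python
import torch
import torch.nn as nn


class MTCCascadeSketch(nn.Module):
    """Illustrative 3D -> 2D -> 3D cascade; placeholder heads, not the paper's networks."""

    def __init__(self, num_classes: int = 10):
        super().__init__()
        # Stage-1 3D module: consumes per-proposal point features, emits coarse boxes + scores.
        self.stage1_3d = nn.Linear(128, 7 + num_classes)
        # 2D segmentation conditioned on the RGB image plus a rasterized 3D-proposal prior.
        self.seg_2d = nn.Conv2d(3 + 1, num_classes, kernel_size=1)
        # Second segmentation head (dual-head scheme): its output is what the next 3D stage
        # consumes, so that stage learns to cope with imperfect masks.
        self.seg_2d_aux = nn.Conv2d(3 + 1, num_classes, kernel_size=1)
        # Stage-2 3D module: refines boxes from point features augmented with mask evidence.
        self.stage2_3d = nn.Linear(128 + num_classes, 7 + num_classes)

    def forward(self, point_feats, image, proposal_prior):
        # 1) Coarse 3D box proposals.
        proposals = self.stage1_3d(point_feats)                    # (N, 7 + C)
        # 2) 2D segmentation conditioned on the proposals.
        seg_in = torch.cat([image, proposal_prior], dim=1)         # (B, 4, H, W)
        masks_main = self.seg_2d(seg_in)                           # supervised 2D output
        masks_for_3d = self.seg_2d_aux(seg_in)                     # fed to the 3D refinement
        # 3) Lift mask scores back to the points. The paper associates pixels with points
        #    via projection; global average pooling is only a stand-in here.
        lifted = masks_for_3d.mean(dim=(2, 3)).expand(point_feats.size(0), -1)
        # 4) Refine the 3D boxes with segmentation-augmented point features.
        refined = self.stage2_3d(torch.cat([point_feats, lifted], dim=1))
        return proposals, masks_main, refined


if __name__ == "__main__":
    model = MTCCascadeSketch()
    pts = torch.randn(64, 128)          # point/proposal features
    img = torch.randn(1, 3, 32, 32)     # RGB image
    prior = torch.randn(1, 1, 32, 32)   # rasterized 3D-proposal prior map
    coarse, masks, refined = model(pts, img, prior)
    print(coarse.shape, masks.shape, refined.shape)
```

The dual-head idea is only marked here by which head's output flows into the second 3D stage; how the two heads are trained and combined at inference follows the paper, not this sketch.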
Related papers
- ConDense: Consistent 2D/3D Pre-training for Dense and Sparse Features from Multi-View Images [47.682942867405224]
ConDense is a framework for 3D pre-training utilizing existing 2D networks and large-scale multi-view datasets.
We propose a novel 2D-3D joint training scheme to extract co-embedded 2D and 3D features in an end-to-end pipeline.
arXiv Detail & Related papers (2024-08-30T05:57:01Z)
- Multi-View Representation is What You Need for Point-Cloud Pre-Training [22.55455166875263]
This paper proposes a novel approach to point-cloud pre-training that learns 3D representations by leveraging pre-trained 2D networks.
We train the 3D feature extraction network with the help of a novel 2D knowledge transfer loss.
Experimental results demonstrate that our pre-trained model can be successfully transferred to various downstream tasks.
arXiv Detail & Related papers (2023-06-05T03:14:54Z)
- Joint-MAE: 2D-3D Joint Masked Autoencoders for 3D Point Cloud Pre-training [65.75399500494343]
Masked Autoencoders (MAE) have shown promising performance in self-supervised learning for 2D and 3D computer vision.
We propose Joint-MAE, a 2D-3D joint MAE framework for self-supervised 3D point cloud pre-training.
arXiv Detail & Related papers (2023-02-27T17:56:18Z)
- SSR-2D: Semantic 3D Scene Reconstruction from 2D Images [54.46126685716471]
In this work, we explore a central 3D scene modeling task, namely, semantic scene reconstruction without using any 3D annotations.
The key idea of our approach is to design a trainable model that employs both incomplete 3D reconstructions and their corresponding source RGB-D images.
Our method achieves state-of-the-art performance in semantic scene completion on two large-scale benchmark datasets, MatterPort3D and ScanNet.
arXiv Detail & Related papers (2023-02-07T17:47:52Z)
- MvDeCor: Multi-view Dense Correspondence Learning for Fine-grained 3D Segmentation [91.6658845016214]
We propose to utilize self-supervised techniques in the 2D domain for fine-grained 3D shape segmentation tasks.
We render a 3D shape from multiple views and set up a dense correspondence learning task within the contrastive learning framework.
As a result, the learned 2D representations are view-invariant and geometrically consistent.
arXiv Detail & Related papers (2022-08-18T00:48:15Z)
- Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR-based Perception [122.53774221136193]
State-of-the-art methods for driving-scene LiDAR-based perception often project the point clouds to 2D space and then process them via 2D convolution.
A natural remedy is to utilize 3D voxelization and 3D convolution networks.
We propose a new framework for outdoor LiDAR segmentation, in which cylindrical partition and asymmetrical 3D convolution networks are designed to explore the 3D geometric pattern.
arXiv Detail & Related papers (2021-09-12T06:25:11Z)
- IMENet: Joint 3D Semantic Scene Completion and 2D Semantic Segmentation through Iterative Mutual Enhancement [12.091735711364239]
We propose an Iterative Mutual Enhancement Network (IMENet) to jointly solve 3D semantic scene completion and 2D semantic segmentation.
IMENet interactively refines the two tasks at the late prediction stage.
Our approach outperforms the state of the art on both 3D semantic scene completion and 2D semantic segmentation.
arXiv Detail & Related papers (2021-06-29T13:34:20Z)
- FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection [78.00922683083776]
It is non-trivial to adapt a general 2D detector to this monocular 3D detection task.
In this technical report, we study the problem with a practice built on a fully convolutional single-stage detector.
Our solution achieves first place among all vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020.
arXiv Detail & Related papers (2021-04-22T09:35:35Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that leverages 3D features extracted from a large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
- 3D-MiniNet: Learning a 2D Representation from Point Clouds for Fast and Efficient 3D LIDAR Semantic Segmentation [9.581605678437032]
3D-MiniNet is a novel approach for LIDAR semantic segmentation that combines 3D and 2D learning layers.
It first learns a 2D representation from the raw points through a novel projection that extracts local and global information from the 3D data.
This representation is processed by a 2D segmentation network, and the resulting 2D semantic labels are re-projected back to the 3D space and enhanced through a post-processing module.
arXiv Detail & Related papers (2020-02-25T14:33:50Z)