Cross Modal Transformer: Towards Fast and Robust 3D Object Detection
- URL: http://arxiv.org/abs/2301.01283v3
- Date: Mon, 18 Sep 2023 09:53:25 GMT
- Title: Cross Modal Transformer: Towards Fast and Robust 3D Object Detection
- Authors: Junjie Yan, Yingfei Liu, Jianjian Sun, Fan Jia, Shuailin Li, Tiancai
Wang, Xiangyu Zhang
- Abstract summary: We propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection.
CMT takes image and point-cloud tokens as inputs and directly outputs accurate 3D bounding boxes.
It achieves 74.1% NDS on the nuScenes test set while maintaining fast inference speed.
- Score: 34.920322396476934
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a robust 3D detector, named Cross Modal Transformer
(CMT), for end-to-end 3D multi-modal detection. Without explicit view
transformation, CMT takes image and point-cloud tokens as inputs and
directly outputs accurate 3D bounding boxes. The spatial alignment of
multi-modal tokens is performed by encoding the 3D points into multi-modal
features. The core design of CMT is quite simple while its performance is
impressive. It achieves 74.1% NDS (state of the art with a single model) on the
nuScenes test set while maintaining fast inference speed. Moreover, CMT remains
strongly robust even if the LiDAR is missing. Code is released at
https://github.com/junjie18/CMT.
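To make the alignment idea concrete, here is a minimal sketch, assuming hypothetical shapes, a shared MLP coordinate encoder, and random placeholder features (see https://github.com/junjie18/CMT for the actual implementation): 3D points tied to each token are encoded into position embeddings, so image and LiDAR tokens share one 3D frame before a standard transformer decoder attends to them.

```python
# Minimal sketch of CMT-style implicit alignment; shapes, the MLP
# coordinate encoder, and the random features are illustrative
# assumptions, not the official implementation.
import torch
import torch.nn as nn

class CoordEncoder(nn.Module):
    """Map sampled 3D points to the token embedding space."""
    def __init__(self, num_depth_points: int, dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(num_depth_points * 3, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, pts3d: torch.Tensor) -> torch.Tensor:
        # pts3d: (num_tokens, num_depth_points, 3) points tied to each token
        return self.mlp(pts3d.flatten(1))

dim = 256
enc = CoordEncoder(num_depth_points=4, dim=dim)

# Hypothetical tokens: image tokens pair with points sampled along camera
# rays; LiDAR (BEV) tokens pair with points sampled along the height axis.
img_tokens, pts_tokens = torch.randn(600, dim), torch.randn(400, dim)
img_pts3d, lidar_pts3d = torch.randn(600, 4, 3), torch.randn(400, 4, 3)

# Adding 3D-coordinate embeddings places both modalities in one 3D frame,
# so no explicit view transformation (e.g., lifting to BEV) is required.
tokens = torch.cat([img_tokens + enc(img_pts3d),
                    pts_tokens + enc(lidar_pts3d)], dim=0).unsqueeze(0)

queries = torch.randn(1, 900, dim)  # object queries, batch-first
layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(layer, num_layers=6)
out = decoder(queries, tokens)      # (1, 900, dim), fed to box/class heads
print(out.shape)
```

Because both modalities carry embeddings of the same 3D coordinates, cross-attention can align them implicitly, which is what removes the need for an explicit view transformation.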
Related papers
- Make Your ViT-based Multi-view 3D Detectors Faster via Token Compression [78.93023152602525]
Slow inference speed is one of the most pressing obstacles to deploying multi-view 3D detectors in tasks with strict real-time requirements, such as autonomous driving.
We propose a simple yet effective method called TokenCompression3D (ToC3D).
Our method nearly maintains the performance of the recent SOTA while delivering up to a 30% inference speedup, and the improvements are consistent after scaling up the ViT and input resolution.
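As a rough sketch of the token-compression idea, the snippet below scores tokens with a hypothetical linear head and a fixed keep ratio; the paper's actual importance estimation and compression schedule may differ.

```python
# Minimal sketch of importance-based token pruning in the spirit of
# ToC3D; the linear scoring head and keep ratio are assumptions.
import torch
import torch.nn as nn

def compress_tokens(tokens: torch.Tensor, scorer: nn.Module, keep_ratio: float = 0.7):
    """Keep the top-k image tokens by predicted importance.

    tokens: (B, N, C) ViT patch tokens.
    Returns the pruned tokens (B, k, C) and the kept indices.
    """
    scores = scorer(tokens).squeeze(-1)          # (B, N) importance per token
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices          # indices of the k best tokens
    kept = torch.gather(tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1]))
    return kept, idx

B, N, C = 2, 1024, 768
scorer = nn.Linear(C, 1)                         # hypothetical scoring head
kept, idx = compress_tokens(torch.randn(B, N, C), scorer)
print(kept.shape)                                # (2, 716, 768): ~30% fewer tokens
```

Dropping roughly 30% of the patch tokens shrinks the cost of every subsequent attention layer, which is where the inference speedup comes from.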
arXiv Detail & Related papers (2024-09-01T06:58:08Z)
- EmbodiedSAM: Online Segment Any 3D Thing in Real Time [61.2321497708998]
Embodied tasks require an agent to understand the 3D scene even as it explores it.
An online, real-time, fine-grained, and highly generalized 3D perception model is urgently needed.
arXiv Detail & Related papers (2024-08-21T17:57:06Z)
- CT3D++: Improving 3D Object Detection with Keypoint-induced Channel-wise Transformer [42.68740105997167]
We introduce two frameworks for 3D object detection with minimal hand-crafted design.
Firstly, we propose CT3D, which sequentially performs raw-point-based embedding, a standard Transformer encoder, and a channel-wise decoder for point features within each proposal.
Secondly, we present an enhanced network called CT3D++, which incorporates geometric and semantic fusion-based embedding to extract more valuable and comprehensive proposal-aware information.
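A compact sketch of this per-proposal pipeline (raw-point embedding into a standard Transformer encoder) follows; the mean-pooled read-out is a simplified stand-in for the channel-wise decoder, and all shapes are illustrative assumptions.

```python
# Minimal sketch of a CT3D-style per-proposal pipeline: raw-point
# embedding -> standard Transformer encoder -> box refinement.
import torch
import torch.nn as nn

class ProposalEncoder(nn.Module):
    def __init__(self, dim: int = 256, num_layers: int = 3):
        super().__init__()
        self.embed = nn.Linear(3, dim)  # raw (x, y, z) -> token embedding
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, 7)   # (x, y, z, w, l, h, yaw) refinement

    def forward(self, pts: torch.Tensor) -> torch.Tensor:
        # pts: (num_proposals, points_per_proposal, 3), points already
        # cropped by each proposal box and expressed in its local frame
        tokens = self.encoder(self.embed(pts))
        # Mean pooling is a simplified stand-in for the paper's
        # channel-wise decoder.
        return self.head(tokens.mean(dim=1))

boxes = ProposalEncoder()(torch.randn(128, 64, 3))
print(boxes.shape)  # (128, 7): one refined box per proposal
```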
arXiv Detail & Related papers (2024-06-12T12:40:28Z)
- Instant3D: Instant Text-to-3D Generation [101.25562463919795]
We propose a novel framework for fast text-to-3D generation, dubbed Instant3D.
Instant3D is able to create a 3D object for an unseen text prompt in less than one second with a single run of a feedforward network.
arXiv Detail & Related papers (2023-11-14T18:59:59Z)
- Multi-Modal 3D Object Detection by Box Matching [109.43430123791684]
We propose a novel Fusion network by Box Matching (FBMNet) for multi-modal 3D detection.
With the learned assignments between 3D and 2D object proposals, fusion for detection can be performed effectively by combining their RoI features.
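Here is a minimal sketch of matching-based fusion in this spirit, where a learned similarity matrix softly assigns 2D proposals to 3D ones before the RoI features are combined; the similarity measure and fusion operator are assumptions for illustration.

```python
# Minimal sketch of soft-assignment fusion between 3D and 2D proposals,
# loosely following the box-matching idea; not FBMNet's exact design.
import torch

def fuse_by_matching(feat3d, feat2d, sim):
    """feat3d: (M, C) 3D RoI features; feat2d: (N, C) 2D RoI features;
    sim: (M, N) learned matching scores between proposals."""
    assign = sim.softmax(dim=1)            # each 3D proposal attends to 2D boxes
    matched2d = assign @ feat2d            # (M, C) aggregated 2D evidence
    return torch.cat([feat3d, matched2d], dim=1)  # fused RoI features -> heads

M, N, C = 100, 80, 256
fused = fuse_by_matching(torch.randn(M, C), torch.randn(N, C), torch.randn(M, N))
print(fused.shape)  # (100, 512)
```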
arXiv Detail & Related papers (2023-05-12T18:08:51Z)
- TR3D: Towards Real-Time Indoor 3D Object Detection [6.215404942415161]
TR3D is a fully-convolutional 3D object detection model trained end-to-end.
To take advantage of both point cloud and RGB inputs, we introduce an early fusion of 2D and 3D features.
Our model with early feature fusion, which we refer to as TR3D+FF, outperforms existing 3D object detection approaches on the SUN RGB-D dataset.
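An illustrative sketch of early 2D-3D feature fusion of this kind: project each 3D point into the image with the camera intrinsics, bilinearly sample the 2D feature map, and concatenate with the point features. The intrinsics, shapes, and sampling details are placeholder assumptions, not TR3D+FF's exact design.

```python
# Minimal sketch of early 2D-3D feature fusion; intrinsics and shapes
# are placeholders.
import torch
import torch.nn.functional as F

def early_fusion(pts, pt_feats, img_feats, K, img_hw):
    """pts: (N, 3) points in the camera frame; pt_feats: (N, C3);
    img_feats: (1, C2, Hf, Wf) 2D feature map; K: (3, 3) intrinsics;
    img_hw: (H, W) of the original image, for coordinate normalization."""
    uv = (K @ pts.T).T                        # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    H, W = img_hw
    # Normalize pixel coords to [-1, 1]; out-of-view points sample zeros.
    grid = torch.stack([uv[:, 0] / (W - 1), uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(img_feats, grid.view(1, -1, 1, 2),
                            align_corners=True)  # (1, C2, N, 1)
    sampled = sampled.squeeze(0).squeeze(-1).T   # (N, C2)
    return torch.cat([pt_feats, sampled], dim=1) # (N, C3 + C2)

K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
pts = torch.randn(2048, 3) + torch.tensor([0., 0., 5.])  # points in front of camera
fused = early_fusion(pts, torch.randn(2048, 64),
                     torch.randn(1, 128, 60, 80), K, img_hw=(480, 640))
print(fused.shape)  # (2048, 192)
```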
arXiv Detail & Related papers (2023-02-06T15:25:50Z)
- Multimodal Transformer for Automatic 3D Annotation and Object Detection [27.92241487946078]
We propose an end-to-end multimodal transformer (MTrans) autolabeler to generate precise 3D box annotations from weak 2D bounding boxes.
With a multi-task design, MTrans segments the foreground/background, densifies LiDAR point clouds, and regresses 3D boxes simultaneously.
By enriching the sparse point clouds, our method achieves 4.48% and 4.03% better 3D AP on KITTI moderate and hard samples, respectively.
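A minimal sketch of such a multi-task design follows, with one shared per-point feature feeding three heads for segmentation, densification, and box regression; head shapes, loss weighting, and the random targets are placeholder assumptions.

```python
# Minimal sketch of MTrans-style multi-task supervision; heads, loss
# terms, and targets are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskHeads(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.seg = nn.Linear(dim, 1)    # per-point foreground logit
        self.dense = nn.Linear(dim, 3)  # predicted point offsets (densification)
        self.box = nn.Linear(dim, 7)    # (x, y, z, w, l, h, yaw) per object

    def forward(self, feats):
        return self.seg(feats), self.dense(feats), self.box(feats.mean(dim=1))

heads = MultiTaskHeads()
feats = torch.randn(4, 512, 256)        # (batch, points, channels), shared features
seg, dense, box = heads(feats)

# Random placeholder targets; real training uses labels derived from the
# weak 2D boxes. The three losses would be weighted in practice.
seg_gt = torch.randint(0, 2, (4, 512, 1)).float()
loss = (F.binary_cross_entropy_with_logits(seg, seg_gt)
        + F.l1_loss(dense, torch.randn(4, 512, 3))
        + F.l1_loss(box, torch.randn(4, 7)))
loss.backward()
```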
arXiv Detail & Related papers (2022-07-20T10:38:29Z)
- Progressive Coordinate Transforms for Monocular 3D Object Detection [52.00071336733109]
We propose a novel and lightweight approach, dubbed Progressive Coordinate Transforms (PCT), to facilitate learning coordinate representations.
arXiv Detail & Related papers (2021-08-12T15:22:33Z)
- Weakly Supervised Volumetric Image Segmentation with Deformed Templates [80.04326168716493]
We propose an approach that is truly weakly supervised in the sense that we only need to provide a sparse set of 3D points on the surface of the target objects.
We show that it outperforms a more traditional approach to weak supervision in 3D at a reduced supervision cost.
arXiv Detail & Related papers (2021-06-07T22:09:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.