M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object
Detection with Transformers
- URL: http://arxiv.org/abs/2104.11896v1
- Date: Sat, 24 Apr 2021 06:48:23 GMT
- Title: M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object
Detection with Transformers
- Authors: Tianrui Guan, Jun Wang, Shiyi Lan, Rohan Chandra, Zuxuan Wu, Larry
Davis, Dinesh Manocha
- Abstract summary: We present M3DeTR, which combines different point cloud representations with different feature scales based on multi-scale feature pyramids.
M3DeTR is the first approach that unifies multiple point cloud representations and feature scales while simultaneously modeling mutual relationships between point clouds with transformers.
- Score: 78.48081972698888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel architecture for 3D object detection, M3DeTR, which
combines different point cloud representations (raw points, voxels, bird's-eye view)
with different feature scales based on multi-scale feature pyramids. M3DeTR is
the first approach that unifies multiple point cloud representations and
feature scales, and simultaneously models mutual relationships between point
clouds, using transformers. We perform extensive ablation experiments that
highlight the benefits of fusing representations and scales and of modeling
their mutual relationships. Our method achieves state-of-the-art performance
on the
KITTI 3D object detection dataset and Waymo Open Dataset. Results show that
M3DeTR improves the baseline significantly by 1.48% mAP for all classes on
Waymo Open Dataset. In particular, our approach ranks 1st on the well-known
KITTI 3D Detection Benchmark for both car and cyclist classes, and ranks 1st on
Waymo Open Dataset with single frame point cloud input.
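To make the architecture concrete, the following is a minimal, hypothetical PyTorch sketch of the core idea: features from the three representations (raw points, voxels, bird's-eye view) at several scales are embedded as tokens, and a standard transformer encoder models their mutual relations jointly. All module names, shapes, and the token layout are illustrative assumptions, not the authors' implementation.
```python
# Minimal sketch (PyTorch) of multi-representation, multi-scale fusion with a
# transformer. Everything here (class name, toy shapes, token-type embedding)
# is an assumption for illustration, not the official M3DeTR code.
import torch
import torch.nn as nn

class MutualRelationFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=8, n_layers=2, n_scales=3):
        super().__init__()
        # One learnable embedding per (representation, scale) token type:
        # 3 representations (raw, voxel, BEV) x n_scales feature scales.
        self.type_embed = nn.Embedding(3 * n_scales, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, raw_feats, voxel_feats, bev_feats):
        # Each argument: list of (B, N, d_model) tensors, one per scale.
        tokens, type_ids = [], []
        for t, stream in enumerate((raw_feats, voxel_feats, bev_feats)):
            for s, f in enumerate(stream):
                tokens.append(f)
                type_ids.append(torch.full(f.shape[:2], t * len(stream) + s,
                                           dtype=torch.long, device=f.device))
        x = torch.cat(tokens, 1) + self.type_embed(torch.cat(type_ids, 1))
        # Self-attention lets every (representation, scale) token attend to
        # all others, jointly modeling their mutual relations.
        return self.encoder(x)

# Toy usage: 3 scales per representation, 16 tokens each.
make = lambda: [torch.randn(2, 16, 256) for _ in range(3)]
print(MutualRelationFusion()(make(), make(), make()).shape)  # (2, 144, 256)
```
The single self-attention stack is what lets every (representation, scale) token attend to every other one, which is the "mutual relation" modeling the abstract refers to.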
Related papers
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework to fuse information of RGB images and LiDAR point clouds at the points of interest (PoIs)
Our approach maintains the view of each modality and obtains multi-modal features through computation-friendly projection and interpolation.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
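As an illustration of fusion at points of interest, here is a hedged PyTorch sketch under simple assumptions (pinhole intrinsics, PoIs already in camera coordinates, additive fusion); the paper's exact formulation may differ.
```python
# Hypothetical sketch (PyTorch) of fusing image and LiDAR features at 3D
# points of interest: project each PoI into the image and bilinearly sample
# the feature map there. Names, additive fusion, and shapes are assumptions.
import torch
import torch.nn.functional as F

def fuse_at_pois(pois, lidar_feats, img_feats, intrinsics):
    """pois: (B, N, 3) points in camera coordinates; lidar_feats: (B, N, C);
    img_feats: (B, C, H, W); intrinsics: (B, 3, 3) pinhole camera matrix."""
    # Pinhole projection: (u, v, w) = K @ p, pixel = (u / w, v / w).
    uvw = torch.einsum('bij,bnj->bni', intrinsics, pois)
    uv = uvw[..., :2] / uvw[..., 2:3].clamp(min=1e-6)
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    H, W = img_feats.shape[-2:]
    grid = torch.stack([uv[..., 0] / (W - 1) * 2 - 1,
                        uv[..., 1] / (H - 1) * 2 - 1], dim=-1)
    sampled = F.grid_sample(img_feats, grid.unsqueeze(2),
                            align_corners=True).squeeze(-1)  # (B, C, N)
    # Simple additive fusion of the two modalities at each PoI.
    return lidar_feats + sampled.transpose(1, 2)

out = fuse_at_pois(torch.rand(1, 5, 3) + torch.tensor([0., 0., 5.]),
                   torch.randn(1, 5, 32), torch.randn(1, 32, 48, 64),
                   torch.eye(3).expand(1, 3, 3))
print(out.shape)  # torch.Size([1, 5, 32])
```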
arXiv Detail & Related papers (2024-03-14T09:28:12Z)
- FusionViT: Hierarchical 3D Object Detection via LiDAR-Camera Vision Transformer Fusion [8.168523242105763]
We will introduce a novel vision transformer-based 3D object detection model, namely FusionViT.
Our FusionViT model can achieve state-of-the-art performance and outperforms existing baseline methods.
arXiv Detail & Related papers (2023-11-07T00:12:01Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and Multi-View for 3D Object Retrieval [8.74845857766369]
Multi-modal 3D object retrieval has rarely been developed and analyzed on large-scale datasets.
We propose self-and-cross attention based aggregation of point cloud and multi-view images (SCA-PVNet) for 3D object retrieval.
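A rough sketch of what self-and-cross attention aggregation can look like in PyTorch; the shapes, the single attention block per direction, and the mean-pool readout are assumptions for illustration only.
```python
# Illustrative sketch (PyTorch) of cross-attention aggregation of point-cloud
# and multi-view features into one retrieval descriptor. Shapes, a single
# block per direction, and mean pooling are assumptions.
import torch
import torch.nn as nn

class CrossAggregate(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.p_from_v = nn.MultiheadAttention(d, heads, batch_first=True)
        self.v_from_p = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, point_tok, view_tok):
        # point_tok: (B, Np, d) point features; view_tok: (B, Nv, d) per-view
        # image features. Each stream queries the other (cross attention).
        p, _ = self.p_from_v(point_tok, view_tok, view_tok)
        v, _ = self.v_from_p(view_tok, point_tok, point_tok)
        # Pool both enriched streams into a single global descriptor.
        return torch.cat([p.mean(1), v.mean(1)], dim=-1)  # (B, 2d)

desc = CrossAggregate()(torch.randn(2, 1024, 256), torch.randn(2, 12, 256))
print(desc.shape)  # torch.Size([2, 512])
```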
arXiv Detail & Related papers (2023-07-20T05:46:32Z)
- MMF-Track: Multi-modal Multi-level Fusion for 3D Single Object Tracking [26.405519771454102]
3D single object tracking plays a crucial role in computer vision.
We propose a Multi-modal Multi-level Fusion Tracker (MMF-Track), which exploits image texture and the geometric characteristics of point clouds to track a 3D target.
Experiments show that our method achieves state-of-the-art performance on KITTI (39% Success and 42% Precision gains over the previous multi-modal method) and is also competitive on NuScenes.
arXiv Detail & Related papers (2023-05-11T13:34:02Z)
- 3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds [95.54285993019843]
We propose a method for joint detection and tracking of multiple objects in 3D point clouds.
Our model exploits temporal information, employing multiple frames to detect objects and track them in a single network.
arXiv Detail & Related papers (2022-11-01T20:59:38Z)
- Learning Feature Aggregation for Deep 3D Morphable Models [57.1266963015401]
We propose an attention-based module to learn mapping matrices for better feature aggregation across hierarchical levels.
Our experiments show that through the end-to-end training of the mapping matrices, we achieve state-of-the-art results on a variety of 3D shape datasets.
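As a simplified stand-in for these learned mapping matrices, the sketch below pools fine-level vertex features into a coarse level through an end-to-end-learned, softmax-normalized matrix; the paper's attention-based module is more involved, and all sizes here are made up.
```python
# Simplified stand-in (PyTorch) for learned feature-aggregation mapping
# matrices: a coarse mesh level pools fine-level vertex features through a
# softmax-normalized matrix trained end to end. All sizes are made up.
import torch
import torch.nn as nn

class LearnedPooling(nn.Module):
    def __init__(self, n_fine, n_coarse):
        super().__init__()
        # Unnormalized mapping logits, learned jointly with the rest of the net.
        self.logits = nn.Parameter(torch.randn(n_coarse, n_fine))

    def forward(self, fine_feats):              # fine_feats: (B, n_fine, C)
        mapping = self.logits.softmax(dim=-1)   # each coarse row sums to 1
        return mapping @ fine_feats             # (B, n_coarse, C)

coarse = LearnedPooling(1000, 250)(torch.randn(4, 1000, 64))
print(coarse.shape)  # torch.Size([4, 250, 64])
```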
arXiv Detail & Related papers (2021-05-05T16:41:00Z)
- Segment as Points for Efficient Online Multi-Object Tracking and Segmentation [66.03023110058464]
We propose a highly effective method for learning instance embeddings based on segments by converting the compact image representation to an unordered 2D point cloud representation.
Our method generates a new tracking-by-points paradigm where discriminative instance embeddings are learned from randomly selected points rather than images.
The resulting online MOTS framework, named PointTrack, surpasses all the state-of-the-art methods by large margins.
arXiv Detail & Related papers (2020-07-03T08:29:35Z)
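The tracking-by-points idea can be illustrated with a small, hypothetical sketch: an instance mask is turned into an unordered 2D point set by random sampling, and an order-invariant embedding is produced by a shared MLP with max-pooling. The sampling scheme and network sizes are assumptions, not the PointTrack implementation.
```python
# Hypothetical sketch (PyTorch) of tracking-by-points: sample an unordered 2D
# point set from an instance mask and embed it order-invariantly with a shared
# MLP plus max-pooling. Sampling scheme and layer sizes are assumptions.
import torch
import torch.nn as nn

def mask_to_points(mask, n=128):
    # mask: (H, W) boolean instance mask -> (n, 2) randomly sampled (x, y).
    ys, xs = torch.nonzero(mask, as_tuple=True)
    idx = torch.randint(0, len(xs), (n,))
    return torch.stack([xs[idx], ys[idx]], dim=-1).float()

embed = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 128))

mask = torch.zeros(480, 640, dtype=torch.bool)
mask[100:200, 300:400] = True                  # toy instance segment
pts = mask_to_points(mask)                     # (128, 2) unordered points
instance_emb = embed(pts).max(dim=0).values    # order-invariant pooling
print(instance_emb.shape)                      # torch.Size([128])
```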