3M3D: Multi-view, Multi-path, Multi-representation for 3D Object
Detection
- URL: http://arxiv.org/abs/2302.08231v3
- Date: Fri, 28 Jul 2023 10:51:37 GMT
- Title: 3M3D: Multi-view, Multi-path, Multi-representation for 3D Object
Detection
- Authors: Jongwoo Park, Apoorv Singh, Varun Bankiti
- Abstract summary: We propose 3M3D: A Multi-view, Multi-path, Multi-representation for 3D Object Detection.
We update both multi-view features and query features to enhance the representation of the scene in both a fine panoramic view and a coarse global view.
We show performance improvements over our baselines on the nuScenes benchmark dataset.
- Score: 0.5156484100374059
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D visual perception tasks based on multi-camera images are essential for
autonomous driving systems. Latest work in this field performs 3D object
detection by leveraging multi-view images as an input and iteratively enhancing
object queries (object proposals) by cross-attending multi-view features.
However, the individual backbone features are never updated with multi-view
information and remain a mere collection of single-image backbone outputs.
We therefore propose 3M3D: a Multi-view, Multi-path,
Multi-representation for 3D Object Detection where we update both multi-view
features and query features to enhance the representation of the scene in both
a fine panoramic view and a coarse global view. First, we update the
multi-view features with self-attention along the multi-view axis, which
incorporates panoramic information into the features and improves
understanding of the global scene. Second, we update the multi-view features
with self-attention over ROI (Region of Interest) windows, which encodes
finer local details and exchanges information not only along the multi-view
axis but also along the spatial dimensions. Lastly, we leverage multiple
representations of queries in different domains to further boost
performance: sparse floating queries are used alongside dense BEV (Bird's
Eye View) queries, which are later post-processed to filter duplicate
detections. Moreover, we demonstrate performance improvements over our
baselines on the nuScenes benchmark dataset.
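As a concrete illustration of the first mechanism, below is a minimal PyTorch sketch of self-attention along the multi-view axis: the spatial grid is folded into the batch so that, at every (h, w) location, the camera views attend to one another. The module name, tensor shapes, and residual update are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiViewAxisSelfAttention(nn.Module):
    """Hypothetical sketch: self-attention across the camera-view axis."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, V, C, H, W) -- B samples, V camera views, C channels.
        b, v, c, h, w = feats.shape
        # Fold the spatial grid into the batch; the sequence axis becomes V,
        # so each (h, w) location attends across all camera views.
        x = feats.permute(0, 3, 4, 1, 2).reshape(b * h * w, v, c)
        out, _ = self.attn(x, x, x)
        out = out.reshape(b, h, w, v, c).permute(0, 3, 4, 1, 2)
        return feats + out  # residual update keeps single-view content

# e.g. six nuScenes cameras with 256-channel feature maps
views = torch.randn(2, 6, 256, 16, 44)
panoramic = MultiViewAxisSelfAttention(dim=256)(views)  # same shape out
```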
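The second mechanism can be sketched analogously. Here, fixed non-overlapping windows stand in for the paper's ROI windows (an assumption); each small spatial window of a view attends within itself to refine local detail.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Hypothetical sketch: self-attention inside local spatial windows."""

    def __init__(self, dim: int, window: int = 4, num_heads: int = 8):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, V, C, H, W); H and W assumed divisible by the window size.
        b, v, c, h, w = feats.shape
        s = self.window
        x = feats.reshape(b * v, c, h // s, s, w // s, s)
        # Each (s x s) window becomes one attention sequence of s*s tokens.
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(-1, s * s, c)
        out, _ = self.attn(x, x, x)
        out = out.reshape(b * v, h // s, w // s, s, s, c)
        out = out.permute(0, 5, 1, 3, 2, 4).reshape(b, v, c, h, w)
        return feats + out  # residual update, as in the previous sketch
```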
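The dual query representation implies a merge step: the sparse floating queries and the dense BEV queries produce overlapping detections, so duplicates must be filtered. The greedy BEV centre-distance suppression below is only an assumed stand-in for the unspecified post-processing.

```python
import torch

def merge_and_filter(boxes_a, scores_a, boxes_b, scores_b, radius=0.5):
    """Merge two detection sets and greedily suppress BEV duplicates."""
    # boxes_*: (N, 3) box centres (x, y, z); scores_*: (N,) confidences.
    boxes = torch.cat([boxes_a, boxes_b])
    scores = torch.cat([scores_a, scores_b])
    keep = []
    for i in scores.argsort(descending=True).tolist():
        centre = boxes[i, :2]
        # Keep a box only if no higher-scoring kept box lies within
        # `radius` metres of it in the BEV plane.
        if all((boxes[j, :2] - centre).norm() > radius for j in keep):
            keep.append(i)
    idx = torch.tensor(keep, dtype=torch.long)
    return boxes[idx], scores[idx]
```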
Related papers
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework that fuses information from RGB images and LiDAR point clouds at points of interest (PoIs).
Our approach maintains the view of each modality and obtains multi-modal features via computation-friendly projection.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
arXiv Detail & Related papers (2024-03-14T09:28:12Z)
- ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion [61.37481051263816]
Given a single image of a 3D object, this paper proposes a method (named ConsistNet) that is able to generate multiple images of the same object.
Our method effectively learns 3D consistency over a frozen Zero123 backbone and can generate 16 surrounding views of the object within 40 seconds on a single A100 GPU.
arXiv Detail & Related papers (2023-10-16T12:29:29Z)
- SCA-PVNet: Self-and-Cross Attention Based Aggregation of Point Cloud and
Multi-View for 3D Object Retrieval [8.74845857766369]
Multi-modality 3D object retrieval has rarely been developed and analyzed on large-scale datasets.
We propose self-and-cross attention based aggregation of point cloud and multi-view images (SCA-PVNet) for 3D object retrieval.
arXiv Detail & Related papers (2023-07-20T05:46:32Z)
- ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with
GPT and Prototype Guidance [48.748738590964216]
We propose ViewRefer, a multi-view framework for 3D visual grounding.
For the text branch, ViewRefer expands a single grounding text to multiple geometry-consistent descriptions.
In the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views.
arXiv Detail & Related papers (2023-03-29T17:59:10Z)
- A Simple Baseline for Multi-Camera 3D Object Detection [94.63944826540491]
3D object detection with surrounding cameras has been a promising direction for autonomous driving.
We present SimMOD, a Simple baseline for Multi-camera Object Detection.
We conduct extensive experiments on the 3D object detection benchmark of nuScenes to demonstrate the effectiveness of SimMOD.
arXiv Detail & Related papers (2022-08-22T03:38:01Z)
- Wide-Area Crowd Counting: Multi-View Fusion Networks for Counting in
Large Scenes [50.744452135300115]
We propose a deep neural network framework for multi-view crowd counting.
Our methods achieve state-of-the-art results compared to other multi-view counting baselines.
arXiv Detail & Related papers (2020-12-02T03:20:30Z)
- Multiview Detection with Feature Perspective Transformation [59.34619548026885]
We propose a novel multiview detection system, MVDet.
We take an anchor-free approach to aggregate multiview information by projecting feature maps onto the ground plane (see the sketch below).
Our entire model is end-to-end learnable and achieves 88.2% MODA on the standard Wildtrack dataset.
arXiv Detail & Related papers (2020-07-14T17:58:30Z)
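Regarding the last entry, MVDet-style ground-plane aggregation can be pictured as warping per-view feature maps onto a common ground plane with per-camera homographies, then combining them. The sketch below uses kornia for the perspective warp and simple summation, both assumptions; the actual model may, for example, concatenate channels instead.

```python
import torch
from kornia.geometry.transform import warp_perspective

def project_to_ground(feat_maps, homographies, bev_size=(120, 360)):
    # feat_maps: list of (C, H, W) per-view feature tensors;
    # homographies: list of (3, 3) image-to-ground-plane transforms,
    # assumed to come from camera calibration.
    bev = 0
    for f, H in zip(feat_maps, homographies):
        warped = warp_perspective(f[None], H[None], dsize=bev_size)
        bev = bev + warped  # accumulate every view on the ground plane
    return bev              # (1, C, *bev_size)
```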
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.