Voxelized 3D Feature Aggregation for Multiview Detection
- URL: http://arxiv.org/abs/2112.03471v1
- Date: Tue, 7 Dec 2021 03:38:50 GMT
- Title: Voxelized 3D Feature Aggregation for Multiview Detection
- Authors: Jiahao Ma, Jinguang Tong, Shan Wang, Wei Zhao, Liang Zheng, Chuong
Nguyen
- Abstract summary: We propose VFA, voxelized 3D feature aggregation, for feature transformation and aggregation in multi-view detection.
Specifically, we voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with these projected voxels.
This allows us to identify and then aggregate 2D features along the same vertical line, alleviating projection distortions to a large extent.
- Score: 15.465855460519446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-view detection incorporates multiple camera views to alleviate
occlusion in crowded scenes, where the state-of-the-art approaches adopt
homography transformations to project multi-view features to the ground plane.
However, we find that these 2D transformations do not take into account the
object's height; as a result, features along the vertical direction of the
same object are likely not projected onto the same ground-plane point,
leading to impure ground-plane features. To solve this problem, we propose VFA,
voxelized 3D feature aggregation, for feature transformation and aggregation in
multi-view detection. Specifically, we voxelize the 3D space, project the
voxels onto each camera view, and associate 2D features with these projected
voxels. This allows us to identify and then aggregate 2D features along the
same vertical line, alleviating projection distortions to a large extent.
Additionally, because different kinds of objects (human vs. cattle) have
different shapes on the ground plane, we introduce the oriented Gaussian
encoding to match such shapes, leading to increased accuracy and efficiency. We
perform experiments on multiview 2D detection and multiview 3D detection
problems. Results on four datasets (including a newly introduced MultiviewC
dataset) show that our system is very competitive compared with the
state-of-the-art approaches. Code and MultiviewC are released at
https://github.com/Robert-Mar/VFA.
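The following is a minimal NumPy sketch of the two ideas in the abstract, written for illustration only: voxel centers are projected into each camera view with a 3x4 projection matrix, per-view 2D features are read out at the projected pixels, and all voxels sharing the same vertical column are averaged into a single ground-plane cell; a small oriented Gaussian shows a shape-aware ground-plane target. The grid layout, nearest-neighbour sampling, and mean pooling are assumptions, not the released implementation at https://github.com/Robert-Mar/VFA.

```python
import numpy as np

def project_points(P, pts_3d):
    """Project Nx3 world points with a 3x4 camera projection matrix P.
    Returns Nx2 pixel coordinates and per-point depth."""
    pts_h = np.concatenate([pts_3d, np.ones((len(pts_3d), 1))], axis=1)  # Nx4
    uvw = pts_h @ P.T                                                    # Nx3
    depth = uvw[:, 2]
    uv = uvw[:, :2] / np.clip(depth[:, None], 1e-6, None)
    return uv, depth

def voxelized_feature_aggregation(feature_maps, proj_mats, grid_min, voxel_size, grid_shape):
    """Toy voxelized 3D feature aggregation.
    feature_maps: list of HxWxC per-view feature maps.
    proj_mats:    list of 3x4 world-to-pixel projection matrices.
    grid_shape:   (X, Y, Z) voxel counts, Z being the vertical axis.
    Returns an (X, Y, C) ground-plane feature map."""
    X, Y, Z = grid_shape
    C = feature_maps[0].shape[2]
    # Voxel centers in world coordinates.
    xs, ys, zs = np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z), indexing="ij")
    centers = np.stack([xs, ys, zs], -1).reshape(-1, 3) * voxel_size + grid_min + 0.5 * voxel_size
    acc = np.zeros((X * Y * Z, C))
    cnt = np.zeros((X * Y * Z, 1))
    for fmap, P in zip(feature_maps, proj_mats):
        H, W, _ = fmap.shape
        uv, depth = project_points(P, centers)
        u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)
        # Keep voxels that land inside the image and in front of the camera.
        ok = (depth > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
        acc[ok] += fmap[v[ok], u[ok]]        # nearest-neighbour feature sampling
        cnt[ok] += 1.0
    voxel_feats = (acc / np.clip(cnt, 1.0, None)).reshape(X, Y, Z, C)
    # Collapse each vertical voxel column into one ground-plane cell, so
    # features of the same object stay on the same ground-plane point.
    return voxel_feats.mean(axis=2)

def oriented_gaussian_target(xy_shape, center, sigmas, theta):
    """Toy oriented Gaussian ground-plane target (e.g. elongated for cattle).
    center: (cx, cy) cell, sigmas: (sx, sy) extents along the object's own
    axes, theta: object orientation in radians."""
    X, Y = xy_shape
    xs, ys = np.meshgrid(np.arange(X), np.arange(Y), indexing="ij")
    dx, dy = xs - center[0], ys - center[1]
    # Rotate the offsets into the object's local frame before the Gaussian.
    lx = np.cos(theta) * dx + np.sin(theta) * dy
    ly = -np.sin(theta) * dx + np.cos(theta) * dy
    return np.exp(-0.5 * ((lx / sigmas[0]) ** 2 + (ly / sigmas[1]) ** 2))
```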
Related papers
- Towards Generalizable Multi-Camera 3D Object Detection via Perspective
Debiasing [28.874014617259935]
Multi-Camera 3D Object Detection (MC3D-Det) has gained prominence with the advent of bird's-eye view (BEV) approaches.
We propose a novel method that aligns 3D detection with 2D camera plane results, ensuring consistent and accurate detections.
arXiv Detail & Related papers (2023-10-17T15:31:28Z)
- VoxDet: Voxel Learning for Novel Instance Detection [15.870525460969553]
VoxDet is a 3D geometry-aware framework for detecting unseen instances.
Our framework fully utilizes the strong 3D voxel representation and reliable voxel matching mechanism.
To the best of our knowledge, VoxDet is the first to incorporate implicit 3D knowledge for 2D novel instance detection tasks.
arXiv Detail & Related papers (2023-05-26T19:25:13Z)
- Viewpoint Equivariance for Multi-View 3D Object Detection [35.4090127133834]
State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input.
We introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry.
arXiv Detail & Related papers (2023-03-25T19:56:41Z)
- Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction [84.94140661523956]
We propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes.
We model each point in the 3D space by summing its projected features on the three planes.
Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels.
arXiv Detail & Related papers (2023-02-15T17:58:10Z)
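As a rough illustration of the TPV readout described in the entry above (a sketch under an assumed plane layout and integer indexing, not the authors' code), the feature of a 3D point is the sum of the features sampled from the three perpendicular planes:

```python
import numpy as np

def tpv_point_feature(point, top_hw, front_dw, side_dh, voxel_size=1.0):
    """Toy tri-perspective view (TPV) lookup for one 3D point.
    top_hw: (H, W, C) top plane, front_dw: (D, W, C) front plane,
    side_dh: (D, H, C) side plane; the axis order is an illustrative assumption."""
    x, y, z = (int(c // voxel_size) for c in point)  # assumed in-bounds
    # Sum the point's projected features on the three perpendicular planes.
    return top_hw[y, x] + front_dw[z, x] + side_dh[z, y]
```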
- MVTN: Learning Multi-View Transformations for 3D Understanding [60.15214023270087]
We introduce the Multi-View Transformation Network (MVTN), which uses differentiable rendering to determine optimal view-points for 3D shape recognition.
MVTN can be trained end-to-end with any multi-view network for 3D shape recognition.
Our approach demonstrates state-of-the-art performance in 3D classification and shape retrieval on several benchmarks.
arXiv Detail & Related papers (2022-12-27T12:09:16Z)
- PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student.
By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhaustive and complicated network modifications.
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
- Multiview Detection with Feature Perspective Transformation [59.34619548026885]
We propose a novel multiview detection system, MVDet.
We take an anchor-free approach to aggregate multiview information by projecting feature maps onto the ground plane.
Our entire model is end-to-end learnable and achieves 88.2% MODA on the standard Wildtrack dataset.
arXiv Detail & Related papers (2020-07-14T17:58:30Z)
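The ground-plane projection used by MVDet-style methods can be sketched as one homography warp per view; the coordinate conventions and nearest-neighbour sampling below are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def homography_warp_to_ground(fmap, H_ground_to_img, out_shape):
    """Toy perspective warp of an HxWxC feature map onto a ground-plane grid.
    H_ground_to_img: 3x3 homography mapping ground cells (i, j, 1) to pixels;
    out_shape: (Gh, Gw) ground-plane grid size."""
    Himg, Wimg, C = fmap.shape
    Gh, Gw = out_shape
    gi, gj = np.meshgrid(np.arange(Gh), np.arange(Gw), indexing="ij")
    cells = np.stack([gi.ravel(), gj.ravel(), np.ones(Gh * Gw)], axis=0)  # 3xN
    uvw = H_ground_to_img @ cells
    u = np.round(uvw[0] / uvw[2]).astype(int)
    v = np.round(uvw[1] / uvw[2]).astype(int)
    out = np.zeros((Gh * Gw, C))
    ok = (u >= 0) & (u < Wimg) & (v >= 0) & (v < Himg)
    out[ok] = fmap[v[ok], u[ok]]          # cells outside the view stay zero
    return out.reshape(Gh, Gw, C)
```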
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
- 3D Crowd Counting via Geometric Attention-guided Multi-View Fusion [50.520192402702015]
We propose to solve the multi-view crowd counting task through 3D feature fusion with 3D scene-level density maps.
Compared to 2D fusion, the 3D fusion extracts more information about people along the z-dimension (height), which helps to address the scale variations across multiple views.
The 3D density maps still preserve the 2D density maps property that the sum is the count, while also providing 3D information about the crowd density.
arXiv Detail & Related papers (2020-03-18T11:35:11Z)
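A toy construction of such a scene-level 3D density map (sigma and the isotropic Gaussian are assumptions, not the paper's settings) places one normalized blob per person so that summing the volume recovers the count:

```python
import numpy as np

def build_3d_density_map(grid_shape, person_positions, sigma=1.5):
    """Toy 3D density map: one normalized Gaussian per person, so the sum
    over the whole voxel grid equals the number of people.
    grid_shape: (X, Y, Z); person_positions: list of (x, y, z) voxel coords."""
    xs, ys, zs = np.meshgrid(*[np.arange(s) for s in grid_shape], indexing="ij")
    density = np.zeros(grid_shape)
    for px, py, pz in person_positions:
        blob = np.exp(-((xs - px) ** 2 + (ys - py) ** 2 + (zs - pz) ** 2) / (2 * sigma ** 2))
        density += blob / blob.sum()  # each person contributes exactly 1 to the sum
    return density

# Summing the density map recovers the crowd count (up to float error).
count = build_3d_density_map((32, 32, 8), [(5, 5, 2), (10, 20, 3)]).sum()  # ~2.0
```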
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.