Multiview Detection with Feature Perspective Transformation
- URL: http://arxiv.org/abs/2007.07247v2
- Date: Sat, 1 May 2021 11:15:13 GMT
- Title: Multiview Detection with Feature Perspective Transformation
- Authors: Yunzhong Hou, Liang Zheng, Stephen Gould
- Abstract summary: We propose a novel multiview detection system, MVDet.
We take an anchor-free approach to aggregate multiview information by projecting feature maps onto the ground plane.
Our entire model is end-to-end learnable and achieves 88.2% MODA on the standard Wildtrack dataset.
- Score: 59.34619548026885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Incorporating multiple camera views for detection alleviates the impact of
occlusions in crowded scenes. In a multiview system, we need to answer two
important questions when dealing with ambiguities that arise from occlusions.
First, how should we aggregate cues from the multiple views? Second, how should
we aggregate unreliable 2D and 3D spatial information that has been tainted by
occlusions? To address these questions, we propose a novel multiview detection
system, MVDet. For multiview aggregation, existing methods combine anchor box
features from the image plane, which potentially limits performance due to
inaccurate anchor box shapes and sizes. In contrast, we take an anchor-free
approach to aggregate multiview information by projecting feature maps onto the
ground plane (bird's eye view). To resolve any remaining spatial ambiguity, we
apply large kernel convolutions on the ground plane feature map and infer
locations from detection peaks. Our entire model is end-to-end learnable and
achieves 88.2% MODA on the standard Wildtrack dataset, outperforming the
state-of-the-art by 14.1%. We also provide detailed analysis of MVDet on a
newly introduced synthetic dataset, MultiviewX, which allows us to control the
level of occlusion. Code and MultiviewX dataset are available at
https://github.com/hou-yz/MVDet.
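To make the aggregation step concrete, below is a minimal PyTorch sketch of the two ideas in the abstract: per-camera feature maps are warped onto the ground plane with a perspective (homography) transform, and the concatenated ground-plane features pass through large-kernel convolutions to produce a detection heatmap whose peaks give pedestrian locations. Shapes, kernel sizes, and the normalized-homography convention are illustrative assumptions, not the authors' exact configuration; see the repository above for the real implementation.

```python
# Sketch of MVDet-style anchor-free multiview aggregation (assumption: each
# camera's homography H maps normalized ground-plane coords in [-1, 1]^2 to
# normalized image coords in [-1, 1]^2).
import torch
import torch.nn as nn
import torch.nn.functional as F

def project_to_ground(feat, H_img_from_ground, out_hw):
    """Warp an image-plane feature map onto the ground plane via grid_sample.

    feat: (B, C, Hf, Wf) feature map in image coordinates.
    H_img_from_ground: (B, 3, 3) homography, ground plane -> image plane.
    out_hw: (Hg, Wg) resolution of the ground-plane grid.
    """
    B = feat.shape[0]
    Hg, Wg = out_hw
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, Hg, device=feat.device),
        torch.linspace(-1, 1, Wg, device=feat.device),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    grid = torch.stack([xs, ys, ones], dim=-1).reshape(-1, 3)     # (Hg*Wg, 3)
    grid = grid.unsqueeze(0).expand(B, -1, -1)                    # (B, Hg*Wg, 3)
    img_pts = grid @ H_img_from_ground.transpose(1, 2)            # homogeneous image coords
    img_pts = img_pts[..., :2] / img_pts[..., 2:].clamp(min=1e-6) # perspective divide
    sample_grid = img_pts.reshape(B, Hg, Wg, 2)
    return F.grid_sample(feat, sample_grid, align_corners=True)

class GroundPlaneHead(nn.Module):
    """Fuse per-view ground-plane features and predict an occupancy heatmap."""
    def __init__(self, num_views, feat_dim=128):
        super().__init__()
        # Large-kernel convolutions give each ground location a wide spatial
        # context, which is how remaining spatial ambiguity is resolved.
        self.fuse = nn.Sequential(
            nn.Conv2d(num_views * feat_dim, 128, kernel_size=9, padding=4),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=9, padding=4),
        )

    def forward(self, ground_feats):        # list of (B, C, Hg, Wg), one per view
        fused = torch.cat(ground_feats, dim=1)
        return self.fuse(fused)             # (B, 1, Hg, Wg) detection heatmap

# At inference, pedestrian locations are read off as local maxima (peaks)
# of the heatmap, e.g. via max-pooling-based non-maximum suppression.
```

The wide receptive field of the ground-plane convolutions is what lets the model reason jointly over neighboring ground locations when the per-view evidence is ambiguous.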
Related papers
- Lifting Multi-View Detection and Tracking to the Bird's Eye View [5.679775668038154]
Recent advancements in multi-view detection and 3D object recognition have significantly improved performance.
We compare modern lifting methods, both parameter-free and parameterized, to multi-view aggregation.
We present an architecture that aggregates the features of multiple time steps to learn robust detection.
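A hedged sketch of the temporal idea in the last sentence: bird's-eye-view feature maps from several time steps (assumed already ego-motion aligned) are stacked along the channel axis and fused by a small convolution. The window length, channel widths, and fusion operator are illustrative guesses, not the paper's configuration.

```python
# Minimal temporal BEV fusion: concatenate T aligned BEV feature maps along
# channels and let a 3x3 conv learn how to weight them.
import torch
import torch.nn as nn

class TemporalBEVFusion(nn.Module):
    def __init__(self, num_steps=4, feat_dim=64):
        super().__init__()
        self.fuse = nn.Conv2d(num_steps * feat_dim, feat_dim,
                              kernel_size=3, padding=1)

    def forward(self, bev_feats):               # list of (B, C, H, W), oldest first
        stacked = torch.cat(bev_feats, dim=1)   # (B, T*C, H, W)
        return self.fuse(stacked)               # temporally fused BEV features
```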
arXiv Detail & Related papers (2024-03-19T09:33:07Z)
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework to fuse information of RGB images and LiDAR point clouds at the points of interest (PoIs)
Our approach maintains the view of each modality and obtains multi-modal features by computation-friendly projection and interpolation.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
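A minimal sketch of what fusion at points of interest could look like: a query's PoIs are projected into the image plane and into the BEV (LiDAR) feature map, features are bilinearly interpolated at the projections, and the per-modality samples are combined. The pixel-coordinate inputs, matching channel widths, and additive fusion here are assumptions, not PoIFusion's exact design.

```python
# Fusion at points of interest: sample each modality's feature map at the
# projected PoIs, then combine the samples per point.
import torch
import torch.nn.functional as F

def sample_at_pois(feat, pts_2d, hw):
    """Bilinearly sample feat (B, C, H, W) at pts_2d (B, N, 2) pixel coords."""
    H, W = hw
    # Normalize pixel coordinates to the [-1, 1] range grid_sample expects.
    grid = torch.stack([pts_2d[..., 0] / (W - 1),
                        pts_2d[..., 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feat, grid.unsqueeze(2), align_corners=True)
    return sampled.squeeze(-1).transpose(1, 2)           # (B, N, C)

def fuse_pois(img_feat, bev_feat, pois_img, pois_bev, img_hw, bev_hw):
    img_samples = sample_at_pois(img_feat, pois_img, img_hw)   # RGB branch
    bev_samples = sample_at_pois(bev_feat, pois_bev, bev_hw)   # LiDAR branch
    return img_samples + bev_samples     # simple additive fusion per PoI
```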
arXiv Detail & Related papers (2024-03-14T09:28:12Z)
- MMRDN: Consistent Representation for Multi-View Manipulation Relationship Detection in Object-Stacked Scenes [62.20046129613934]
We propose a novel multi-view fusion framework, namely multi-view MRD network (MMRDN)
We project the 2D data from different views into a common hidden space and fit the embeddings with a set of Von-Mises-Fisher distributions.
We select a set of $K$ Maximum Vertical Neighbors (KMVN) points from the point cloud of each object pair, which encodes the relative position of these two objects.
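The von Mises-Fisher step can be sketched concisely: embeddings are L2-normalized onto the unit sphere, and the distribution's mean direction $\mu$ and concentration $\kappa$ are estimated in closed form with the standard approximation of Banerjee et al. Everything beyond this generic vMF fit (how MMRDN actually uses the distributions) is omitted here.

```python
# Fit a von Mises-Fisher distribution to a batch of embedding vectors.
import torch

def fit_vmf(embeddings):
    """embeddings: (N, d) raw vectors; returns (mu, kappa)."""
    x = torch.nn.functional.normalize(embeddings, dim=1)   # project to unit sphere
    mean = x.mean(dim=0)
    r = mean.norm()                       # resultant length in [0, 1]
    mu = mean / r.clamp(min=1e-8)         # mean direction
    d = x.shape[1]
    # Banerjee et al.'s approximation: kappa ~= r (d - r^2) / (1 - r^2)
    kappa = r * (d - r ** 2) / (1 - r ** 2).clamp(min=1e-8)
    return mu, kappa
```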
arXiv Detail & Related papers (2023-04-25T05:55:29Z)
- 3M3D: Multi-view, Multi-path, Multi-representation for 3D Object Detection [0.5156484100374059]
We propose 3M3D, a multi-view, multi-path, multi-representation approach to 3D object detection.
We update both multi-view features and query features to enhance the representation of the scene in both fine panoramic view and coarse global view.
We show performance improvements on nuScenes benchmark dataset on top of our baselines.
arXiv Detail & Related papers (2023-02-16T11:28:30Z)
- DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes [74.64897845999677]
We introduce a new cross-view multi-object tracking dataset for DIVerse Open scenes with densely tracked pedestrians.
Our DIVOTrack has fifteen distinct scenarios and 953 cross-view tracks, surpassing all cross-view multi-object tracking datasets currently available.
Furthermore, we provide a novel baseline cross-view tracking method with a unified joint detection and cross-view tracking framework named CrossMOT.
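As a hedged illustration of the general mechanism behind such unified baselines (not CrossMOT's exact design): detections from two views carry identity embeddings from a shared head, and cross-view correspondences are recovered by maximizing cosine similarity with the Hungarian algorithm. The threshold and embedding source are assumptions.

```python
# Cross-view identity association via a shared embedding space.
import numpy as np
from scipy.optimize import linear_sum_assignment

def cross_view_match(emb_a, emb_b, thresh=0.5):
    """emb_a: (Na, d), emb_b: (Nb, d) L2-normalized ID embeddings.
    Returns matched index pairs (i, j) across the two views."""
    sim = emb_a @ emb_b.T                       # cosine similarity matrix
    rows, cols = linear_sum_assignment(-sim)    # maximize total similarity
    return [(i, j) for i, j in zip(rows, cols) if sim[i, j] > thresh]
```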
arXiv Detail & Related papers (2023-02-15T14:10:42Z)
- MFFN: Multi-view Feature Fusion Network for Camouflaged Object Detection [10.04773536815808]
We propose a behavior-inspired framework, called Multi-view Feature Fusion Network (MFFN), which mimics the human behaviors of finding indistinct objects in images.
MFFN captures critical edge and semantic information by comparing and fusing extracted multi-view features.
Our method performs favorably against existing state-of-the-art methods via training with the same data.
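A sketch of the multi-view intuition under a simplifying assumption: the same image is observed under simple viewpoint changes (90-degree rotations stand in here for MFFN's angle and distance transformations), per-view features come from a shared backbone, are mapped back to the original frame, and are fused element-wise so that cues which persist across views are amplified. The backbone and the max fusion are illustrative choices.

```python
# Multi-view feature fusion over transformed copies of one image.
import torch
import torch.nn as nn

class MultiViewFuse(nn.Module):
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone             # any (B,3,H,W) -> (B,C,h,w) conv net

    def forward(self, img):
        feats = []
        for k in range(4):                   # four rotated views of one image
            view = torch.rot90(img, k, dims=(2, 3))
            f = self.backbone(view)
            feats.append(torch.rot90(f, -k, dims=(2, 3)))  # undo the rotation
        # Element-wise max keeps the strongest response across views.
        return torch.stack(feats, dim=0).max(dim=0).values
```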
arXiv Detail & Related papers (2022-10-12T16:12:58Z)
- Voxelized 3D Feature Aggregation for Multiview Detection [15.465855460519446]
We propose VFA, voxelized 3D feature aggregation, for feature transformation and aggregation in multi-view detection.
Specifically, we voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with these projected voxels.
This allows us to identify and then aggregate 2D features along the same vertical line, alleviating projection distortions to a large extent.
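The column-aggregation idea can be sketched in a few lines: voxel centers are projected into a camera view, 2D features are sampled at the projections, and samples that share an (x, y) column are pooled. The voxel grid, the camera projection function, and the mean pooling are illustrative assumptions.

```python
# Voxelized feature aggregation for one camera view.
import torch
import torch.nn.functional as F

def aggregate_columns(feat, voxel_centers, project_fn, img_hw):
    """feat: (1, C, Hf, Wf) image features; voxel_centers: (X, Y, Z, 3) world
    coords; project_fn maps (N, 3) world points to (N, 2) pixel coords."""
    X, Y, Z, _ = voxel_centers.shape
    pts = voxel_centers.reshape(-1, 3)
    uv = project_fn(pts)                                   # (X*Y*Z, 2) pixels
    H, W = img_hw
    grid = torch.stack([uv[:, 0] / (W - 1),
                        uv[:, 1] / (H - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feat, grid.view(1, -1, 1, 2), align_corners=True)
    sampled = sampled.view(feat.shape[1], X, Y, Z)         # (C, X, Y, Z)
    # Pool features that fall on the same vertical (x, y) column, which is
    # what counteracts the smearing of a flat ground-plane projection.
    return sampled.mean(dim=3)                             # (C, X, Y)
```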
arXiv Detail & Related papers (2021-12-07T03:38:50Z)
- Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation) [25.598840284457548]
We propose a novel multiview detector, MVDeTr, that adopts a shadow transformer to aggregate multiview information.
Unlike convolutions, shadow transformer attends differently at different positions and cameras to deal with various shadow-like distortions.
We report new state-of-the-art accuracy with the proposed system.
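A generic attention stand-in (not MVDeTr's shadow transformer) that illustrates the key property the summary describes: with camera and position embeddings added to the tokens, the aggregation weights can differ per ground-plane location and per view, unlike a convolution's shared kernel.

```python
# Per-location attention across camera views on the ground plane.
import torch
import torch.nn as nn

class ViewAttentionFuse(nn.Module):
    def __init__(self, num_views, dim=64):
        super().__init__()
        self.cam_embed = nn.Parameter(torch.zeros(num_views, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, ground_feats, pos_embed):
        # ground_feats: (B, N_views, H, W, C); pos_embed: (H, W, C)
        B, N, H, W, C = ground_feats.shape
        x = ground_feats + pos_embed + self.cam_embed.view(1, N, 1, 1, C)
        x = x.permute(0, 2, 3, 1, 4).reshape(B * H * W, N, C)  # tokens = views
        fused, _ = self.attn(x, x, x)        # weights vary per location & camera
        return fused.mean(dim=1).view(B, H, W, C)
```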
arXiv Detail & Related papers (2021-08-12T17:59:02Z)
- Wide-Area Crowd Counting: Multi-View Fusion Networks for Counting in Large Scenes [50.744452135300115]
We propose a deep neural network framework for multi-view crowd counting.
Our methods achieve state-of-the-art results compared to other multi-view counting baselines.
arXiv Detail & Related papers (2020-12-02T03:20:30Z)
- MVLidarNet: Real-Time Multi-Class Scene Understanding for Autonomous Driving Using Multiple Views [60.538802124885414]
We present Multi-View LidarNet (MVLidarNet), a two-stage deep neural network for multi-class object detection and drivable space segmentation.
MVLidarNet is able to detect and classify objects while simultaneously determining the drivable space using a single LiDAR scan as input.
We show results on both KITTI and a much larger internal dataset, thus demonstrating the method's ability to scale by an order of magnitude.
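To illustrate the "multiple views" of a single LiDAR scan, a hedged NumPy sketch of the two standard renderings such systems use: a spherical range image (perspective view, typically used for semantics) and a top-down occupancy grid (bird's eye view, typically used for detection). Resolutions, vertical field of view, and extents are illustrative assumptions, not MVLidarNet's settings.

```python
# Two views of one LiDAR point cloud: range image and BEV occupancy grid.
import numpy as np

def to_range_image(points, h=64, w=2048):
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1)
    yaw = np.arctan2(y, x)                        # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(r, 1e-6))    # elevation angle
    u = ((yaw + np.pi) / (2 * np.pi) * w).astype(int).clip(0, w - 1)
    # Assumed vertical FOV of [-25 deg, +5 deg], i.e. 30 deg (0.5236 rad) span.
    v = ((1 - (pitch + 0.4363) / 0.5236) * h).astype(int).clip(0, h - 1)
    img = np.zeros((h, w), np.float32)
    img[v, u] = r                                 # store range as pixel value
    return img

def to_bev_grid(points, res=0.2, extent=51.2):
    ix = ((points[:, 0] + extent) / res).astype(int)
    iy = ((points[:, 1] + extent) / res).astype(int)
    n = int(2 * extent / res)
    ok = (ix >= 0) & (ix < n) & (iy >= 0) & (iy < n)
    grid = np.zeros((n, n), np.float32)
    grid[iy[ok], ix[ok]] = 1.0                    # binary occupancy
    return grid
```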
arXiv Detail & Related papers (2020-06-09T21:28:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.