Where, What, Whether: Multi-modal Learning Meets Pedestrian Detection
- URL: http://arxiv.org/abs/2012.10880v1
- Date: Sun, 20 Dec 2020 10:15:39 GMT
- Title: Where, What, Whether: Multi-modal Learning Meets Pedestrian Detection
- Authors: Yan Luo, Chongyang Zhang, Muming Zhao, Hao Zhou, Jun Sun
- Abstract summary: We decompose the pedestrian detection task into Where, What and Whether.
We achieve state-of-the-art results on widely used datasets (CityPersons and Caltech).
- Score: 23.92066492219922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pedestrian detection benefits greatly from deep convolutional neural networks
(CNNs). However, it is inherently hard for CNNs to handle situations in the
presence of occlusion and scale variation. In this paper, we propose W$^3$Net,
which attempts to address the above challenges by decomposing the pedestrian
detection task into the \textbf{\textit{W}}here, \textbf{\textit{W}}hat and
\textbf{\textit{W}}hether problems, corresponding to pedestrian localization,
scale prediction and classification, respectively. Specifically, for a
pedestrian instance, we formulate its feature in three steps. i) We generate a
bird's-eye-view map, which is naturally free from occlusion issues, and scan all
points on it to look for suitable locations for each pedestrian instance. ii)
Instead of utilizing pre-fixed anchors, we model the interdependency between
depth and scale aiming at generating depth-guided scales at different locations
for better matching instances of different sizes. iii) We learn a latent vector
shared by both visual and corpus space, by which false positives with similar
vertical structure but lacking human partial features would be filtered out. We
achieve state-of-the-art results on widely used datasets (CityPersons and
Caltech). In particular, when evaluating on the heavy occlusion subset, our
results reduce MR$^{-2}$ from 49.3$\%$ to 18.7$\%$ on CityPersons, and from
45.18$\%$ to 28.33$\%$ on Caltech.
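The depth–scale interdependency in step ii) can be illustrated with the pinhole camera model, where a pedestrian's projected height shrinks inversely with depth. A minimal sketch, not the paper's actual network: the focal length, average pedestrian height, and the 0.41 aspect ratio below are illustrative assumptions.

```python
def depth_guided_scale(depth_m, focal_px=2262.0, person_height_m=1.7, aspect=0.41):
    """Pinhole-camera sketch: projected pedestrian height in pixels is
    f * H / Z, so box scale follows from depth rather than pre-fixed anchors.
    focal_px, person_height_m, and aspect are illustrative assumptions."""
    h_px = focal_px * person_height_m / depth_m  # projected height (pixels)
    w_px = aspect * h_px                         # common pedestrian aspect ratio
    return w_px, h_px

# A near instance (10 m) projects to a much larger box than a far one (40 m),
# which is why depth-guided scales match instances of different sizes.
w_near, h_near = depth_guided_scale(10.0)
w_far, h_far = depth_guided_scale(40.0)
```

Under this toy model, anchor scales at each location come for free once depth is known, instead of being enumerated over a fixed grid.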
Related papers
- GLACE: Global Local Accelerated Coordinate Encoding [66.87005863868181]
Scene coordinate regression methods are effective in small-scale scenes but face significant challenges in large-scale scenes.
We propose GLACE, which integrates pre-trained global and local encodings and enables SCR to scale to large scenes with only a single small-sized network.
Our method achieves state-of-the-art results on large-scale scenes with a low-map-size model.
arXiv Detail & Related papers (2024-06-06T17:59:50Z)
- VoxelKP: A Voxel-based Network Architecture for Human Keypoint Estimation in LiDAR Data [53.638818890966036]
VoxelKP is a novel fully sparse network architecture tailored for human keypoint estimation in LiDAR data.
We introduce sparse box-attention to focus on learning spatial correlations between keypoints within each human instance.
We incorporate a spatial encoding to leverage absolute 3D coordinates when projecting 3D voxels to a 2D grid encoding a bird's eye view.
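Projecting 3D voxels onto a bird's-eye-view grid, as described here, can be sketched as dropping the height axis and binning x/y coordinates. This is a toy occupancy version only; the ranges and cell size are illustrative assumptions, not VoxelKP's actual configuration.

```python
import numpy as np

def points_to_bev(points, x_range=(0.0, 51.2), y_range=(-25.6, 25.6), cell=0.4):
    """Toy bird's-eye-view projection: ignore z, bin x/y into a 2D occupancy
    grid. Ranges and cell size are illustrative assumptions."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((nx, ny), dtype=np.float32)
    xi = ((points[:, 0] - x_range[0]) / cell).astype(int)
    yi = ((points[:, 1] - y_range[0]) / cell).astype(int)
    keep = (xi >= 0) & (xi < nx) & (yi >= 0) & (yi < ny)  # drop out-of-range points
    bev[xi[keep], yi[keep]] = 1.0
    return bev

# One in-range point and one out-of-range point: only the first marks a cell.
bev = points_to_bev(np.array([[1.0, 0.0, 0.5], [100.0, 0.0, 0.0]]))
```

Encoding absolute 3D coordinates alongside such a grid is what lets a network recover the height information the projection discards.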
arXiv Detail & Related papers (2023-12-11T23:50:14Z)
- Graph R-CNN: Towards Accurate 3D Object Detection with Semantic-Decorated Local Graph [26.226885108862735]
Two-stage detectors have gained much popularity in 3D object detection.
Most two-stage 3D detectors utilize grid points, voxel grids, or sampled keypoints for RoI feature extraction in the second stage.
This paper addresses the limitations of such RoI feature extraction in three aspects.
arXiv Detail & Related papers (2022-08-07T02:56:56Z)
- GCNDepth: Self-supervised Monocular Depth Estimation based on Graph Convolutional Network [11.332580333969302]
This work brings a new solution with a set of improvements, which increase the quantitative and qualitative understanding of depth maps.
A graph convolutional network (GCN) can handle the convolution on non-Euclidean data and it can be applied to irregular image regions within a topological structure.
Our method provides comparable and promising results with a high prediction accuracy of 89% on the public KITTI and Make3D datasets.
arXiv Detail & Related papers (2021-12-13T16:46:25Z)
- HDNet: Human Depth Estimation for Multi-Person Camera-Space Localization [83.57863764231655]
We propose the Human Depth Estimation Network (HDNet), an end-to-end framework for absolute root joint localization.
A skeleton-based Graph Neural Network (GNN) is utilized to propagate features among joints.
We evaluate our HDNet on the root joint localization and root-relative 3D pose estimation tasks with two benchmark datasets.
arXiv Detail & Related papers (2020-07-17T12:44:23Z)
- Wasserstein Distances for Stereo Disparity Estimation [62.09272563885437]
Existing approaches to depth or disparity estimation output a distribution over a set of pre-defined discrete values.
This leads to inaccurate results when the true depth or disparity does not match any of these values.
We address these issues using a new neural network architecture that is capable of outputting arbitrary depth values.
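One common way to output arbitrary values from a distribution over discrete candidates is a probability-weighted mean (soft-argmax). This is a minimal sketch of that general idea, not the paper's specific architecture or loss; the candidate range and logits are illustrative.

```python
import numpy as np

def soft_argmax_disparity(logits, disp_values):
    """Continuous disparity as the expectation over discrete candidates:
    instead of picking the single most likely pre-defined value, take the
    probability-weighted mean, which can land between grid values."""
    p = np.exp(logits - logits.max())  # numerically stable softmax
    p /= p.sum()
    return float(np.sum(p * disp_values))

disp_values = np.arange(0, 64, dtype=float)   # pre-defined discrete candidates
logits = -0.5 * (disp_values - 12.3) ** 2     # scores peaked near a non-grid value
d = soft_argmax_disparity(logits, disp_values)
```

Because the output is an expectation rather than a class index, the predicted disparity here settles near 12.3 even though no candidate equals it.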
arXiv Detail & Related papers (2020-07-06T21:37:50Z)
- Coherent Reconstruction of Multiple Humans from a Single Image [68.3319089392548]
In this work, we address the problem of multi-person 3D pose estimation from a single image.
A typical regression approach in the top-down setting of this problem would first detect all humans and then reconstruct each one of them independently.
Our goal is to train a single network that learns to avoid these problems and generate a coherent 3D reconstruction of all the humans in the scene.
arXiv Detail & Related papers (2020-06-15T17:51:45Z)
- Disp R-CNN: Stereo 3D Object Detection via Shape Prior Guided Instance Disparity Estimation [51.17232267143098]
We propose a novel system named Disp R-CNN for 3D object detection from stereo images.
We use a statistical shape model to generate dense disparity pseudo-ground-truth without the need of LiDAR point clouds.
Experiments on the KITTI dataset show that, even when LiDAR ground-truth is not available at training time, Disp R-CNN achieves competitive performance and outperforms previous state-of-the-art methods by 20% in terms of average precision.
arXiv Detail & Related papers (2020-04-07T17:48:45Z)
- DELTAS: Depth Estimation by Learning Triangulation And densification of Sparse points [14.254472131009653]
Multi-view stereo (MVS) is the golden mean between the accuracy of active depth sensing and the practicality of monocular depth estimation.
Cost volume based approaches employing 3D convolutional neural networks (CNNs) have considerably improved the accuracy of MVS systems.
We propose an efficient depth estimation approach by first (a) detecting and evaluating descriptors for interest points, then (b) learning to match and triangulate a small set of interest points, and finally (c) densifying this sparse set of 3D points using CNNs.
arXiv Detail & Related papers (2020-03-19T17:56:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.