Efficient Human Pose Estimation by Learning Deeply Aggregated
Representations
- URL: http://arxiv.org/abs/2012.07033v2
- Date: Tue, 15 Dec 2020 02:48:52 GMT
- Title: Efficient Human Pose Estimation by Learning Deeply Aggregated
Representations
- Authors: Zhengxiong Luo, Zhicheng Wang, Yuanhao Cai, Guanan Wang, Yan Huang,
Liang Wang, Erjin Zhou, Tieniu Tan, Jian Sun
- Abstract summary: We propose an efficient human pose estimation network (DANet) by learning deeply aggregated representations.
Our networks could achieve comparable or even better accuracy with much smaller model complexity.
- Score: 67.24496300046255
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose an efficient human pose estimation network (DANet)
by learning deeply aggregated representations. Most existing models explore
multi-scale information mainly from features with different spatial sizes.
Powerful multi-scale representations usually rely on the cascaded pyramid
framework. This framework largely boosts performance but meanwhile makes
networks very deep and complex. Instead, we focus on exploiting
multi-scale information from layers with different receptive-field sizes and
then making full use of this information by improving the fusion method.
Specifically, we propose an orthogonal attention block (OAB) and a second-order
fusion unit (SFU). The OAB learns multi-scale information from different layers
and enhances them by encouraging them to be diverse. The SFU adaptively selects
and fuses diverse multi-scale information and suppresses the redundant parts. This
could maximize the effective information in final fused representations. With
the help of OAB and SFU, our single pyramid network may be able to generate
deeply aggregated representations that contain even richer multi-scale
information and have a larger representational capacity than that of cascaded
networks. Thus, our networks could achieve comparable or even better accuracy
with much smaller model complexity. Specifically, our \mbox{DANet-72} achieves
an AP score of $70.5$ on the COCO test-dev set with only $1.0G$ FLOPs, and runs
at $58$ Persons-Per-Second~(PPS) on a CPU platform.
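The abstract does not give the SFU's exact formulation. As a rough, hypothetical sketch of the general idea it describes, adaptively weighting multi-scale branches and suppressing redundant ones, the snippet below fuses branch features with softmax-normalized scores. The function names, the softmax gating, and the use of NumPy are all assumptions for illustration, not the paper's actual method:

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def second_order_style_fusion(features, scores):
    """Fuse multi-scale branch features with per-branch weights.

    features: list of (C, H, W) arrays from branches with different
              receptive-field sizes.
    scores:   per-branch logits (in a real network these would come
              from a small learned gating sub-network, omitted here).
    Branches with low scores are down-weighted, mimicking the idea of
    selecting informative branches and suppressing redundant ones.
    """
    stacked = np.stack(features)              # (B, C, H, W)
    weights = softmax(np.asarray(scores))     # (B,)
    # Weighted sum over the branch axis -> fused (C, H, W) map.
    return np.tensordot(weights, stacked, axes=1)
```

Here the per-branch scores stand in for the output of a learned selection mechanism; in the actual SFU the weighting is learned end-to-end rather than supplied by hand.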
Related papers
- Multimodality Helps Few-Shot 3D Point Cloud Semantic Segmentation [61.91492500828508]
Few-shot 3D point cloud segmentation (FS-PCS) aims at generalizing models to segment novel categories with minimal support samples.
We introduce a cost-free multimodal FS-PCS setup, utilizing textual labels and the potentially available 2D image modality.
We propose a simple yet effective Test-time Adaptive Cross-modal Seg (TACC) technique to mitigate training bias.
arXiv Detail & Related papers (2024-10-29T19:28:41Z)
- Cooperation Learning Enhanced Colonic Polyp Segmentation Based on Transformer-CNN Fusion [21.6402447417878]
We propose a hybrid network called Fusion-Transformer-HardNetMSEG (i.e., Fu-TransHNet) in this study.
Fu-TransHNet uses deep learning of different mechanisms to fuse each other and is enhanced with multi-view collaborative learning techniques.
Experimental results showed that the Fu-TransHNet network was superior to the existing methods on five widely used benchmark datasets.
arXiv Detail & Related papers (2023-01-17T13:58:17Z)
- Multi-modal land cover mapping of remote sensing images using pyramid attention and gated fusion networks [20.66034058363032]
We propose a new multi-modality network for land cover mapping of multi-modal remote sensing data, based on a novel pyramid attention fusion (PAF) module and a gated fusion unit (GFU).
The PAF module is designed to efficiently obtain rich fine-grained contextual representations from each modality with a built-in cross-level and cross-view attention fusion mechanism.
The GFU module utilizes a novel gating mechanism for early merging of features, thereby diminishing hidden redundancies and noise.
arXiv Detail & Related papers (2021-11-06T10:01:01Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for image-based person Re-ID with high performance.
This work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation for many applications, such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
- CNN based Multistage Gated Average Fusion (MGAF) for Human Action Recognition Using Depth and Inertial Sensors [1.52292571922932]
A Convolutional Neural Network (CNN) makes it possible to extract and fuse features from all layers of its architecture.
We propose a novel Multistage Gated Average Fusion (MGAF) network which extracts and fuses features from all layers of a CNN.
arXiv Detail & Related papers (2020-10-29T11:49:13Z)
- $P^2$ Net: Augmented Parallel-Pyramid Net for Attention Guided Pose Estimation [69.25492391672064]
We propose an augmented Parallel-Pyramid Net with feature refinement by dilated bottleneck and attention module.
A parallel-pyramid structure is followed to compensate for the information loss introduced by the network.
Our method achieves the best performance on the challenging MSCOCO and MPII datasets.
arXiv Detail & Related papers (2020-10-26T02:10:12Z)
- Bifurcated backbone strategy for RGB-D salient object detection [168.19708737906618]
We leverage the inherent multi-modal and multi-level nature of RGB-D salient object detection to devise a novel cascaded refinement network.
Our architecture, named Bifurcated Backbone Strategy Network (BBS-Net), is simple, efficient, and backbone-independent.
arXiv Detail & Related papers (2020-07-06T13:01:30Z)
- Multi-organ Segmentation over Partially Labeled Datasets with Multi-scale Feature Abstraction [14.92032083210668]
A shortage of fully annotated datasets has been a limiting factor in developing deep learning based image segmentation algorithms.
We propose a unified training strategy that enables a novel multi-scale deep neural network to be trained on multiple partially labeled datasets.
arXiv Detail & Related papers (2020-01-01T13:51:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.