Exploit the potential of Multi-column architecture for Crowd Counting
- URL: http://arxiv.org/abs/2007.05779v2
- Date: Tue, 28 Jul 2020 09:52:38 GMT
- Title: Exploit the potential of Multi-column architecture for Crowd Counting
- Authors: Junhao Cheng, Zhuojun Chen, XinYu Zhang, Yizhou Li, Xiaoyuan Jing
- Abstract summary: We propose a novel crowd counting framework called Pyramid Scale Network (PSNet).
To address the scale limitation, we adopt three Pyramid Scale Modules (PSM) to efficiently capture multi-scale features.
To address feature similarity, a novel loss function named Multi-column variance loss is introduced to make the features learned by each column appropriately different from each other.
- Score: 16.186589975116387
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Crowd counting is an important yet challenging task in computer vision due to
severe occlusions, complex backgrounds, large scale variations, and other factors.
Multi-column architecture is widely adopted to overcome these challenges,
yielding state-of-the-art performance on many public benchmarks. However, two
issues remain in such a design: scale limitation and feature similarity.
Further performance improvements are thus restricted. In this paper, we propose
a novel crowd counting framework called Pyramid Scale Network (PSNet) to
explicitly address these issues. Specifically, for scale limitation, we adopt
three Pyramid Scale Modules (PSM) to efficiently capture multi-scale features,
which integrate a message passing mechanism and an attention mechanism into
multi-column architecture. Moreover, for feature similarity, a novel loss
function named Multi-column variance loss is introduced to make the features
learned by each column in PSM appropriately different from each other. To the
best of our knowledge, PSNet is the first work to explicitly address scale
limitation and feature similarity in multi-column design. Extensive experiments
on five benchmark datasets demonstrate the effectiveness of the proposed
innovations as well as the superior performance over the state-of-the-art. Our
code is publicly available at: https://github.com/oahunc/Pyramid_Scale_Network
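To make the two ideas above concrete, the following is a minimal, illustrative sketch of a multi-column block with per-column dilated convolutions, a simple channel-attention fusion, and a variance-style diversity loss between columns. It is not the authors' implementation (see the repository linked above for the official code); the names MultiColumnBlock and multi_column_variance_loss, the dilation rates, the attention form, and the exact loss formulation are all assumptions made for illustration.

```python
# Illustrative sketch only -- not the official PSNet code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiColumnBlock(nn.Module):
    """Parallel columns with different dilation rates to capture multiple scales."""

    def __init__(self, channels: int, dilations=(1, 2, 3)):
        super().__init__()
        # One 3x3 convolution per column; padding = dilation keeps the spatial size.
        self.columns = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        # Simple channel attention used to re-weight the fused multi-scale features.
        self.attention = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        column_feats = [F.relu(conv(x)) for conv in self.columns]  # one map per column
        fused = torch.stack(column_feats, dim=0).sum(dim=0)        # merge the columns
        fused = fused * self.attention(fused)                      # channel re-weighting
        return fused, column_feats


def multi_column_variance_loss(column_feats, margin: float = 1.0):
    """Penalize columns whose features are too similar to each other.

    A hinge on pairwise cosine similarity is used here as a stand-in for the
    paper's Multi-column variance loss; the true formulation may differ.
    """
    loss = column_feats[0].new_zeros(())
    pairs = 0
    for i in range(len(column_feats)):
        for j in range(i + 1, len(column_feats)):
            a = column_feats[i].flatten(start_dim=1)
            b = column_feats[j].flatten(start_dim=1)
            sim = F.cosine_similarity(a, b, dim=1).mean()
            loss = loss + F.relu(sim - (1.0 - margin))
            pairs += 1
    return loss / max(pairs, 1)


if __name__ == "__main__":
    block = MultiColumnBlock(channels=32)
    x = torch.randn(2, 32, 64, 64)
    fused, feats = block(x)
    diversity = multi_column_variance_loss(feats)
    print(fused.shape, float(diversity))
```

In a full training setup, such a diversity term would typically be added to the usual density-map regression loss with a small weighting coefficient; the paper's actual formulation and weighting may differ.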
Related papers
- AsCAN: Asymmetric Convolution-Attention Networks for Efficient Recognition and Generation [48.82264764771652]
We introduce AsCAN, a hybrid architecture combining both convolutional and transformer blocks.
AsCAN supports a variety of tasks: recognition, segmentation, class-conditional image generation.
We then scale the same architecture to solve a large-scale text-to-image task and show state-of-the-art performance.
arXiv Detail & Related papers (2024-11-07T18:43:17Z)
- Bilateral Network with Residual U-blocks and Dual-Guided Attention for Real-time Semantic Segmentation [18.393208069320362]
We design a new fusion mechanism for two-branch architectures that is guided by attention computation.
To be precise, we use the Dual-Guided Attention (DGA) module we proposed to replace some multi-scale transformations.
Experiments on the Cityscapes and CamVid datasets show the effectiveness of our method.
arXiv Detail & Related papers (2023-10-31T09:20:59Z)
- General-Purpose Multimodal Transformer meets Remote Sensing Semantic Segmentation [35.100738362291416]
Multimodal AI seeks to exploit complementary data sources, particularly for complex tasks like semantic segmentation.
Recent trends in general-purpose multimodal networks have shown great potential to achieve state-of-the-art performance.
We propose a UNet-inspired module that employs 3D convolution to encode vital local information and learn cross-modal features simultaneously.
arXiv Detail & Related papers (2023-07-07T04:58:34Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model that improves classification capacity for the multivariate time series classification task.
It has three merits: (1) it learns hierarchical multi-scale representations from time series data, (2) it inherits the strengths of both transformers and convolutional networks, and (3) it tackles the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- Towards efficient feature sharing in MIMO architectures [102.40140369542755]
Multi-input multi-output architectures propose to train multiple subnetworks within one base network and then average the subnetwork predictions to benefit from ensembling for free.
Despite some relative success, these architectures are wasteful in their use of parameters.
We highlight in this paper that the learned subnetworks fail to share even generic features, which limits their applicability on smaller mobile and AR/VR devices.
arXiv Detail & Related papers (2022-05-20T12:33:34Z)
- Multi-level Second-order Few-shot Learning [111.0648869396828]
We propose a Multi-level Second-order (MlSo) few-shot learning network for supervised or unsupervised few-shot image classification and few-shot action recognition.
We leverage so-called power-normalized second-order base learner streams combined with features that express multiple levels of visual abstraction.
We demonstrate respectable results on standard datasets such as Omniglot, mini-ImageNet, tiered-ImageNet, Open MIC, fine-grained datasets such as CUB Birds, Stanford Dogs and Cars, and action recognition datasets such as HMDB51, UCF101, and mini-MIT.
arXiv Detail & Related papers (2022-01-15T19:49:00Z)
- Query-by-Example Keyword Spotting system using Multi-head Attention and Softtriple Loss [1.179778723980276]
This paper proposes a neural network architecture for tackling the query-by-example user-defined keyword spotting task.
A multi-head attention module is added on top of a multi-layered GRU for effective feature extraction.
We also adopt the softtriple loss - a combination of triplet loss and softmax loss - and showcase its effectiveness.
arXiv Detail & Related papers (2021-02-14T03:37:37Z)
- Efficient Human Pose Estimation by Learning Deeply Aggregated Representations [67.24496300046255]
We propose an efficient human pose estimation network (DANet) by learning deeply aggregated representations.
Our networks could achieve comparable or even better accuracy with much smaller model complexity.
arXiv Detail & Related papers (2020-12-13T10:58:07Z)
- ResNeSt: Split-Attention Networks [86.25490825631763]
We present a modularized architecture that applies channel-wise attention across different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations.
Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification.
arXiv Detail & Related papers (2020-04-19T20:40:31Z)
- DFNet: Discriminative feature extraction and integration network for salient object detection [6.959742268104327]
We focus on two challenges in saliency detection using Convolutional Neural Networks.
Firstly, since salient objects appear at various sizes, single-scale convolution cannot capture the right scale.
Secondly, using multi-level features helps the model use both local and global context.
arXiv Detail & Related papers (2020-04-03T13:56:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.