Boosting Crowd Counting via Multifaceted Attention
- URL: http://arxiv.org/abs/2203.02636v1
- Date: Sat, 5 Mar 2022 01:36:43 GMT
- Title: Boosting Crowd Counting via Multifaceted Attention
- Authors: Hui Lin and Zhiheng Ma and Rongrong Ji and Yaowei Wang and Xiaopeng
Hong
- Abstract summary: Large-scale variations often exist within crowd images.
Neither the fixed-size convolution kernels of CNNs nor the fixed-size attention of recent vision transformers can handle this kind of variation.
We propose a Multifaceted Attention Network (MAN) to improve transformer models in local spatial relation encoding.
- Score: 109.89185492364386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper focuses on the challenging crowd counting task. As large-scale
variations often exist within crowd images, neither the fixed-size convolution
kernels of CNNs nor the fixed-size attention of recent vision transformers can
handle this kind of variation well. To address this problem, we propose a
Multifaceted Attention Network (MAN) to improve transformer models in local
spatial relation encoding. MAN incorporates global attention from a vanilla
transformer, learnable local attention, and instance attention into a counting
model. Firstly, the local Learnable Region Attention (LRA) is proposed to
dynamically assign attention exclusively to each feature location. Secondly,
we design the Local Attention Regularization to supervise the training of LRA
by minimizing the deviation among the attention for different feature
locations. Finally, we provide an Instance Attention mechanism to focus on the
most important instances dynamically during training. Extensive experiments on
four challenging crowd counting datasets, namely ShanghaiTech, UCF-QNRF, JHU++,
and NWPU, have validated the proposed method. Codes:
https://github.com/LoraLinH/Boosting-Crowd-Counting-via-Multifaceted-Attention.
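The abstract names three attention components but gives no implementation detail. Below is a minimal, hypothetical PyTorch sketch of how a counting loss might be combined with a Local-Attention-Regularization-style term and an instance-attention weighting; all tensor shapes, names, and weighting factors are illustrative assumptions, not the authors' released code (see the repository above for the actual implementation).

```python
# Hypothetical sketch (not the authors' code) of the loss composition
# described in the abstract: counting loss + local attention regularization
# + instance-attention weighting of per-instance errors.
import torch
import torch.nn.functional as F

def local_attention_regularization(attn: torch.Tensor) -> torch.Tensor:
    """Penalize deviation among the local attention maps of different
    feature locations; shape assumed to be [batch, locations, window]."""
    mean_map = attn.mean(dim=1, keepdim=True)       # average map over locations
    return ((attn - mean_map) ** 2).mean()          # mean squared deviation

def instance_attention_loss(inst_err: torch.Tensor,
                            inst_scores: torch.Tensor) -> torch.Tensor:
    """Re-weight per-instance counting errors so training focuses on the
    instances currently deemed most important; shapes assumed [batch, n]."""
    weights = F.softmax(inst_scores, dim=-1)
    return (weights * inst_err).sum(dim=-1).mean()

def total_loss(count_loss: torch.Tensor, attn: torch.Tensor,
               inst_err: torch.Tensor, inst_scores: torch.Tensor,
               lam_reg: float = 0.1, lam_inst: float = 1.0) -> torch.Tensor:
    """Combine the terms; lam_reg and lam_inst are placeholder weights."""
    return (count_loss
            + lam_reg * local_attention_regularization(attn)
            + lam_inst * instance_attention_loss(inst_err, inst_scores))
```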
Related papers
- A Primal-Dual Framework for Transformers and Neural Networks [52.814467832108875]
Self-attention is key to the remarkable success of transformers in sequence modeling tasks.
We show that the self-attention corresponds to the support vector expansion derived from a support vector regression problem.
We propose two new attentions: Batch Normalized Attention (Attention-BN) and Attention with Scaled Head (Attention-SH).
arXiv Detail & Related papers (2024-06-19T19:11:22Z)
- Diffusion-based Data Augmentation for Object Counting Problems [62.63346162144445]
We develop a pipeline that utilizes a diffusion model to generate extensive training data.
We are the first to generate images conditioned on a location dot map with a diffusion model.
Our proposed counting loss for the diffusion model effectively minimizes the discrepancies between the location dot map and the crowd images generated.
arXiv Detail & Related papers (2024-01-25T07:28:22Z)
- Preserving Locality in Vision Transformers for Class Incremental Learning [54.696808348218426]
We find that when the ViT is incrementally trained, the attention layers gradually lose concentration on local features.
We devise a Locality-Preserved Attention layer to emphasize the importance of local features.
The improved model gets consistently better performance on CIFAR100 and ImageNet100.
arXiv Detail & Related papers (2023-04-14T07:42:21Z)
- CrowdFormer: Weakly-supervised Crowd counting with Improved Generalizability [2.8174125805742416]
We propose a weakly-supervised method for crowd counting using a pyramid vision transformer.
Our method is comparable to the state-of-the-art on the benchmark crowd datasets.
arXiv Detail & Related papers (2022-03-07T23:10:40Z)
- Reinforcing Local Feature Representation for Weakly-Supervised Dense Crowd Counting [21.26385035473938]
We propose a self-adaptive feature similarity learning network and a global-local consistency loss to reinforce local representation.
Our proposed method based on different backbones narrows the gap between weakly-supervised and fully-supervised dense crowd counting.
arXiv Detail & Related papers (2022-02-22T05:53:51Z)
- BOAT: Bilateral Local Attention Vision Transformer [70.32810772368151]
Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large.
Recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows.
We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention.
arXiv Detail & Related papers (2022-01-31T07:09:50Z)
- Scene-Adaptive Attention Network for Crowd Counting [31.29858034122248]
This paper proposes a scene-adaptive attention network, termed SAANet.
We design a deformable attention in-built Transformer backbone, which learns adaptive feature representations with deformable sampling locations and dynamic attention weights.
We conduct extensive experiments on four challenging crowd counting benchmarks, demonstrating that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-31T15:03:17Z)
- Congested Crowd Instance Localization with Dilated Convolutional Swin Transformer [119.72951028190586]
Crowd localization is a new computer vision task, evolved from crowd counting.
In this paper, we focus on how to achieve precise instance localization in high-density crowd scenes.
We propose a Dilated Convolutional Swin Transformer (DCST) for congested crowd scenes.
arXiv Detail & Related papers (2021-08-02T01:27:53Z)
- RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition [26.090419694326823]
Localization and amplification of region attention is an important factor, which has been explored extensively by convolutional neural network (CNN) based approaches.
We propose the recurrent attention multi-scale transformer (RAMS-Trans) which uses the transformer's self-attention to learn discriminative region attention.
arXiv Detail & Related papers (2021-07-17T06:22:20Z)
- Hybrid attention network based on progressive embedding scale-context for crowd counting [25.866856497266884]
We propose a Hybrid Attention Network (HAN) by employing Progressive Embedding Scale-context (PES) information.
We build the hybrid attention mechanism by running a spatial attention module and a channel attention module in parallel (a generic sketch of this parallel arrangement is shown after this list).
PES information enables the network to simultaneously suppress noise and adapt head scale variation.
arXiv Detail & Related papers (2021-06-04T08:10:21Z)
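As referenced in the last entry above, the following is a generic, illustrative PyTorch sketch of a parallel spatial/channel attention block in the spirit of HAN's hybrid attention; the layer sizes, pooling choices, and additive fusion are assumptions for illustration, not the paper's actual design.

```python
# Illustrative sketch only: a generic parallel spatial/channel attention block.
import torch
import torch.nn as nn

class ParallelHybridAttention(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention branch: squeeze spatial dims, re-weight channels.
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Spatial attention branch: collapse channels, produce a per-pixel gate.
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ca = self.channel_mlp(x)                              # [B, C, 1, 1]
        sa = self.spatial_conv(
            torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1))  # [B, 1, H, W]
        # The two branches run in parallel and are fused by summation here;
        # the fusion actually used by HAN may differ.
        return x * ca + x * sa
```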
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.