Joint CNN and Transformer Network via Weakly Supervised Learning for
Efficient Crowd Counting
- URL: http://arxiv.org/abs/2203.06388v1
- Date: Sat, 12 Mar 2022 09:40:29 GMT
- Title: Joint CNN and Transformer Network via Weakly Supervised Learning for
Efficient Crowd Counting
- Authors: Fusen Wang, Kai Liu, Fei Long, Nong Sang, Xiaofeng Xia, Jun Sang
- Abstract summary: We propose a Joint CNN and Transformer Network (JCTNet) via weakly supervised learning for crowd counting.
JCTNet can effectively focus on the crowd regions and obtain superior weakly supervised counting performance on five mainstream datasets.
- Score: 22.040942519355628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Currently, for crowd counting, the fully supervised methods via density map
estimation are the mainstream research directions. However, such methods need
location-level annotation of persons in an image, which is time-consuming and
laborious. Therefore, weakly supervised methods that rely only on count-level
annotations are urgently needed. Since CNNs are not well suited to modeling the
global context and the interactions between image patches, weakly supervised
crowd counting via CNNs generally cannot achieve good performance. Weakly
supervised models based on Transformers were subsequently proposed to model the
global context and learn contrast features. However, the Transformer directly
partitions the crowd image into a series of tokens, which may not be a good
choice because each pedestrian is an independent individual, and the number of
network parameters is very large. Hence, we
propose a Joint CNN and Transformer Network (JCTNet) via weakly supervised
learning for crowd counting in this paper. JCTNet consists of three parts: CNN
feature extraction module (CFM), Transformer feature extraction module (TFM),
and counting regression module (CRM). In particular, CFM extracts crowd
semantic features, whose patch partitions are then sent to TFM to model the
global context, and CRM predicts the number of people.
Extensive experiments and visualizations demonstrate that JCTNet can
effectively focus on the crowd regions and obtain superior weakly supervised
counting performance on five mainstream datasets. The number of parameters of
the model can be reduced by about 67%~73% compared with the pure Transformer
works. We also tried to explain the phenomenon that a model constrained only by
count-level annotations can still focus on the crowd regions. We believe our
work can promote further research in this field.
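The abstract's three-stage pipeline (CFM features, patch tokenization for the Transformer, count regression) can be sketched as a minimal NumPy toy. The module names come from the abstract, but the internals below are illustrative placeholders, not the paper's actual layers; only the patchify step reflects the described token construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def patchify(feature_map, patch=4):
    """Split a CNN feature map (C, H, W) into flattened patch tokens.

    Returns an array of shape (num_patches, patch*patch*C) -- the
    token sequence a Transformer module would consume.
    """
    c, h, w = feature_map.shape
    assert h % patch == 0 and w % patch == 0
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append(feature_map[:, i:i+patch, j:j+patch].ravel())
    return np.stack(tokens)

# Toy stand-ins for the three modules (hypothetical shapes and weights).
features = rng.standard_normal((8, 16, 16))   # CFM output: C=8, 16x16 map
tokens = patchify(features, patch=4)          # TFM input: 16 tokens of dim 128
w_reg = rng.standard_normal(tokens.shape[1])  # CRM: a single linear head
count = float(np.mean(tokens @ w_reg))        # scalar crowd count

print(tokens.shape)  # (16, 128)
```

Under count-level supervision, the only training signal would be the error between this scalar count and the ground-truth count, with no location annotations involved.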
Related papers
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision
Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
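The cluster-then-aggregate idea above can be illustrated with a small NumPy k-means sketch: key tokens are grouped by content and replaced by their centroids, so attention runs over far fewer tokens. This is a simplified stand-in under assumed shapes; ClusTR's actual clustering procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(1)

def cluster_tokens(tokens, n_clusters=8, iters=10):
    """Reduce a token sequence by k-means clustering and centroid
    aggregation (a toy version of content-based key/value reduction)."""
    centroids = tokens[rng.choice(len(tokens), n_clusters, replace=False)]
    for _ in range(iters):
        # assign each token to its nearest centroid
        d = ((tokens[:, None, :] - centroids[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        # recompute each centroid as the mean of its assigned tokens
        for k in range(n_clusters):
            mask = assign == k
            if mask.any():
                centroids[k] = tokens[mask].mean(0)
    return centroids

keys = rng.standard_normal((256, 64))        # 256 key tokens of dim 64
reduced = cluster_tokens(keys, n_clusters=8)
print(reduced.shape)  # (8, 64)
```

With queries attending to the 8 centroids instead of all 256 keys, the attention cost drops from O(N^2) toward O(N*K), which is the efficiency gain the summary describes.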
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Magic ELF: Image Deraining Meets Association Learning and Transformer [63.761812092934576]
This paper aims to unify CNN and Transformer to take advantage of their learning merits for image deraining.
A novel multi-input attention module (MAM) is proposed to associate rain removal and background recovery.
Our proposed method (dubbed ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average.
arXiv Detail & Related papers (2022-07-21T12:50:54Z) - Cross-receptive Focused Inference Network for Lightweight Image
Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks.
However, Transformers that need to incorporate contextual information to extract features dynamically have been neglected.
We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixed with CNN and Transformer.
arXiv Detail & Related papers (2022-07-06T16:32:29Z) - CrowdFormer: Weakly-supervised Crowd counting with Improved
Generalizability [2.8174125805742416]
We propose a weakly-supervised method for crowd counting using a pyramid vision transformer.
Our method is comparable to the state-of-the-art on the benchmark crowd datasets.
arXiv Detail & Related papers (2022-03-07T23:10:40Z) - CCTrans: Simplifying and Improving Crowd Counting with Transformer [7.597392692171026]
We propose a simple approach called CCTrans to simplify the design pipeline.
Specifically, we utilize a pyramid vision transformer backbone to capture the global crowd information.
Our method achieves new state-of-the-art results on several benchmarks both in weakly and fully-supervised crowd counting.
arXiv Detail & Related papers (2021-09-29T15:13:10Z) - Container: Context Aggregation Network [83.12004501984043]
Recent findings show that a simple solution without any traditional convolutional or Transformer components can produce effective visual representations.
We present CONTAINER (CONText AggregatIon NEtwoRk), a general-purpose building block for multi-head context aggregation.
In contrast to Transformer-based methods that do not scale well to downstream tasks that rely on larger input image resolutions, our efficient network, CONTAINER-LIGHT, can be employed in object detection and instance segmentation networks.
arXiv Detail & Related papers (2021-06-02T18:09:11Z) - TransCrowd: Weakly-Supervised Crowd Counting with Transformer [56.84516562735186]
We propose TransCrowd, which reformulates the weakly-supervised crowd counting problem from the perspective of sequence-to-count based on Transformer.
Experiments on five benchmark datasets demonstrate that the proposed TransCrowd achieves superior performance compared with all the weakly-supervised CNN-based counting methods.
arXiv Detail & Related papers (2021-04-19T08:12:50Z) - Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z) - PSCNet: Pyramidal Scale and Global Context Guided Network for Crowd
Counting [44.306790250158954]
This paper proposes a novel crowd counting approach based on a pyramidal scale module (PSM) and a global context module (GCM).
PSM is used to adaptively capture multi-scale information, which can identify a fine boundary of crowds with different image scales.
GCM is devised with low-complexity and lightweight manner, to make the interactive information across the channels of the feature maps more efficient.
arXiv Detail & Related papers (2020-12-07T11:35:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.