CCTrans: Simplifying and Improving Crowd Counting with Transformer
- URL: http://arxiv.org/abs/2109.14483v1
- Date: Wed, 29 Sep 2021 15:13:10 GMT
- Title: CCTrans: Simplifying and Improving Crowd Counting with Transformer
- Authors: Ye Tian, Xiangxiang Chu, Hongpeng Wang
- Abstract summary: We propose a simple approach called CCTrans to simplify the design pipeline.
Specifically, we utilize a pyramid vision transformer backbone to capture the global crowd information.
Our method achieves new state-of-the-art results on several benchmarks both in weakly and fully-supervised crowd counting.
- Score: 7.597392692171026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most recent methods used for crowd counting are based on the convolutional
neural network (CNN), which has a strong ability to extract local features.
However, CNNs inherently fail to model the global context because of their
limited receptive fields, whereas the transformer can model it easily.
In this paper, we propose a simple approach called CCTrans to simplify the
design pipeline. Specifically, we utilize a pyramid vision transformer backbone
to capture the global crowd information, a pyramid feature aggregation (PFA)
model to combine low-level and high-level features, and an efficient regression
head with multi-scale dilated convolution (MDC) to predict density maps.
Besides, we tailor the loss functions for our pipeline. Without bells and
whistles, extensive experiments demonstrate that our method achieves new
state-of-the-art results on several benchmarks both in weakly and
fully-supervised crowd counting. Moreover, we currently rank No.1 on the
leaderboard of NWPU-Crowd. Our code will be made available.
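The abstract mentions a regression head with multi-scale dilated convolution (MDC) that predicts density maps. As a rough illustration only, and not the authors' implementation, the pure-Python sketch below shows how dilation widens a kernel's receptive field without adding weights, how several dilation rates can run in parallel, and how a count is read off a density map by summing it. The function names, the dilation rates (1, 2, 3), and the choice to average the branches are all assumptions made for this example.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv2d(image, kernel, dilation):
    """Same-size 2D cross-correlation with zero padding.

    Assumes a square, odd-sized kernel; taps are spaced `dilation` pixels apart.
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = y + (i - r) * dilation
                    xx = x + (j - r) * dilation
                    # Positions outside the image count as zero padding.
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def multi_scale_dilated(image, kernel, dilations=(1, 2, 3)):
    """Run parallel branches with different dilations and average them.

    Averaging is an assumption of this sketch, not a detail from the paper.
    """
    branches = [dilated_conv2d(image, kernel, d) for d in dilations]
    h, w = len(image), len(image[0])
    return [
        [sum(b[y][x] for b in branches) / len(branches) for x in range(w)]
        for y in range(h)
    ]


def count_from_density(density):
    """A density map integrates (sums) to the estimated number of people."""
    return sum(sum(row) for row in density)
```

For a 3x3 kernel, dilation rates 1, 2, and 3 cover 3, 5, and 7 pixels per axis respectively, which is why combining several rates lets a head respond to heads of different apparent sizes at the same parameter cost per branch.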
Related papers
- Stratified Transformer for 3D Point Cloud Segmentation [89.9698499437732]
Stratified Transformer is able to capture long-range contexts and demonstrates strong generalization ability and high performance.
To combat the challenges posed by irregular point arrangements, we propose first-layer point embedding to aggregate local information.
Experiments demonstrate the effectiveness and superiority of our method on S3DIS, ScanNetv2 and ShapeNetPart datasets.
arXiv Detail & Related papers (2022-03-28T05:35:16Z)
- Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting [22.040942519355628]
We propose a Joint CNN and Transformer Network (JCTNet) via weakly supervised learning for crowd counting.
JCTNet can effectively focus on the crowd regions and obtain superior weakly supervised counting performance on five mainstream datasets.
arXiv Detail & Related papers (2022-03-12T09:40:29Z)
- CrowdFormer: Weakly-supervised Crowd counting with Improved Generalizability [2.8174125805742416]
We propose a weakly-supervised method for crowd counting using a pyramid vision transformer.
Our method is comparable to the state-of-the-art on the benchmark crowd datasets.
arXiv Detail & Related papers (2022-03-07T23:10:40Z)
- Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, vision transformer, into salient object detection.
With the global view in very shallow layers, the transformer encoder preserves more local representations.
Our method significantly outperforms other FCN-based and transformer-based methods in five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z)
- HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers to achieve high-performance image-based person Re-ID.
This work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
- TransCrowd: Weakly-Supervised Crowd Counting with Transformer [56.84516562735186]
We propose TransCrowd, which reformulates the weakly-supervised crowd counting problem from the perspective of sequence-to-count based on Transformer.
Experiments on five benchmark datasets demonstrate that the proposed TransCrowd achieves superior performance compared with all the weakly-supervised CNN-based counting methods.
arXiv Detail & Related papers (2021-04-19T08:12:50Z)
- CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation [95.51455777713092]
Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation.
We propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation.
arXiv Detail & Related papers (2021-03-04T13:34:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.