Related papers: CCTrans: Simplifying and Improving Crowd Counting with Transformer

URL: http://arxiv.org/abs/2109.14483v1
Date: Wed, 29 Sep 2021 15:13:10 GMT
Title: CCTrans: Simplifying and Improving Crowd Counting with Transformer
Authors: Ye Tian, Xiangxiang Chu, Hongpeng Wang
Abstract summary: We propose a simple approach called CCTrans to simplify the design pipeline. Specifically, we utilize a pyramid vision transformer backbone to capture the global crowd information. Our method achieves new state-of-the-art results on several benchmarks in both weakly and fully supervised crowd counting.
Score: 7.597392692171026
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Most recent crowd counting methods are based on convolutional neural networks (CNNs), which have a strong ability to extract local features. However, CNNs inherently fail to model the global context because of their limited receptive fields, whereas transformers can model the global context easily. In this paper, we propose a simple approach called CCTrans to simplify the design pipeline. Specifically, we utilize a pyramid vision transformer backbone to capture the global crowd information, a pyramid feature aggregation (PFA) model to combine low-level and high-level features, and an efficient regression head with multi-scale dilated convolution (MDC) to predict density maps. In addition, we tailor the loss functions for our pipeline. Without bells and whistles, extensive experiments demonstrate that our method achieves new state-of-the-art results on several benchmarks in both weakly and fully supervised crowd counting. Moreover, we currently rank No.1 on the leaderboard of NWPU-Crowd. Our code will be made available.
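
To make the "multi-scale dilated convolution" and "density map" ideas in the abstract concrete, here is a minimal pure-Python sketch. It is not the CCTrans implementation: the abstract does not state the dilation rates, the kernel sizes, or how the branches are merged, so the rates (1, 2, 3), the shared 3x3 kernel, and the averaging step below are assumptions chosen for illustration, and all function names are hypothetical. The sketch also relies on the common convention in density-based crowd counting that the predicted count is the sum of the density map.

```python
def dilated_conv2d(image, kernel, dilation):
    """Zero-padded 'same' 2-D cross-correlation with a dilated square kernel.

    A k x k kernel with dilation d samples inputs d pixels apart, so a 3x3
    kernel covers a (2d + 1) x (2d + 1) window without adding weights.
    """
    h, w = len(image), len(image[0])
    k = len(kernel)  # kernel is k x k with k odd
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy = y + (i - r) * dilation
                    xx = x + (j - r) * dilation
                    # Positions outside the image count as zero padding.
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


def multi_scale_dilated(image, kernel, dilations=(1, 2, 3)):
    """Average parallel branches that differ only in dilation rate.

    Assumption: sharing one kernel and averaging are illustrative choices,
    not details taken from the paper.
    """
    branches = [dilated_conv2d(image, kernel, d) for d in dilations]
    h, w = len(image), len(image[0])
    return [
        [sum(b[y][x] for b in branches) / len(branches) for x in range(w)]
        for y in range(h)
    ]


def count_from_density(density):
    """Estimated crowd count: the sum of all density-map values."""
    return sum(sum(row) for row in density)
```

Each branch sees the same input at a different spatial spacing, which is one simple way to mix context from several scales before the final density map is summed into a count.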

Related papers

Stratified Transformer for 3D Point Cloud Segmentation [89.9698499437732]
Stratified Transformer is able to capture long-range contexts and demonstrates strong generalization ability and high performance. To combat the challenges posed by irregular point arrangements, we propose first-layer point embedding to aggregate local information. Experiments demonstrate the effectiveness and superiority of our method on S3DIS, ScanNetv2 and ShapeNetPart datasets.
arXiv Detail & Related papers (2022-03-28T05:35:16Z)
Joint CNN and Transformer Network via weakly supervised Learning for efficient crowd counting [22.040942519355628]
We propose a Joint CNN and Transformer Network (JCTNet) via weakly supervised learning for crowd counting. JCTNet can effectively focus on the crowd regions and obtain superior weakly supervised counting performance on five mainstream datasets.
arXiv Detail & Related papers (2022-03-12T09:40:29Z)
CrowdFormer: Weakly-supervised Crowd counting with Improved Generalizability [2.8174125805742416]
We propose a weakly-supervised method for crowd counting using a pyramid vision transformer. Our method is comparable to the state-of-the-art on the benchmark crowd datasets.
arXiv Detail & Related papers (2022-03-07T23:10:40Z)
Unifying Global-Local Representations in Salient Object Detection with Transformer [55.23033277636774]
We introduce a new attention-based encoder, vision transformer, into salient object detection. With the global view in very shallow layers, the transformer encoder preserves more local representations. Our method significantly outperforms other FCN-based and transformer-based methods in five benchmarks.
arXiv Detail & Related papers (2021-08-05T17:51:32Z)
HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for image-based person Re-ID with high performance. This work is the first to combine CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z)
Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity. Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
TransCrowd: Weakly-Supervised Crowd Counting with Transformer [56.84516562735186]
We propose TransCrowd, which reformulates the weakly-supervised crowd counting problem from the perspective of sequence-to-count based on Transformer. Experiments on five benchmark datasets demonstrate that the proposed TransCrowd achieves superior performance compared with all the weakly-supervised CNN-based counting methods.
arXiv Detail & Related papers (2021-04-19T08:12:50Z)
CoTr: Efficiently Bridging CNN and Transformer for 3D Medical Image Segmentation [95.51455777713092]
Convolutional neural networks (CNNs) have been the de facto standard for 3D medical image segmentation. We propose a novel framework that efficiently bridges a Convolutional neural network and a Transformer (CoTr) for accurate 3D medical image segmentation.
arXiv Detail & Related papers (2021-03-04T13:34:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.