RCCFormer: A Robust Crowd Counting Network Based on Transformer
- URL: http://arxiv.org/abs/2504.04935v1
- Date: Mon, 07 Apr 2025 11:19:05 GMT
- Title: RCCFormer: A Robust Crowd Counting Network Based on Transformer
- Authors: Peng Liu, Heng-Chao Li, Sen Lei, Nanqing Liu, Bin Feng, Xiao Wu,
- Abstract summary: This paper proposes a robust Transformer-based crowd counting network, termed RCCFormer.<n>The proposed method incorporates a Multi-level Feature Fusion Module (MFFM), which meticulously integrates features extracted at diverse stages of the backbone architecture.<n>The effectiveness of the proposed method is validated on the ShanghaiTech Part_A and Part_B, NWPU-Crowd, and QNRF datasets.
- Score: 17.02332017201233
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Crowd counting, which is a key computer vision task, has emerged as a fundamental technology in crowd analysis and public safety management. However, challenges such as scale variations and complex backgrounds significantly impact the accuracy of crowd counting. To mitigate these issues, this paper proposes a robust Transformer-based crowd counting network, termed RCCFormer, specifically designed for background suppression and scale awareness. The proposed method incorporates a Multi-level Feature Fusion Module (MFFM), which meticulously integrates features extracted at diverse stages of the backbone architecture. It establishes a strong baseline capable of capturing intricate and comprehensive feature representations, surpassing traditional baselines. Furthermore, the introduced Detail-Embedded Attention Block (DEAB) captures contextual information and local details through global self-attention and local attention along with a learnable manner for efficient fusion. This enhances the model's ability to focus on foreground regions while effectively mitigating background noise interference. Additionally, we develop an Adaptive Scale-Aware Module (ASAM), with our novel Input-dependent Deformable Convolution (IDConv) as its fundamental building block. This module dynamically adapts to changes in head target shapes and scales, significantly improving the network's capability to accommodate large-scale variations. The effectiveness of the proposed method is validated on the ShanghaiTech Part_A and Part_B, NWPU-Crowd, and QNRF datasets. The results demonstrate that our RCCFormer achieves excellent performance across all four datasets, showcasing state-of-the-art outcomes.
Related papers
- Any Image Restoration via Efficient Spatial-Frequency Degradation Adaptation [158.37640586809187]
Restoring any degraded image efficiently via just one model has become increasingly significant.
Our approach, termed AnyIR, takes a unified path that leverages inherent similarity across various degradations.
To fuse the degradation awareness and the contextualized attention, a spatial-frequency parallel fusion strategy is proposed.
arXiv Detail & Related papers (2025-04-19T09:54:46Z) - PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN)
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - FANet: Feature Amplification Network for Semantic Segmentation in Cluttered Background [9.970265640589966]
Existing deep learning approaches leave out the semantic cues that are crucial in semantic segmentation present in complex scenarios.
We propose a feature amplification network (FANet) as a backbone network that incorporates semantic information using a novel feature enhancement module at multi-stages.
Our experimental results demonstrate the state-of-the-art performance compared to existing methods.
arXiv Detail & Related papers (2024-07-12T15:57:52Z) - Band-Attention Modulated RetNet for Face Forgery Detection [44.0511745071837]
transformer networks are extensively utilized in face forgery detection due to their scalability across large datasets.
We introduce Band-Attention modulated RetNet (BAR-Net), a lightweight network designed to efficiently process extensive visual contexts.
We present the adaptive frequency Band-Attention Modulation mechanism, which treats the entire Discrete Cosine Transform spectrogram as a series of frequency bands with learnable weights.
arXiv Detail & Related papers (2024-04-09T05:11:28Z) - Diffusion-based Data Augmentation for Object Counting Problems [62.63346162144445]
We develop a pipeline that utilizes a diffusion model to generate extensive training data.
We are the first to generate images conditioned on a location dot map with a diffusion model.
Our proposed counting loss for the diffusion model effectively minimizes the discrepancies between the location dot map and the crowd images generated.
arXiv Detail & Related papers (2024-01-25T07:28:22Z) - Cross-receptive Focused Inference Network for Lightweight Image
Super-Resolution [64.25751738088015]
Transformer-based methods have shown impressive performance in single image super-resolution (SISR) tasks.
Transformers that need to incorporate contextual information to extract features dynamically are neglected.
We propose a lightweight Cross-receptive Focused Inference Network (CFIN) that consists of a cascade of CT Blocks mixed with CNN and Transformer.
arXiv Detail & Related papers (2022-07-06T16:32:29Z) - Scene-Adaptive Attention Network for Crowd Counting [31.29858034122248]
This paper proposes a scene-adaptive attention network, termed SAANet.
We design a deformable attention in-built Transformer backbone, which learns adaptive feature representations with deformable sampling locations and dynamic attention weights.
We conduct extensive experiments on four challenging crowd counting benchmarks, demonstrating that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-12-31T15:03:17Z) - PANet: Perspective-Aware Network with Dynamic Receptive Fields and
Self-Distilling Supervision for Crowd Counting [63.84828478688975]
We propose a novel perspective-aware approach called PANet to address the perspective problem.
Based on the observation that the size of the objects varies greatly in one image due to the perspective effect, we propose the dynamic receptive fields (DRF) framework.
The framework is able to adjust the receptive field by the dilated convolution parameters according to the input image, which helps the model to extract more discriminative features for each local region.
arXiv Detail & Related papers (2021-10-31T04:43:05Z) - Rethinking Global Context in Crowd Counting [70.54184500538338]
A pure transformer is used to extract features with global information from overlapping image patches.
Inspired by classification, we add a context token to the input sequence, to facilitate information exchange with tokens corresponding to image patches.
arXiv Detail & Related papers (2021-05-23T12:44:27Z) - Crowd Counting via Hierarchical Scale Recalibration Network [61.09833400167511]
We propose a novel Hierarchical Scale Recalibration Network (HSRNet) to tackle the task of crowd counting.
HSRNet models rich contextual dependencies and recalibrating multiple scale-associated information.
Our approach can ignore various noises selectively and focus on appropriate crowd scales automatically.
arXiv Detail & Related papers (2020-03-07T10:06:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.