Spatio-channel Attention Blocks for Cross-modal Crowd Counting
- URL: http://arxiv.org/abs/2210.10392v1
- Date: Wed, 19 Oct 2022 09:05:00 GMT
- Title: Spatio-channel Attention Blocks for Cross-modal Crowd Counting
- Authors: Youjia Zhang, Soyun Choi, and Sungeun Hong
- Abstract summary: Cross-modal Spatio-Channel Attention (CSCA) blocks can be easily integrated into any modality-specific architecture.
In our experiments, the proposed block consistently shows significant performance improvement across various backbone networks.
- Score: 3.441021278275805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Crowd counting research has made significant advancements in real-world
applications, but it remains a formidable challenge in cross-modal settings.
Most existing methods rely solely on the optical features of RGB images,
ignoring the feasibility of other modalities such as thermal and depth images.
The inherently significant differences between the different modalities and the
diversity of design choices for model architectures make cross-modal crowd
counting more challenging. In this paper, we propose Cross-modal Spatio-Channel
Attention (CSCA) blocks, which can be easily integrated into any
modality-specific architecture. The CSCA blocks first capture global spatial
correlations across modalities with little overhead through spatial-wise
cross-modal attention. The spatially attended cross-modal features are then
refined through adaptive channel-wise feature aggregation. In
our experiments, the proposed block consistently shows significant performance
improvement across various backbone networks, resulting in state-of-the-art
results in RGB-T and RGB-D crowd counting.
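The abstract specifies the two stages of a CSCA block but not their internals. Below is a minimal PyTorch sketch of that pattern, spatial-wise cross-modal attention followed by adaptive channel-wise gating; the module names, 1x1 projections, SE-style gate, and residual merge are illustrative assumptions, not the authors' implementation.
```python
import torch
import torch.nn as nn

class CSCABlockSketch(nn.Module):
    """Hypothetical cross-modal spatio-channel attention block.

    Stage 1: spatial-wise cross-modal attention between an RGB feature
    map and an auxiliary (e.g. thermal or depth) feature map.
    Stage 2: adaptive channel-wise aggregation (SE-style gating) of the
    attended features. A sketch of the described pattern, not the paper's code.
    """

    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        inner = channels // reduction
        # 1x1 projections keep the attention overhead low (assumption).
        self.query = nn.Conv2d(channels, inner, kernel_size=1)
        self.key = nn.Conv2d(channels, inner, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        # Channel gate deciding how much of each attended-auxiliary channel
        # to merge back into the RGB stream.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb.shape
        # --- Stage 1: spatial-wise cross-modal attention ---------------
        q = self.query(rgb).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.key(aux).flatten(2)                     # (B, C', HW)
        v = self.value(aux).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = torch.softmax(q @ k / q.shape[-1] ** 0.5, dim=-1)  # (B, HW, HW)
        attended = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        # --- Stage 2: adaptive channel-wise aggregation ----------------
        gate = self.gate(torch.cat([rgb, attended], dim=1))  # (B, C, 1, 1)
        return rgb + gate * attended

if __name__ == "__main__":
    block = CSCABlockSketch(channels=256)
    rgb = torch.randn(2, 256, 32, 32)
    thermal = torch.randn(2, 256, 32, 32)
    print(block(rgb, thermal).shape)  # torch.Size([2, 256, 32, 32])
```
In use, one such block would sit at a chosen encoder stage of a modality-specific backbone, taking same-shaped RGB and thermal (or depth) feature maps and returning a fused map of the same shape.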
Related papers
- Cross-Modality Perturbation Synergy Attack for Person Re-identification [66.48494594909123]
The main challenge in cross-modality ReID lies in effectively handling the visual differences between modalities.
Existing attack methods have primarily focused on the characteristics of the visible image modality.
This study proposes a universal perturbation attack specifically designed for cross-modality ReID.
arXiv Detail & Related papers (2024-01-18T15:56:23Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- WCCNet: Wavelet-integrated CNN with Crossmodal Rearranging Fusion for Fast Multispectral Pedestrian Detection [16.43119521684829]
We propose a novel framework named WCCNet that is able to differentially extract rich features of different spectra with lower computational complexity.
Based on the extracted features, we design a crossmodal rearranging fusion module (CMRF).
We conduct comprehensive evaluations on KAIST and FLIR benchmarks, in which WCCNet outperforms state-of-the-art methods with considerable computational efficiency and competitive accuracy.
arXiv Detail & Related papers (2023-08-02T09:35:21Z)
- MAFNet: A Multi-Attention Fusion Network for RGB-T Crowd Counting [40.4816930622052]
We propose a two-stream RGB-T crowd counting network called Multi-Attention Fusion Network (MAFNet).
In the encoder part, a Multi-Attention Fusion (MAF) module is embedded into different stages of the two modality-specific branches for cross-modal fusion.
Extensive experiments on two popular datasets show that the proposed MAFNet is effective for RGB-T crowd counting; a structural sketch follows this entry.
arXiv Detail & Related papers (2022-08-14T02:42:09Z)
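The entry describes the fusion wiring (per-stage fusion in two modality-specific branches) rather than the MAF internals. Here is a minimal PyTorch skeleton of that two-stream pattern; the stage layout, channel sizes, three-channel thermal input, and the 1x1-conv fusion stand-in are all assumptions, not MAFNet's actual modules.
```python
import torch
import torch.nn as nn

class TwoStreamFusionSketch(nn.Module):
    """Hypothetical two-stream encoder with per-stage cross-modal fusion,
    in the spirit of MAFNet's embedded MAF modules. Illustrative only."""

    def __init__(self, stage_channels=(64, 128, 256)):
        super().__init__()
        def stage(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                nn.BatchNorm2d(cout),
                nn.ReLU(inplace=True),
            )
        n = len(stage_channels)
        chans = (3,) + tuple(stage_channels)  # assumes 3-channel inputs
        self.rgb_stages = nn.ModuleList(stage(chans[i], chans[i + 1]) for i in range(n))
        self.t_stages = nn.ModuleList(stage(chans[i], chans[i + 1]) for i in range(n))
        # One fusion module per stage; a 1x1 conv over the concatenated
        # streams stands in for the (unspecified) MAF module.
        self.fusions = nn.ModuleList(
            nn.Conv2d(2 * c, c, kernel_size=1) for c in stage_channels
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        for rgb_stage, t_stage, fuse in zip(self.rgb_stages, self.t_stages, self.fusions):
            rgb, thermal = rgb_stage(rgb), t_stage(thermal)
            fused = fuse(torch.cat([rgb, thermal], dim=1))
            # Feed the fused feature back into both branches.
            rgb, thermal = rgb + fused, thermal + fused
        return fused  # deepest fused feature map
```
A counting head (e.g. a density-map regressor) would then consume the returned fused feature; that head is outside this sketch.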
- Multi-Scale Iterative Refinement Network for RGB-D Salient Object Detection [7.062058947498447]
Salient visual cues appear at various scales and resolutions in RGB images due to semantic gaps at different feature levels.
Similar salient patterns are available in cross-modal depth images as well as multi-scale versions.
We devise an attention-based fusion module (ABF) to exploit cross-modal correlation.
arXiv Detail & Related papers (2022-01-24T10:33:00Z)
- Cross-SRN: Structure-Preserving Super-Resolution Network with Cross Convolution [64.76159006851151]
It is challenging to restore low-resolution (LR) images to super-resolution (SR) images with correct and clear details.
Existing deep learning works largely neglect the inherent structural information of images.
We design a hierarchical feature exploitation network to probe and preserve structural information; a sketch of the cross-convolution idea follows this entry.
arXiv Detail & Related papers (2022-01-05T05:15:01Z)
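The entry names cross convolution but not its form. One common reading is a pair of orthogonal 1xk and kx1 convolutions whose responses are combined to emphasize edge-like structure; the sketch below assumes that reading and is illustrative only, not Cross-SRN's actual operator.
```python
import torch
import torch.nn as nn

class CrossConvSketch(nn.Module):
    """Illustrative 'cross convolution': horizontal (1xk) and vertical (kx1)
    convolutions summed, emphasizing directional, edge-like structure.
    An assumed reading of Cross-SRN's operator, for illustration."""

    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.horizontal = nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2))
        self.vertical = nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Same spatial size in and out, so the block drops into any backbone.
        return self.horizontal(x) + self.vertical(x)
```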
- MSO: Multi-Feature Space Joint Optimization Network for RGB-Infrared Person Re-Identification [35.97494894205023]
The RGB-infrared cross-modality person re-identification (ReID) task aims to recognize images of the same identity across the visible and infrared modalities.
Existing methods mainly use a two-stream architecture to eliminate the discrepancy between the two modalities in the final common feature space.
We present a novel multi-feature space joint optimization (MSO) network, which can learn modality-sharable features in both the single-modality space and the common space.
arXiv Detail & Related papers (2021-10-21T16:45:23Z)
- Hierarchical Deep CNN Feature Set-Based Representation Learning for Robust Cross-Resolution Face Recognition [59.29808528182607]
Cross-resolution face recognition (CRFR) is important in intelligent surveillance and biometric forensics.
Existing shallow learning-based and deep learning-based methods focus on mapping the HR-LR face pairs into a joint feature space.
In this study, we aim to fully exploit the multi-level deep convolutional neural network (CNN) feature set for robust CRFR.
arXiv Detail & Related papers (2021-03-25T14:03:42Z)
- Multi-Scale Cascading Network with Compact Feature Learning for RGB-Infrared Person Re-Identification [35.55895776505113]
The Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from the part level to the global level.
Cross-modality correlations can thus be efficiently explored on salient features for distinctive modality-invariant feature learning.
arXiv Detail & Related papers (2020-12-12T15:39:11Z)
- Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting [109.32927895352685]
We introduce a large-scale RGBT Crowd Counting (RGBT-CC) benchmark, which contains 2,030 pairs of RGB-thermal images with 138,389 annotated people.
To facilitate multimodal crowd counting, we propose a cross-modal collaborative representation learning framework.
Experiments conducted on the RGBT-CC benchmark demonstrate the effectiveness of our framework for RGBT crowd counting.
arXiv Detail & Related papers (2020-12-08T16:18:29Z)
- Crowd Counting via Hierarchical Scale Recalibration Network [61.09833400167511]
We propose a novel Hierarchical Scale Recalibration Network (HSRNet) to tackle the task of crowd counting.
HSRNet models rich contextual dependencies and recalibrates multiple scale-associated information.
Our approach can selectively ignore various noises and automatically focus on appropriate crowd scales; a sketch of the scale-recalibration idea follows this entry.
arXiv Detail & Related papers (2020-03-07T10:06:47Z)
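The summary gestures at scale recalibration without detail. One way to picture it: parallel dilated branches covering different receptive fields (crowd scales), reweighted by a gate computed from global context. The sketch below is an assumption-based illustration, not HSRNet's implementation.
```python
import torch
import torch.nn as nn

class ScaleRecalibrationSketch(nn.Module):
    """Illustrative scale recalibration: dilated branches see different
    receptive fields; a global-context gate learns per-branch weights so
    the block can emphasize the appropriate crowd scale. Hypothetical."""

    def __init__(self, channels: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations
        )
        # Predict one weight per branch from globally pooled context.
        self.weights = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, len(dilations)),
            nn.Softmax(dim=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weights(x)  # (B, num_branches), sums to 1 per sample
        out = 0
        for i, branch in enumerate(self.branches):
            out = out + w[:, i].view(-1, 1, 1, 1) * branch(x)
        return out
```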