MAFNet: A Multi-Attention Fusion Network for RGB-T Crowd Counting
- URL: http://arxiv.org/abs/2208.06761v1
- Date: Sun, 14 Aug 2022 02:42:09 GMT
- Title: MAFNet: A Multi-Attention Fusion Network for RGB-T Crowd Counting
- Authors: Pengyu Chen, Junyu Gao, Yuan Yuan, Qi Wang
- Abstract summary: We propose a two-stream RGB-T crowd counting network called Multi-Attention Fusion Network (MAFNet).
In the encoder part, a Multi-Attention Fusion (MAF) module is embedded into different stages of the two modality-specific branches for cross-modal fusion.
Extensive experiments on two popular datasets show that the proposed MAFNet is effective for RGB-T crowd counting.
- Score: 40.4816930622052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: RGB-Thermal (RGB-T) crowd counting is a challenging task, which uses thermal
images as complementary information to RGB images to deal with the decreased
performance of unimodal RGB-based methods in scenes with low-illumination or
similar backgrounds. Most existing methods propose well-designed structures for
cross-modal fusion in RGB-T crowd counting. However, these methods have
difficulty in encoding cross-modal contextual semantic information in RGB-T
image pairs. Considering the aforementioned problem, we propose a two-stream
RGB-T crowd counting network called Multi-Attention Fusion Network (MAFNet),
which aims to fully capture long-range contextual information from the RGB and
thermal modalities based on the attention mechanism. Specifically, in the
encoder part, a Multi-Attention Fusion (MAF) module is embedded into different
stages of the two modality-specific branches for cross-modal fusion at the
global level. In addition, a Multi-modal Multi-scale Aggregation (MMA)
regression head is introduced to make full use of the multi-scale and
contextual information across modalities to generate high-quality crowd density
maps. Extensive experiments on two popular datasets show that the proposed
MAFNet is effective for RGB-T crowd counting and achieves the state-of-the-art
performance.
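To make the described architecture concrete, the sketch below shows a generic two-stream encoder in which an attention-based fusion module is embedded at each stage and a simple multi-scale head regresses a crowd density map. It is a minimal PyTorch illustration under assumed channel sizes, module designs, and a single-channel thermal input; it is not the authors' MAFNet implementation, and all class names are hypothetical.

```python
# Hypothetical two-stream RGB-T counting sketch; NOT the authors' MAFNet code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttentionFusion(nn.Module):
    """Fuses RGB and thermal feature maps with multi-head attention (assumed design)."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, rgb, thermal):
        b, c, h, w = rgb.shape
        q = rgb.flatten(2).transpose(1, 2)       # (B, H*W, C) query tokens from RGB
        kv = thermal.flatten(2).transpose(1, 2)  # (B, H*W, C) key/value tokens from thermal
        fused, _ = self.attn(q, kv, kv)          # RGB attends to global thermal context
        fused = self.norm(fused + q)             # residual connection + normalization
        return fused.transpose(1, 2).reshape(b, c, h, w)


class TwoStreamCounter(nn.Module):
    """Two modality-specific branches fused at each stage, plus a multi-scale
    regression head that predicts a density map (illustrative only)."""

    def __init__(self):
        super().__init__()

        def stage(cin, cout):
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.ReLU(inplace=True), nn.MaxPool2d(2))

        self.rgb1, self.rgb2 = stage(3, 64), stage(64, 128)
        self.thm1, self.thm2 = stage(1, 64), stage(64, 128)  # 1-channel thermal assumed
        self.fuse1 = CrossModalAttentionFusion(64)
        self.fuse2 = CrossModalAttentionFusion(128)
        self.head = nn.Sequential(nn.Conv2d(64 + 128, 64, 3, padding=1),
                                  nn.ReLU(inplace=True), nn.Conv2d(64, 1, 1))

    def forward(self, rgb, thermal):
        r1, t1 = self.rgb1(rgb), self.thm1(thermal)
        f1 = self.fuse1(r1, t1)                   # stage-1 cross-modal fusion
        r2, t2 = self.rgb2(r1), self.thm2(t1)
        f2 = self.fuse2(r2, t2)                   # stage-2 cross-modal fusion
        f2 = F.interpolate(f2, size=f1.shape[-2:], mode="bilinear", align_corners=False)
        density = self.head(torch.cat([f1, f2], dim=1))
        return density                            # predicted count = density.sum()


model = TwoStreamCounter()
out = model(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(out.shape)  # torch.Size([2, 1, 32, 32])
```

The predicted count is the sum over the density map; a standard (though not necessarily the authors') training choice is an MSE loss between predicted and ground-truth density maps.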
Related papers
- HODINet: High-Order Discrepant Interaction Network for RGB-D Salient Object Detection [4.007827908611563]
RGB-D salient object detection (SOD) aims to detect the prominent regions by jointly modeling RGB and depth information.
Most RGB-D SOD methods apply the same type of backbones and fusion modules to identically learn the multimodality and multistage features.
In this paper, we propose a high-order discrepant interaction network (HODINet) for RGB-D SOD.
arXiv Detail & Related papers (2023-07-03T11:56:21Z)
- Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation [19.41334573257174]
Traditional methods mostly use RGB images, which are heavily affected by lighting conditions, e.g., darkness.
Recent studies show that thermal images are robust in night scenes and can serve as a compensating modality for segmentation.
This work proposes a Residual Spatial Fusion Network (RSFNet) for RGB-T semantic segmentation.
arXiv Detail & Related papers (2023-06-17T14:28:08Z)
- A Multi-modal Approach to Single-modal Visual Place Classification [2.580765958706854]
Multi-sensor fusion approaches combining RGB and depth (D) have gained popularity in recent years.
We reformulate the single-modal RGB image classification task as a pseudo multi-modal RGB-D classification problem.
A practical, fully self-supervised framework for training, appropriately processing, fusing, and classifying these two modalities is described.
arXiv Detail & Related papers (2023-05-10T14:04:21Z)
- Interactive Context-Aware Network for RGB-T Salient Object Detection [7.544240329265388]
We propose a novel network called Interactive Context-Aware Network (ICANet).
ICANet contains three modules that can effectively perform the cross-modal and cross-scale fusions.
Experiments prove that our network performs favorably against the state-of-the-art RGB-T SOD methods.
arXiv Detail & Related papers (2022-11-11T10:04:36Z)
- MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object Detection [15.371153771528093]
We propose a novel Mutual-Transformer Fusion Network (MTFNet) for RGB-D SOD.
MTFNet contains two main modules, i.e., the Focal Feature Extractor (FFE) and the Mutual-Transformer Fusion (MTF).
Comprehensive experimental results on six public benchmarks demonstrate the superiority of our proposed MTFNet.
arXiv Detail & Related papers (2021-12-02T12:48:37Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that presents great efficacy in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- Cross-modality Discrepant Interaction Network for RGB-D Salient Object Detection [78.47767202232298]
We propose a novel Cross-modality Discrepant Interaction Network (CDINet) for RGB-D SOD.
Two components are designed to implement the effective cross-modality interaction.
Our network outperforms 15 state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2021-08-04T11:24:42Z)
- Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a small number of unlabeled RGB-D datasets for pre-training, which helps the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
- Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting [109.32927895352685]
We introduce a large-scale RGBT Crowd Counting (RGBT-CC) benchmark, which contains 2,030 pairs of RGB-thermal images with 138,389 annotated people.
To facilitate the multimodal crowd counting, we propose a cross-modal collaborative representation learning framework.
Experiments conducted on the RGBT-CC benchmark demonstrate the effectiveness of our framework for RGBT crowd counting.
arXiv Detail & Related papers (2020-12-08T16:18:29Z)
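Counting benchmarks such as the RGBT-CC dataset described in the entry above are conventionally scored by summing each predicted density map to obtain a count and reporting MAE and RMSE against the annotated counts. The sketch below illustrates that standard evaluation loop; the model interface and the iterable of (rgb, thermal, ground-truth count) samples are hypothetical placeholders, not the benchmark's official tooling.

```python
# Generic density-map counting evaluation (MAE / RMSE); model and data iterable
# are hypothetical placeholders.
import math
import torch


@torch.no_grad()
def evaluate_counting(model, pairs):
    """`pairs` yields (rgb, thermal, gt_count); returns (MAE, RMSE)."""
    model.eval()
    abs_err, sq_err, n = 0.0, 0.0, 0
    for rgb, thermal, gt_count in pairs:
        density = model(rgb.unsqueeze(0), thermal.unsqueeze(0))  # (1, 1, h, w) density map
        pred_count = density.sum().item()                        # count = integral of density
        abs_err += abs(pred_count - gt_count)
        sq_err += (pred_count - gt_count) ** 2
        n += 1
    return abs_err / n, math.sqrt(sq_err / n)
```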
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Crossmodality Guided Encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
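The entry above describes recalibrating RGB feature responses with depth cues and then aggregating the two recalibrated representations. As a generic, hypothetical illustration of such depth-guided gating (not the paper's Separation-and-Aggregation Gate), a minimal channel-gating module might look like this:

```python
# Hypothetical depth-guided channel gating; a generic sketch, not the SA-Gate module.
import torch
import torch.nn as nn


class DepthGuidedGate(nn.Module):
    """Re-weights RGB channels with a gate computed from pooled depth features."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, rgb_feat, depth_feat):
        # Global-average-pool the depth features to a per-channel descriptor,
        # map it to gates in (0, 1), and recalibrate the RGB responses.
        gate = self.mlp(depth_feat.mean(dim=(2, 3)))       # (B, C)
        recalibrated = rgb_feat * gate[:, :, None, None]   # channel-wise re-weighting
        return recalibrated + depth_feat                   # simple aggregation of both streams


# Example: gate 64-channel feature maps from the two modalities.
out = DepthGuidedGate(64)(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
```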