RGB-T Multi-Modal Crowd Counting Based on Transformer
- URL: http://arxiv.org/abs/2301.03033v1
- Date: Sun, 8 Jan 2023 12:59:52 GMT
- Title: RGB-T Multi-Modal Crowd Counting Based on Transformer
- Authors: Zhengyi Liu, Wei Wu, Yacheng Tan, Guanghui Zhang
- Abstract summary: We use count-guided multi-modal fusion and modal-guided count enhancement to achieve impressive performance.
Experiments on the public RGBT-CC dataset show that our method sets new state-of-the-art results.
- Score: 8.870454119294003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Crowd counting aims to estimate the number of persons in a scene. Most
state-of-the-art crowd counting methods based on color images cannot work well
under poor illumination because objects become invisible. With the widespread
use of infrared cameras, crowd counting based on paired color and thermal images
has been studied. Existing methods only perform multi-modal fusion without a
count objective constraint. To better exploit multi-modal information, we use
count-guided multi-modal fusion and modal-guided count enhancement to achieve
impressive performance. The proposed count-guided multi-modal fusion module
utilizes a multi-scale token transformer to let the two modalities interact
under the guidance of count information and to perceive different scales from
the token perspective. The proposed modal-guided count enhancement module employs
a multi-scale deformable transformer decoder structure to enhance one modality's
features and count information with the other modality. Experiments on the public
RGBT-CC dataset show that our method sets new state-of-the-art results.
https://github.com/liuzywen/RGBTCC
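As a reading aid, here is a minimal PyTorch-style sketch of what a count-guided fusion of RGB and thermal tokens could look like. It is only an illustration of the idea described in the abstract: the module names, shapes, and the way the count signal is injected are assumptions, not the authors' implementation (the official code is in the repository linked above).

```python
# Illustrative sketch only: a learned count token attends jointly to RGB and
# thermal tokens, so fusion is learned under an explicit count objective.
import torch
import torch.nn as nn

class CountGuidedFusion(nn.Module):
    """Fuse RGB and thermal token sequences under the guidance of a learned count token."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.count_token = nn.Parameter(torch.zeros(1, 1, dim))  # learned count query (assumption)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.count_head = nn.Linear(dim, 1)                      # auxiliary count regression

    def forward(self, rgb_tokens, thermal_tokens):
        # rgb_tokens, thermal_tokens: (B, N, dim) tokens from the two backbones
        b = rgb_tokens.size(0)
        count_tok = self.count_token.expand(b, -1, -1)
        # Let the count token and both modalities attend to each other in one sequence.
        tokens = torch.cat([count_tok, rgb_tokens, thermal_tokens], dim=1)
        fused, _ = self.attn(tokens, tokens, tokens)
        fused = self.norm(fused + tokens)
        count_pred = self.count_head(fused[:, 0])    # count supervision applied to the count token
        return fused[:, 1:], count_pred              # fused multi-modal tokens, predicted count

# Dummy usage
rgb, thermal = torch.randn(2, 196, 256), torch.randn(2, 196, 256)
fused, count = CountGuidedFusion()(rgb, thermal)
print(fused.shape, count.shape)   # torch.Size([2, 392, 256]) torch.Size([2, 1])
```

The modal-guided count enhancement module would play the complementary role, refining one modality's features and count estimate by cross-attending to the other modality, e.g. with a multi-scale deformable transformer decoder; the sketch above covers only the fusion side.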
Related papers
- Multi-modal Crowd Counting via Modal Emulation [41.959740205234446]
We propose a modal emulation-based two-pass multi-modal crowd-counting framework.
The framework consists of two key components: a multi-modal inference pass and a cross-modal emulation pass.
Experiments on both RGB-Thermal and RGB-Depth counting datasets demonstrate its superior performance compared to previous methods.
arXiv Detail & Related papers (2024-07-28T13:14:57Z)
- Multi-modal Crowd Counting via a Broker Modality [64.5356816448361]
Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images.
We propose a novel approach that introduces an auxiliary broker modality and frames the task as a triple-modal learning problem.
We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models.
arXiv Detail & Related papers (2024-07-10T10:13:11Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- MAFNet: A Multi-Attention Fusion Network for RGB-T Crowd Counting [40.4816930622052]
We propose a two-stream RGB-T crowd counting network called Multi-Attention Fusion Network (MAFNet).
In the encoder part, a Multi-Attention Fusion (MAF) module is embedded into different stages of the two modality-specific branches for cross-modal fusion.
Extensive experiments on two popular datasets show that the proposed MAFNet is effective for RGB-T crowd counting.
arXiv Detail & Related papers (2022-08-14T02:42:09Z)
- Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact (a minimal sketch of this token-substitution idea appears after this list).
arXiv Detail & Related papers (2022-04-19T07:47:50Z)
- Towards Reliable Image Outpainting: Learning Structure-Aware Multimodal Fusion with Depth Guidance [49.94504248096527]
We propose a Depth-Guided Outpainting Network (DGONet) to model the feature representations of different modalities.
Two components are designed: 1) the Multimodal Learning Module produces unique depth and RGB feature representations from the perspectives of different modal characteristics.
We specially design an additional constraint strategy consisting of Cross-modal Loss and Edge Loss to enhance ambiguous contours and expedite reliable content generation.
arXiv Detail & Related papers (2022-04-12T06:06:50Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
The key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that presents great efficacy in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT Benchmark for Crowd Counting [109.32927895352685]
We introduce a large-scale RGBT Crowd Counting (RGBT-CC) benchmark, which contains 2,030 pairs of RGB-thermal images with 138,389 annotated people.
To facilitate multimodal crowd counting, we propose a cross-modal collaborative representation learning framework.
Experiments conducted on the RGBT-CC benchmark demonstrate the effectiveness of our framework for RGBT crowd counting.
arXiv Detail & Related papers (2020-12-08T16:18:29Z)
- Multi-interactive Dual-decoder for RGB-thermal Salient Object Detection [37.79290349045164]
RGB-thermal salient object detection (SOD) aims to segment the common prominent regions of visible image and corresponding thermal infrared image.
Existing methods do not fully explore and exploit the complementary potential of different modalities and the multi-type cues of image content.
We propose a multi-interactive dual-decoder to mine and model the multi-type interactions for accurate RGBT SOD.
arXiv Detail & Related papers (2020-05-05T16:21:17Z)
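For the TokenFusion entry above, the following is a minimal, hypothetical sketch of its token-substitution idea: score each token's informativeness per modality and replace low-scoring tokens with projected features from the other modality. The scoring layers, threshold, and projections are assumptions for illustration, not the original implementation.

```python
# Hypothetical sketch of TokenFusion-style substitution: uninformative tokens in
# one modality are replaced by projected tokens from the other modality.
import torch
import torch.nn as nn

class TokenSubstitution(nn.Module):
    def __init__(self, dim=256, threshold=0.02):
        super().__init__()
        self.score_a = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # informativeness score, modality A
        self.score_b = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())  # informativeness score, modality B
        self.proj_b2a = nn.Linear(dim, dim)   # project B tokens into A's token space
        self.proj_a2b = nn.Linear(dim, dim)   # project A tokens into B's token space
        self.threshold = threshold

    def forward(self, tok_a, tok_b):
        # tok_a, tok_b: (B, N, dim) spatially aligned token sequences from two modalities
        mask_a = self.score_a(tok_a) < self.threshold   # (B, N, 1) uninformative tokens in A
        mask_b = self.score_b(tok_b) < self.threshold
        out_a = torch.where(mask_a, self.proj_b2a(tok_b), tok_a)   # substitute A's weak tokens
        out_b = torch.where(mask_b, self.proj_a2b(tok_a), tok_b)   # substitute B's weak tokens
        return out_a, out_b
```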