Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT
Benchmark for Crowd Counting
- URL: http://arxiv.org/abs/2012.04529v2
- Date: Tue, 6 Apr 2021 03:02:31 GMT
- Title: Cross-Modal Collaborative Representation Learning and a Large-Scale RGBT
Benchmark for Crowd Counting
- Authors: Lingbo Liu, Jiaqi Chen, Hefeng Wu, Guanbin Li, Chenglong Li, Liang Lin
- Abstract summary: We introduce a large-scale RGBT Crowd Counting (RGBT-CC) benchmark, which contains 2,030 pairs of RGB-thermal images with 138,389 annotated people.
To facilitate multimodal crowd counting, we propose a cross-modal collaborative representation learning framework.
Experiments conducted on the RGBT-CC benchmark demonstrate the effectiveness of our framework for RGBT crowd counting.
- Score: 109.32927895352685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Crowd counting is a fundamental yet challenging task that requires rich
information to generate pixel-wise crowd density maps. However, most previous
methods used only the limited information of RGB images and cannot reliably
discover potential pedestrians in unconstrained scenarios. In this work, we
find that incorporating optical and thermal information can greatly help to
recognize pedestrians. To promote future research in this field, we introduce
a large-scale RGBT Crowd Counting (RGBT-CC) benchmark, which contains 2,030
pairs of RGB-thermal images with 138,389 annotated people. Furthermore, to
facilitate multimodal crowd counting, we propose a cross-modal collaborative
representation learning framework, which consists of multiple modality-specific
branches, a modality-shared branch, and an Information Aggregation-Distribution
Module (IADM) to fully capture the complementary information of different
modalities. Specifically, our IADM incorporates two collaborative information
transfers to dynamically enhance the modality-shared and modality-specific
representations with a dual information propagation mechanism. Extensive
experiments conducted on the RGBT-CC benchmark demonstrate the effectiveness of
our framework for RGBT crowd counting. Moreover, the proposed approach is
universal for multimodal crowd counting and also achieves superior performance
on the ShanghaiTechRGBD dataset. Finally, our source code and benchmark are
released at http://lingboliu.com/RGBT_Crowd_Counting.html.
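The sketch below illustrates the three-branch layout described in the abstract: modality-specific RGB and thermal branches plus a modality-shared branch, with an aggregation-and-distribution step after each backbone stage. It is a minimal PyTorch sketch under stated assumptions; the backbone, channel sizes, the way the shared branch is seeded, and the internals of the simplified module are illustrative placeholders, not the paper's actual IADM design.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Small stand-in for one stage of a VGG/CSRNet-style backbone (assumption).
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class SimplifiedIADM(nn.Module):
    """Toy aggregation-distribution step (not the paper's exact IADM).

    Aggregation: fold both modality-specific features into the shared feature.
    Distribution: send the updated shared feature back to each specific branch.
    """
    def __init__(self, ch):
        super().__init__()
        self.aggregate = nn.Conv2d(3 * ch, ch, 1)
        self.distribute_rgb = nn.Conv2d(2 * ch, ch, 1)
        self.distribute_t = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, f_rgb, f_shared, f_t):
        shared = self.aggregate(torch.cat([f_rgb, f_shared, f_t], dim=1)) + f_shared
        rgb = self.distribute_rgb(torch.cat([f_rgb, shared], dim=1)) + f_rgb
        t = self.distribute_t(torch.cat([f_t, shared], dim=1)) + f_t
        return rgb, shared, t

class ThreeBranchCounter(nn.Module):
    def __init__(self, chs=(64, 128, 256)):
        super().__init__()
        ins = (3,) + chs[:-1]
        self.rgb_stages = nn.ModuleList([conv_block(i, o) for i, o in zip(ins, chs)])
        self.shared_stages = nn.ModuleList([conv_block(i, o) for i, o in zip(ins, chs)])
        self.t_stages = nn.ModuleList([conv_block(i, o) for i, o in zip(ins, chs)])
        self.iadms = nn.ModuleList([SimplifiedIADM(c) for c in chs])
        self.head = nn.Conv2d(chs[-1], 1, 1)  # predicts the crowd density map

    def forward(self, rgb, thermal):
        # Shared branch seeded from the RGB input here (an assumption, not the paper's design).
        f_rgb, f_shared, f_t = rgb, rgb, thermal
        for s_rgb, s_sh, s_t, iadm in zip(self.rgb_stages, self.shared_stages,
                                          self.t_stages, self.iadms):
            f_rgb, f_shared, f_t = s_rgb(f_rgb), s_sh(f_shared), s_t(f_t)
            f_rgb, f_shared, f_t = iadm(f_rgb, f_shared, f_t)
        return torch.relu(self.head(f_shared))  # non-negative density map

density = ThreeBranchCounter()(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256))
count = density.sum()  # the predicted head count is the integral of the density map
```

The key design point is that the shared branch never sees raw inputs in isolation: every stage exchanges information with both modality-specific branches before the next stage runs.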
Related papers
- Multi-modal Crowd Counting via a Broker Modality [64.5356816448361]
Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images.
We propose a novel approach that introduces an auxiliary broker modality and frames the task as a triple-modal learning problem.
We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models.
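A rough sketch of the triple-modal idea follows, assuming a plain convolutional fusion network as the broker generator; the cited paper derives its lightweight generator from a denoising-diffusion fusion model, which is not reproduced here.

```python
import torch
import torch.nn as nn

class BrokerGenerator(nn.Module):
    """Lightweight RGB+T fusion that outputs a third, 'broker' image.

    Illustrative only: a plain conv stack standing in for the paper's
    non-diffusion counterpart of a diffusion-based fusion model.
    """
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),  # keep the broker image-like
        )

    def forward(self, rgb, thermal):
        return self.net(torch.cat([rgb, thermal], dim=1))

# Triple-modal counting: each modality (RGB, broker, thermal) gets its own
# branch, and their features are fused before the density head.
rgb = torch.rand(1, 3, 256, 256)
thermal = torch.rand(1, 3, 256, 256)
broker = BrokerGenerator()(rgb, thermal)
triple_input = [rgb, broker, thermal]  # fed to a three-branch counting network
```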
arXiv Detail & Related papers (2024-07-10T10:13:11Z)
- Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching.
An anchor branch is first trained to provide insights into the data properties.
A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
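The sketch below shows one way such a two-branch "boosting" coupling could look, assuming a hinge-based triplet ranking loss and a hypothetical margin schedule in which the target branch's margin grows with the separation already achieved by the anchor branch; the exact DBL formulation differs.

```python
import torch
import torch.nn.functional as F

def ranking_loss(sim_pos, sim_neg, margin):
    # Standard hinge-based triplet ranking loss used in image-text matching.
    return F.relu(margin + sim_neg - sim_pos).mean()

def dbl_losses(anchor_pos, anchor_neg, target_pos, target_neg, base_margin=0.2):
    """Illustrative branch coupling (not the exact DBL algorithm).

    The anchor branch uses a fixed margin; the target branch's margin is
    enlarged by the gap the anchor branch already achieves, pushing it to
    separate matched and unmatched pairs further.
    """
    anchor_loss = ranking_loss(anchor_pos, anchor_neg, base_margin)
    adaptive_margin = base_margin + (anchor_pos - anchor_neg).clamp(min=0).detach()
    target_loss = F.relu(adaptive_margin + target_neg - target_pos).mean()
    return anchor_loss, target_loss

# sim_* are cosine similarities between image and text embeddings (dummy values here).
a_pos, a_neg = torch.rand(8), torch.rand(8)
t_pos, t_neg = torch.rand(8), torch.rand(8)
loss_anchor, loss_target = dbl_losses(a_pos, a_neg, t_pos, t_neg)
```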
arXiv Detail & Related papers (2024-04-28T08:44:28Z)
- RGBT Tracking via Progressive Fusion Transformer with Dynamically Guided Learning [37.067605349559]
We propose a novel Progressive Fusion Transformer called ProFormer.
It integrates single-modality information into the multimodal representation for robust RGBT tracking.
ProFormer sets new state-of-the-art performance on the RGBT210, RGBT234, LasHeR, and VTUAV datasets.
arXiv Detail & Related papers (2023-03-26T16:55:58Z)
- RGB-T Multi-Modal Crowd Counting Based on Transformer [8.870454119294003]
We use count-guided multi-modal fusion and modal-guided count enhancement to achieve impressive performance.
Experiments on the public RGBT-CC dataset show that our method refreshes the state-of-the-art results.
arXiv Detail & Related papers (2023-01-08T12:59:52Z)
- CLIP-Driven Fine-grained Text-Image Person Re-identification [50.94827165464813]
TIReID aims to retrieve the image corresponding to the given text query from a pool of candidate images.
We propose a CLIP-driven Fine-grained information excavation framework (CFine) to fully utilize the powerful knowledge of CLIP for TIReID.
arXiv Detail & Related papers (2022-10-19T03:43:12Z)
- MAFNet: A Multi-Attention Fusion Network for RGB-T Crowd Counting [40.4816930622052]
We propose a two-stream RGB-T crowd counting network called the Multi-Attention Fusion Network (MAFNet).
In the encoder part, a Multi-Attention Fusion (MAF) module is embedded into different stages of the two modality-specific branches for cross-modal fusion.
Extensive experiments on two popular datasets show that the proposed MAFNet is effective for RGB-T crowd counting.
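A minimal sketch of a stage-wise cross-modal fusion block follows, assuming a simple channel-attention exchange between the two streams; the actual MAF module uses a richer multi-attention design.

```python
import torch
import torch.nn as nn

class CrossModalAttentionFusion(nn.Module):
    """Simplified stand-in for a MAF-style block: each stream is reweighted by
    channel attention computed from the concatenated RGB-T features."""
    def __init__(self, ch):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * ch, 2 * ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(2 * ch, 2 * ch, 1), nn.Sigmoid(),
        )

    def forward(self, f_rgb, f_t):
        g = self.gate(torch.cat([f_rgb, f_t], dim=1))
        g_rgb, g_t = g.chunk(2, dim=1)
        # Cross-modal exchange: each branch keeps its own feature plus the
        # attention-weighted feature of the other modality.
        return f_rgb + g_t * f_t, f_t + g_rgb * f_rgb

f_rgb, f_t = torch.rand(1, 64, 32, 32), torch.rand(1, 64, 32, 32)
f_rgb, f_t = CrossModalAttentionFusion(64)(f_rgb, f_t)  # inserted per encoder stage
```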
arXiv Detail & Related papers (2022-08-14T02:42:09Z)
- RGBT Tracking via Multi-Adapter Network with Hierarchical Divergence Loss [37.99375824040946]
We propose a novel multi-adapter network to jointly perform modality-shared, modality-specific and instance-aware target representation learning.
Experiments on two RGBT tracking benchmark datasets demonstrate the outstanding performance of the proposed tracker.
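A toy sketch of the modality-shared plus modality-specific idea follows, assuming a shared convolution with small per-modality adapters; the instance-aware component and the hierarchical divergence loss of the cited tracker are not shown.

```python
import torch
import torch.nn as nn

class MultiAdapterBlock(nn.Module):
    """Illustrative block with one shared convolution plus small per-modality
    adapters, loosely following the multi-adapter idea (not the cited network)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.shared = nn.Conv2d(in_ch, out_ch, 3, padding=1)  # modality-shared weights
        self.rgb_adapter = nn.Conv2d(in_ch, out_ch, 1)        # modality-specific adapters
        self.t_adapter = nn.Conv2d(in_ch, out_ch, 1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x, modality):
        adapter = self.rgb_adapter if modality == "rgb" else self.t_adapter
        return self.act(self.shared(x) + adapter(x))

block = MultiAdapterBlock(3, 64)
f_rgb = block(torch.rand(1, 3, 128, 128), "rgb")
f_t = block(torch.rand(1, 3, 128, 128), "thermal")
```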
arXiv Detail & Related papers (2020-11-14T01:50:46Z)
- Searching Multi-Rate and Multi-Modal Temporal Enhanced Networks for Gesture Recognition [89.0152015268929]
We propose the first neural architecture search (NAS)-based method for RGB-D gesture recognition.
The proposed method includes two key components: 1) enhanced temporal representation via the 3D Central Difference Convolution (3D-CDC) family, and 2) optimized backbones for multi-modal-rate branches and lateral connections.
The resultant multi-rate network provides a new perspective to understand the relationship between RGB and depth modalities and their temporal dynamics.
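Below is a sketch of the 3D central difference convolution idea, assuming the common implementation trick of subtracting a theta-weighted response of the summed kernel applied at the center position; theta = 0 recovers a vanilla 3D convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CDC3D(nn.Module):
    """3D central difference convolution (sketch of the 3D-CDC idea): blends a
    vanilla 3D convolution with a central-difference term that emphasizes local
    spatio-temporal changes."""
    def __init__(self, in_ch, out_ch, kernel_size=3, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.theta = theta

    def forward(self, x):
        out = self.conv(x)
        if self.theta == 0:
            return out
        # Central-difference term: response of the summed kernel applied to the
        # center element, equivalent to convolving (x(p + pn) - x(p)).
        kernel_sum = self.conv.weight.sum(dim=(2, 3, 4), keepdim=True)
        out_center = F.conv3d(x, kernel_sum)
        return out - self.theta * out_center

clip = torch.rand(1, 3, 16, 112, 112)   # (batch, channels, time, height, width)
features = CDC3D(3, 64)(clip)           # same T, H, W with a 3x3x3 kernel
```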
arXiv Detail & Related papers (2020-08-21T10:45:09Z)
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels, and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient cross-modality guided encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
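A simplified sketch of such bi-directional recalibration between RGB and depth features follows, assuming plain channel-wise gates; the actual Separation-and-Aggregation Gate of the cited paper is more elaborate.

```python
import torch
import torch.nn as nn

class SimpleCrossModalGate(nn.Module):
    """Simplified recalibration gate in the spirit of the cited encoder: each
    modality's feature is reweighted using channel statistics of the other
    modality, then the two recalibrated features are aggregated."""
    def __init__(self, ch):
        super().__init__()
        self.rgb_from_depth = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.depth_from_rgb = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid())
        self.aggregate = nn.Conv2d(2 * ch, ch, 1)

    def forward(self, f_rgb, f_depth):
        f_rgb = f_rgb * self.rgb_from_depth(f_depth)     # recalibrate RGB with depth cues
        f_depth = f_depth * self.depth_from_rgb(f_rgb)   # recalibrate depth with RGB cues
        fused = self.aggregate(torch.cat([f_rgb, f_depth], dim=1))
        return f_rgb, f_depth, fused                     # propagated to the next stage

f_rgb, f_d = torch.rand(1, 64, 60, 80), torch.rand(1, 64, 60, 80)
f_rgb, f_d, fused = SimpleCrossModalGate(64)(f_rgb, f_d)
```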
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
- Multi-interactive Dual-decoder for RGB-thermal Salient Object Detection [37.79290349045164]
RGB-thermal salient object detection (SOD) aims to segment the common prominent regions of visible image and corresponding thermal infrared image.
Existing methods do not fully explore and exploit the complementarity of different modalities and the multi-type cues of image content.
We propose a multi-interactive dual-decoder to mine and model the multi-type interactions for accurate RGBT SOD.
arXiv Detail & Related papers (2020-05-05T16:21:17Z)