Related papers: Multimodal Spatio-temporal Graph Learning for Alignment-free RGBT Video Object Detection

Multimodal Spatio-temporal Graph Learning for Alignment-free RGBT Video Object Detection

URL: http://arxiv.org/abs/2504.11779v1
Date: Wed, 16 Apr 2025 05:32:59 GMT
Title: Multimodal Spatio-temporal Graph Learning for Alignment-free RGBT Video Object Detection
Authors: Qishun Wang, Zhengzheng Tu, Chenglong Li, Bo Jiang,
Abstract summary: RGB-Thermal Video Object Detection (RGBT VOD) can address the limitation of traditional RGB-based VOD in challenging lighting conditions.<n>We propose a novel Multimodal Spatio-temporal Graph learning Network (MSGNet) for alignment-free RGBT VOD problem.
Score: 13.682115079677466
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: RGB-Thermal Video Object Detection (RGBT VOD) can address the limitation of traditional RGB-based VOD in challenging lighting conditions, making it more practical and effective in many applications. However, similar to most RGBT fusion tasks, it still mainly relies on manually aligned multimodal image pairs. In this paper, we propose a novel Multimodal Spatio-temporal Graph learning Network (MSGNet) for alignment-free RGBT VOD problem by leveraging the robust graph representation learning model. Specifically, we first design an Adaptive Partitioning Layer (APL) to estimate the corresponding regions of the Thermal image within the RGB image (high-resolution), achieving a preliminary inexact alignment. Then, we introduce the Spatial Sparse Graph Learning Module (S-SGLM) which employs a sparse information passing mechanism on the estimated inexact alignment to achieve reliable information interaction between different modalities. Moreover, to fully exploit the temporal cues for RGBT VOD problem, we introduce Hybrid Structured Temporal Modeling (HSTM), which involves a Temporal Sparse Graph Learning Module (T-SGLM) and Temporal Star Block (TSB). T-SGLM aims to filter out some redundant information between adjacent frames by employing the sparse aggregation mechanism on the temporal graph. Meanwhile, TSB is dedicated to achieving the complementary learning of local spatial relationships. Extensive comparative experiments conducted on both the aligned dataset VT-VOD50 and the unaligned dataset UVT-VOD2024 demonstrate the effectiveness and superiority of our proposed method. Our project will be made available on our website for free public access.

Related papers

Breaking Alignment Barriers: TPS-Driven Semantic Correlation Learning for Alignment-Free RGB-T Salient Object Detection [34.62005077259452]
Existing RGB-T salient object detection methods rely on manually aligned and annotated datasets.<n>We propose an efficient RGB-T SOD method for real-world unaligned image pairs, termed Thin-Plate Spline-driven Semantic Correlation Learning Network (TPS-SCL)<n>TPS-SCL attains state-of-the-art (SOTA) performance among existing lightweight SOD methods and outperforms mainstream RGB-T SOD approaches.
arXiv Detail & Related papers (2025-12-26T04:37:49Z)
STRGCN: Capturing Asynchronous Spatio-Temporal Dependencies for Irregular Multivariate Time Series Forecasting [14.156419219696252]
STRGCN captures the complex interdependencies in IMTS by representing them as a fully connected graph.<n>Experiments on four public datasets demonstrate that STRGCN achieves state-of-the-art accuracy, competitive memory usage and training speed.
arXiv Detail & Related papers (2025-05-07T06:41:33Z)
Unveiling the Limits of Alignment: Multi-modal Dynamic Local Fusion Network and A Benchmark for Unaligned RGBT Video Object Detection [5.068440399797739]
Current RGB-Thermal Video Object Detection (RGBT VOD) methods depend on manually aligning data at the image level. We propose a Multi-modal Dynamic Local fusion Network (MDLNet) designed to handle unaligned RGBT pairs. We conduct a comprehensive evaluation and comparison with MDLNet and state-of-the-art (SOTA) models, demonstrating the superior effectiveness of MDLNet.
arXiv Detail & Related papers (2024-10-16T01:06:12Z)
Enhancing Video-Language Representations with Structural Spatio-Temporal Alignment [130.15775113897553]
Finsta is a fine-grained structural-temporal alignment learning method. It consistently improves the existing 13 strong-tuning video-language models.
arXiv Detail & Related papers (2024-06-27T15:23:36Z)
Fully-Connected Spatial-Temporal Graph for Multivariate Time-Series Data [50.84488941336865]
We propose a novel method called Fully- Spatial-Temporal Graph Neural Network (FC-STGNN) For graph construction, we design a decay graph to connect sensors across all timestamps based on their temporal distances. For graph convolution, we devise FC graph convolution with a moving-pooling GNN layer to effectively capture the ST dependencies for learning effective representations.
arXiv Detail & Related papers (2023-09-11T08:44:07Z)
Erasure-based Interaction Network for RGBT Video Object Detection and A Unified Benchmark [9.979933455242774]
This work introduces a new computer vision task called RGB-thermal (RGBT) VOD. Traditional Video Object Detection (VOD) methods often leverage temporal information. We develop a negative activation function that is used to erase the noise of RGB features with the help of thermal image features.
arXiv Detail & Related papers (2023-08-03T09:04:48Z)
A Unified Multimodal De- and Re-coupling Framework for RGB-D Motion Recognition [24.02488085447691]
We introduce a novel video data augmentation method dubbed ShuffleMix, which acts as a supplement to MixUp, to provide additional temporal regularization for motion recognition. Secondly, a Unified Multimodal De-coupling and multi-stage Re-coupling framework, termed UMDR, is proposed for video representation learning.
arXiv Detail & Related papers (2022-11-16T19:00:23Z)
Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through the tightly coupled multi-temporal representation. We propose to decouple and recouple caused caused representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z)
TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL) Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter- snippet Temporal Contrastive Graphs (TCG) To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z)
GACAN: Graph Attention-Convolution-Attention Networks for Traffic Forecasting Based on Multi-granularity Time Series [9.559635281384134]
We propose Graph Attention-Convolution-Attention Networks (GACAN) for traffic forecasting. The model uses a novel Att-Conv-Att block which contains two graph attention layers and one spectral-based GCN layer sandwiched in between. The main novelty of the model is the integration of time series of four different time granularities.
arXiv Detail & Related papers (2021-10-27T10:21:13Z)
Spatiotemporal Inconsistency Learning for DeepFake Video Detection [51.747219106855624]
We present a novel temporal modeling paradigm in TIM by exploiting the temporal difference over adjacent frames along with both horizontal and vertical directions. And the ISM simultaneously utilizes the spatial information from SIM and temporal information from TIM to establish a more comprehensive spatial-temporal representation.
arXiv Detail & Related papers (2021-09-04T13:05:37Z)
Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation. CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body. It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
Temporal Contrastive Graph Learning for Video Action Recognition and Retrieval [83.56444443849679]
This work takes advantage of the temporal dependencies within videos and proposes a novel self-supervised method named Temporal Contrastive Graph Learning (TCGL) Our TCGL roots in a hybrid graph contrastive learning strategy to jointly regard the inter-snippet and intra-snippet temporal dependencies as self-supervision signals for temporal representation learning. Experimental results demonstrate the superiority of our TCGL over the state-of-the-art methods on large-scale action recognition and video retrieval benchmarks.
arXiv Detail & Related papers (2021-01-04T08:11:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.