Strip-Fusion: Spatiotemporal Fusion for Multispectral Pedestrian Detection
- URL: http://arxiv.org/abs/2601.18008v1
- Date: Sun, 25 Jan 2026 21:58:07 GMT
- Title: Strip-Fusion: Spatiotemporal Fusion for Multispectral Pedestrian Detection
- Authors: Asiegbu Miracle Kanu-Asiegbu, Nitin Jotwani, Xiaoxiao Du
- Abstract summary: Multispectral modalities (visible light and thermal) can boost pedestrian detection performance by providing complementary visual information. Existing approaches primarily focus on spatial fusion and often neglect temporal information. This work proposes Strip-Fusion, a spatial-temporal fusion network that is robust to misalignment in input images.
- Score: 0.27528170226206433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pedestrian detection is a critical task in robot perception. Multispectral modalities (visible light and thermal) can boost pedestrian detection performance by providing complementary visual information. Several gaps remain with multispectral pedestrian detection methods. First, existing approaches primarily focus on spatial fusion and often neglect temporal information. Second, RGB and thermal image pairs in multispectral benchmarks may not always be perfectly aligned. Pedestrians are also challenging to detect due to varying lighting conditions, occlusion, etc. This work proposes Strip-Fusion, a spatial-temporal fusion network that is robust to misalignment in input images, as well as varying lighting conditions and heavy occlusions. The Strip-Fusion pipeline integrates temporally adaptive convolutions to dynamically weigh spatial-temporal features, enabling our model to better capture pedestrian motion and context over time. A novel Kullback-Leibler divergence loss was designed to mitigate modality imbalance between visible and thermal inputs, guiding feature alignment toward the more informative modality during training. Furthermore, a novel post-processing algorithm was developed to reduce false positives. Extensive experimental results show that our method performs competitively for both the KAIST and the CVC-14 benchmarks. We also observed significant improvements compared to previous state-of-the-art on challenging conditions such as heavy occlusion and misalignment.
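For readers unfamiliar with the quantity behind the loss named in the abstract, the following is a minimal sketch of a discrete Kullback-Leibler divergence between two per-modality distributions. It is not the authors' loss: the abstract does not give the formulation, and the function names, the softmax normalization, and the sample scores below are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical activations for one image region in each modality.
thermal_scores = [2.0, 0.5, 0.1]
visible_scores = [0.3, 0.4, 0.2]

p_thermal = softmax(thermal_scores)
p_visible = softmax(visible_scores)

# Divergence of the visible distribution from the thermal one.
kl_tv = kl_divergence(p_thermal, p_visible)
# A distribution compared with itself has zero divergence.
kl_same = kl_divergence(p_thermal, p_thermal)

print(f"KL(thermal || visible) = {kl_tv:.4f}")
print(f"KL(thermal || thermal) = {kl_same:.4f}")
```

KL divergence is asymmetric, so which modality plays the role of the reference distribution matters; the choice of thermal as the reference here is arbitrary.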
Related papers
- Contrast-Guided Cross-Modal Distillation for Thermal Object Detection [1.8477401359673709]
Low contrast and weak high-frequency cues lead to duplicate, overlapping boxes, missed small objects, and class confusion. We introduce training-only objectives that sharpen instance-level decision boundaries by pulling together features of the same class. In experiments, our method outperformed prior approaches and achieved state-of-the-art performance.
arXiv Detail & Related papers (2025-11-03T10:38:01Z)
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z)
- Transformer-Based Dual-Optical Attention Fusion Crowd Head Point Counting and Localization Network [9.214772627896156]
The model designs a dual-optical attention fusion module (DAFP) by introducing complementary information from infrared images. The proposed method outperforms existing techniques in terms of performance, especially in challenging dense low-light scenes.
arXiv Detail & Related papers (2025-05-11T10:55:14Z)
- MambaST: A Plug-and-Play Cross-Spectral Spatial-Temporal Fuser for Efficient Pedestrian Detection [0.5898893619901381]
This paper proposes MambaST, a plug-and-play cross-spectral spatial-temporal fusion pipeline for efficient pedestrian detection.
It is difficult to perform accurate detection using RGB cameras under dark or low-light conditions.
Our proposed model also achieves superior performance on small-scale pedestrian detection.
arXiv Detail & Related papers (2024-08-02T06:20:48Z)
- Beyond Night Visibility: Adaptive Multi-Scale Fusion of Infrared and Visible Images [49.75771095302775]
We propose an Adaptive Multi-scale Fusion network (AMFusion) with infrared and visible images.
First, we separately fuse spatial and semantic features from infrared and visible images, where the former are used for the adjustment of light distribution.
Second, we utilize detection features extracted by a pre-trained backbone that guide the fusion of semantic features.
Third, we propose a new illumination loss to constrain the fused image to have normal light intensity.
arXiv Detail & Related papers (2024-03-02T03:52:07Z)
- Graph Spatiotemporal Process for Multivariate Time Series Anomaly Detection with Missing Values [67.76168547245237]
We introduce a novel framework called GST-Pro, which utilizes a graph spatiotemporal process and an anomaly scorer to detect anomalies.
Our experimental results show that the GST-Pro method can effectively detect anomalies in time series data and outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-01-11T10:10:16Z)
- Factorized Inverse Path Tracing for Efficient and Accurate Material-Lighting Estimation [97.0195314255101]
Inverse path tracing is expensive to compute, and ambiguities exist between reflection and emission.
Our Factorized Inverse Path Tracing (FIPT) addresses these challenges by using a factored light transport formulation.
Our algorithm enables accurate material and lighting optimization faster than previous work, and is more effective at resolving ambiguities.
arXiv Detail & Related papers (2023-04-12T07:46:05Z)
- Breaking Modality Disparity: Harmonized Representation for Infrared and Visible Image Registration [66.33746403815283]
We propose a scene-adaptive infrared and visible image registration method.
We employ homography to simulate the deformation between different planes.
We propose the first misaligned infrared and visible image dataset with available ground truth.
arXiv Detail & Related papers (2023-04-12T06:49:56Z)
- ReDFeat: Recoupling Detection and Description for Multimodal Feature Learning [51.07496081296863]
We recouple independent constraints of detection and description of multimodal feature learning with a mutual weighting strategy.
We propose a detector that possesses a large receptive field and is equipped with learnable non-maximum suppression layers.
We build a benchmark that contains cross visible, infrared, near-infrared and synthetic aperture radar image pairs for evaluating the performance of features in feature matching and image registration tasks.
arXiv Detail & Related papers (2022-05-16T04:24:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.