CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features
- URL: http://arxiv.org/abs/2505.19434v1
- Date: Mon, 26 May 2025 02:53:12 GMT
- Title: CSTrack: Enhancing RGB-X Tracking via Compact Spatiotemporal Features
- Authors: X. Feng, D. Zhang, S. Hu, X. Li, M. Wu, J. Zhang, X. Chen, K. Huang
- Abstract summary: CSTrack focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. The method achieves new SOTA results on mainstream RGB-X benchmarks.
- Score: 0.6143225301480709
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Effectively modeling and utilizing spatiotemporal features from RGB and other modalities (e.g., depth, thermal, and event data, denoted as X) is the core of RGB-X tracker design. Existing methods often employ two parallel branches to separately process the RGB and X input streams, requiring the model to simultaneously handle two dispersed feature spaces, which complicates both the model structure and computation process. More critically, intra-modality spatial modeling within each dispersed space incurs substantial computational overhead, limiting resources for inter-modality spatial modeling and temporal modeling. To address this, we propose a novel tracker, CSTrack, which focuses on modeling Compact Spatiotemporal features to achieve simple yet effective tracking. Specifically, we first introduce an innovative Spatial Compact Module that integrates the RGB-X dual input streams into a compact spatial feature, enabling thorough intra- and inter-modality spatial modeling. Additionally, we design an efficient Temporal Compact Module that compactly represents temporal features by constructing the refined target distribution heatmap. Extensive experiments validate the effectiveness of our compact spatiotemporal modeling method, with CSTrack achieving new SOTA results on mainstream RGB-X benchmarks. The code and models will be released at: https://github.com/XiaokunFeng/CSTrack.
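To make the abstract's two modules concrete, here is a minimal PyTorch sketch of the compact-spatiotemporal idea: the RGB and X token streams are merged into one compact sequence so a single attention block performs intra- and inter-modality spatial modeling, and temporal context is summarized as a target-distribution heatmap rather than dense per-frame features. All module names, shapes, and layer choices below are illustrative assumptions, not the authors' released implementation (see the GitHub link above for that).

```python
# Hedged sketch of the compact-spatiotemporal idea from the abstract.
# Layer choices and shapes are assumptions for illustration only.
import torch
import torch.nn as nn


class SpatialCompactModule(nn.Module):
    """Merge RGB and X token streams into one compact sequence so a single
    attention block handles intra- and inter-modality spatial modeling."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)  # compress the two streams
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, x_tokens: torch.Tensor):
        # rgb_tokens, x_tokens: (B, N, C) token sequences per modality
        compact = self.fuse(torch.cat([rgb_tokens, x_tokens], dim=-1))
        out, _ = self.attn(compact, compact, compact)  # joint spatial modeling
        return self.norm(compact + out)


class TemporalCompactModule(nn.Module):
    """Summarize temporal context as a single-channel target-distribution
    heatmap instead of carrying dense per-frame features."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.to_heatmap = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, tokens: torch.Tensor, feat_hw: int):
        # tokens: (B, N, C) with N == feat_hw * feat_hw search-region tokens
        heat = self.to_heatmap(tokens)  # (B, N, 1)
        return heat.transpose(1, 2).reshape(-1, 1, feat_hw, feat_hw)


if __name__ == "__main__":
    B, HW, C = 2, 16, 256
    rgb = torch.randn(B, HW * HW, C)
    x = torch.randn(B, HW * HW, C)  # e.g. depth/thermal/event tokens
    compact = SpatialCompactModule(C)(rgb, x)
    heatmap = TemporalCompactModule(C)(compact, HW)
    print(compact.shape, heatmap.shape)  # (2, 256, 256) (2, 1, 16, 16)
```

The point of the sketch is the single fused feature space: one attention pass covers both intra- and inter-modality relations, avoiding the two dispersed branches the abstract criticizes.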
Related papers
- Minimal Convolutional RNNs Accelerate Spatiotemporal Learning [4.918567856499736]
We introduce MinConvLSTM and MinConvGRU, two spatiotemporal models that combine the spatial biases of convolutional recurrent networks with the training efficiency of minimal, parallel RNNs. Our models are structurally minimal and computationally efficient, with reduced parameter count and improved scalability.
arXiv Detail & Related papers (2025-08-05T16:28:43Z) - Lightweight RGB-D Salient Object Detection from a Speed-Accuracy Tradeoff Perspective [54.91271106816616]
Current RGB-D methods usually leverage large-scale backbones to improve accuracy but sacrifice efficiency. We propose a Speed-Accuracy Tradeoff Network (SATNet) for lightweight RGB-D SOD from three fundamental perspectives. Concerning depth quality, we introduce the Depth Anything Model to generate high-quality depth maps. For modality fusion, we propose a Decoupled Attention Module (DAM) to explore the consistency within and between modalities. For feature representation, we develop a Dual Information Representation Module (DIRM) with a bi-directional inverted framework.
arXiv Detail & Related papers (2025-05-07T19:37:20Z) - LightFC-X: Lightweight Convolutional Tracker for RGB-X Tracking [4.963745612929956]
LightFC-X is a family of lightweight convolutional RGB-X trackers for multimodal tracking. LightFC-X achieves state-of-the-art performance and the optimal balance between parameters, performance, and speed.
arXiv Detail & Related papers (2025-02-25T12:10:33Z) - Cross Space and Time: A Spatio-Temporal Unitized Model for Traffic Flow Forecasting [16.782154479264126]
Predicting spatio-temporal traffic flow presents challenges due to complex interactions between spatial and temporal factors.
Existing approaches address these dimensions in isolation, neglecting their critical interdependencies.
In this paper, we introduce the Adaptive Spatio-Temporal Unitized Cell (ASTUC), a unified framework designed to capture both spatial and temporal dependencies.
arXiv Detail & Related papers (2024-11-14T07:34:31Z) - MambaVT: Spatio-Temporal Contextual Modeling for robust RGB-T Tracking [51.28485682954006]
We propose a pure Mamba-based framework (MambaVT) to fully exploit spatio-temporal contextual modeling for robust visible-thermal tracking.
Specifically, we devise the long-range cross-frame integration component to globally adapt to target appearance variations.
Experiments show the significant potential of vision Mamba for RGB-T tracking, with MambaVT achieving state-of-the-art performance on four mainstream benchmarks.
arXiv Detail & Related papers (2024-08-15T02:29:00Z) - Transformer-based RGB-T Tracking with Channel and Spatial Feature Fusion [12.982885009492389]
We show how to improve the performance of a visual Transformer by using direct fusion of cross-modal channels and spatial features.
CSTNet achieves state-of-the-art performance on three public RGB-T tracking benchmarks.
arXiv Detail & Related papers (2024-05-06T05:58:49Z) - WCCNet: Wavelet-integrated CNN with Crossmodal Rearranging Fusion for Fast Multispectral Pedestrian Detection [16.43119521684829]
We propose a novel framework named WCCNet that is able to differentially extract rich features of different spectra with lower computational complexity.
Based on the well extracted features, we elaborately design the crossmodal rearranging fusion module (CMRF)
We conduct comprehensive evaluations on KAIST and FLIR benchmarks, in which WCCNet outperforms state-of-the-art methods with considerable computational efficiency and competitive accuracy.
arXiv Detail & Related papers (2023-08-02T09:35:21Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - Decoupling and Recoupling Spatiotemporal Representation for RGB-D-based Motion Recognition [62.46544616232238]
Previous motion recognition methods have achieved promising performance through the tightly coupled spatiotemporal representation.
We propose to decouple and recouple the spatiotemporal representation for RGB-D-based motion recognition.
arXiv Detail & Related papers (2021-12-16T18:59:47Z) - Spatial-Separated Curve Rendering Network for Efficient and High-Resolution Image Harmonization [59.19214040221055]
We propose a novel spatial-separated curve rendering network (S$^2$CRNet) for efficient and high-resolution image harmonization.
The proposed method reduces parameters by more than 90% compared with previous methods.
Our method works smoothly on higher-resolution images in real time, more than 10$\times$ faster than existing methods.
arXiv Detail & Related papers (2021-09-13T07:20:16Z) - Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking [85.333260415532]
We develop a novel late fusion method to infer the fusion weight maps of both RGB and thermal (T) modalities.
When the appearance cue is unreliable, we take motion cues into account to make the tracker robust.
Numerous results on three recent RGB-T tracking datasets show that the proposed tracker performs significantly better than other state-of-the-art algorithms.
arXiv Detail & Related papers (2020-07-04T08:11:33Z) - Disentangling and Unifying Graph Convolutions for Skeleton-Based Action Recognition [79.33539539956186]
We propose a simple method to disentangle multi-scale graph convolutions and a unified spatial-temporal graph convolutional operator named G3D.
By coupling these proposals, we develop a powerful feature extractor named MS-G3D based on which our model outperforms previous state-of-the-art methods on three large-scale datasets.
arXiv Detail & Related papers (2020-03-31T11:28:25Z) - STH: Spatio-Temporal Hybrid Convolution for Efficient Action Recognition [39.58542259261567]
We present a novel Spatio-Temporal Hybrid (STH) network which simultaneously encodes spatial and temporal video information with a small parameter cost.
Such a design enables efficient spatio-temporal modeling and maintains a small model scale.
STH enjoys performance superiority over 3D CNNs while maintaining an even smaller parameter cost than 2D CNNs.
arXiv Detail & Related papers (2020-03-18T04:46:30Z)