LightFC-X: Lightweight Convolutional Tracker for RGB-X Tracking
- URL: http://arxiv.org/abs/2502.18143v1
- Date: Tue, 25 Feb 2025 12:10:33 GMT
- Title: LightFC-X: Lightweight Convolutional Tracker for RGB-X Tracking
- Authors: Yunfeng Li, Bo Wang, Ye Li
- Abstract summary: LightFC-X is a family of lightweight convolutional RGB-X trackers for multimodal tracking. LightFC-X achieves state-of-the-art performance and the optimal balance between parameters, performance, and speed.
- Score: 4.963745612929956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite great progress in multimodal tracking, these trackers remain too heavy and expensive for resource-constrained devices. To alleviate this problem, we propose LightFC-X, a family of lightweight convolutional RGB-X trackers that explores a unified convolutional architecture for lightweight multimodal tracking. Our core idea is to achieve lightweight cross-modal modeling and joint refinement of the multimodal features and the spatiotemporal appearance features of the target. Specifically, we propose a novel efficient cross-attention module (ECAM) and a novel spatiotemporal template aggregation module (STAM). The ECAM achieves lightweight cross-modal interaction of the template-search-area integrated feature with only 0.08M parameters. The STAM enhances the model's utilization of temporal information through a module fine-tuning paradigm. Comprehensive experiments show that LightFC-X achieves state-of-the-art performance and the optimal balance between parameters, performance, and speed. For example, LightFC-T-ST outperforms CMD by 4.3% in SR and 5.7% in PR on the LasHeR benchmark, while achieving a 2.6x reduction in parameters and a 2.7x speedup. It runs in real time on a CPU at 22 fps. The code is available at https://github.com/LiYunfengLYF/LightFC-X.
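The abstract quotes the ECAM's parameter budget but not its internals. As a rough illustration of how a cross-modal attention over a joint template-search feature can stay parameter-light, here is a minimal PyTorch sketch; the class name, channel widths, reduced attention dimension, and residual fusion are all illustrative assumptions, not the paper's actual ECAM design:

```python
import torch
import torch.nn as nn

class EfficientCrossAttention(nn.Module):
    """Hypothetical lightweight cross-modal attention block.

    Fuses a joint template-search feature from the RGB branch with the
    corresponding feature from the X branch (e.g., thermal or depth)
    using 1x1 projections and single-head attention. Illustrative
    sketch only, not the paper's exact ECAM.
    """

    def __init__(self, channels: int = 96, reduced: int = 32):
        super().__init__()
        # 1x1 (pointwise) projections keep the parameter count small.
        self.q = nn.Conv2d(channels, reduced, kernel_size=1)
        self.k = nn.Conv2d(channels, reduced, kernel_size=1)
        self.v = nn.Conv2d(channels, channels, kernel_size=1)
        self.out = nn.Conv2d(channels, channels, kernel_size=1)
        self.scale = reduced ** -0.5

    def forward(self, rgb: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = rgb.shape
        q = self.q(rgb).flatten(2).transpose(1, 2)        # (B, HW, reduced)
        k = self.k(x).flatten(2)                          # (B, reduced, HW)
        v = self.v(x).flatten(2).transpose(1, 2)          # (B, HW, C)
        attn = torch.softmax(q @ k * self.scale, dim=-1)  # (B, HW, HW)
        fused = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return rgb + self.out(fused)                      # residual cross-modal fusion

if __name__ == "__main__":
    rgb_feat = torch.randn(1, 96, 16, 16)  # joint template-search feature, RGB branch
    x_feat = torch.randn(1, 96, 16, 16)    # same-shape feature from the X modality
    ecam = EfficientCrossAttention()
    print(ecam(rgb_feat, x_feat).shape)               # torch.Size([1, 96, 16, 16])
    print(sum(p.numel() for p in ecam.parameters()))  # ~25K with these widths
```

With these assumed widths the block stays in the tens of thousands of parameters, the same order as the 0.08M figure quoted above; the exact count depends entirely on the channel choices.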
Related papers
- Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video.
Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one.
We present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing.
arXiv Detail & Related papers (2025-03-09T06:16:49Z) - Two-stream Beats One-stream: Asymmetric Siamese Network for Efficient Visual Tracking [54.124445709376154]
We propose a novel asymmetric Siamese tracker named AsymTrack for efficient tracking.
Building on this architecture, we devise an efficient template modulation mechanism to inject crucial cues into the search features.
Experiments demonstrate that AsymTrack offers superior speed-precision trade-offs across different platforms.
arXiv Detail & Related papers (2025-03-01T14:44:54Z) - Light-T2M: A Lightweight and Fast Model for Text-to-motion Generation [30.05431858162078]
Text-to-motion (T2M) generation plays a significant role in various applications.
Current methods involve a large number of parameters and suffer from slow inference speeds.
We propose a lightweight and fast model named Light-T2M to reduce usage costs.
arXiv Detail & Related papers (2024-12-15T13:58:37Z) - Coarse Correspondences Boost Spatial-Temporal Reasoning in Multimodal Language Model [51.83436609094658]
We introduce Coarse Correspondences, a simple lightweight method that enhances MLLMs' spatial-temporal reasoning with 2D images as input.
Our method uses a lightweight tracking model to identify primary object correspondences between frames in a video or across different image viewpoints.
We demonstrate that this simple training-free approach consistently brings substantial gains to GPT4-V/O on four benchmarks.
arXiv Detail & Related papers (2024-08-01T17:57:12Z) - Mamba-FETrack: Frame-Event Tracking via State Space Model [14.610806117193116]
This paper proposes a novel RGB-Event tracking framework, Mamba-FETrack, based on the State Space Model (SSM).
Specifically, we adopt two modality-specific Mamba backbone networks to extract the features of RGB frames and Event streams.
Extensive experiments on FELT and FE108 datasets fully validated the efficiency and effectiveness of our proposed tracker.
arXiv Detail & Related papers (2024-04-28T13:12:49Z) - Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure proper reasoning routes for various inputs, making better use of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Lightweight Full-Convolutional Siamese Tracker [4.903759699116597]
This paper proposes a lightweight full-convolutional Siamese tracker called LightFC.
LightFC employs a novel efficient cross-correlation module and a novel efficient rep-center head.
Experiments show that LightFC achieves the optimal balance between performance, parameters, FLOPs, and FPS.
arXiv Detail & Related papers (2023-10-09T04:07:35Z) - Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking [69.89887818921825]
HiT is a new family of efficient tracking models that can run at high speed on different devices.
HiT achieves 64.6% AUC on the LaSOT benchmark, surpassing all previous efficient trackers.
arXiv Detail & Related papers (2023-08-14T02:51:34Z) - Parameter-efficient Tuning of Large-scale Multimodal Foundation Model [68.24510810095802]
We propose a graceful prompt framework for cross-modal transfer (Aurora) to overcome these challenges.
Considering the redundancy in existing architectures, we first use mode approximation to generate only 0.1M trainable parameters for multimodal prompt tuning.
A thorough evaluation on six cross-modal benchmarks shows that it not only outperforms the state of the art but even surpasses the full fine-tuning approach.
arXiv Detail & Related papers (2023-05-15T06:40:56Z) - FEAR: Fast, Efficient, Accurate and Robust Visual Tracker [2.544539499281093]
We present FEAR, a novel, fast, efficient, accurate, and robust Siamese visual tracker.
FEAR-XS tracker is 2.4x smaller and 4.3x faster than LightTrack [62] with superior accuracy.
arXiv Detail & Related papers (2021-12-15T08:28:55Z) - Jointly Modeling Motion and Appearance Cues for Robust RGB-T Tracking [85.333260415532]
We develop a novel late fusion method to infer the fusion weight maps of both RGB and thermal (T) modalities.
When the appearance cue is unreliable, we take motion cues into account to make the tracker robust.
Extensive results on three recent RGB-T tracking datasets show that the proposed tracker performs significantly better than other state-of-the-art algorithms (a minimal sketch of such weight-map late fusion follows this list).
arXiv Detail & Related papers (2020-07-04T08:11:33Z)
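To make the late-fusion idea in the last entry above concrete, here is a minimal PyTorch sketch that predicts per-pixel weight maps for RGB and thermal response maps and blends them. The module name, head layout, and shapes are illustrative assumptions, not the cited paper's actual design:

```python
import torch
import torch.nn as nn

class LateFusionWeights(nn.Module):
    """Hypothetical late-fusion head for RGB-T tracking.

    Predicts per-pixel weight maps from the stacked single-channel
    response maps of the RGB and thermal branches, then blends the
    responses. Illustrative sketch only.
    """

    def __init__(self):
        super().__init__()
        # Small conv head mapping the 2 stacked responses to 2 weight maps.
        self.head = nn.Sequential(
            nn.Conv2d(2, 16, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(16, 2, kernel_size=3, padding=1),
        )

    def forward(self, resp_rgb: torch.Tensor, resp_t: torch.Tensor) -> torch.Tensor:
        stacked = torch.cat([resp_rgb, resp_t], dim=1)     # (B, 2, H, W)
        w = torch.softmax(self.head(stacked), dim=1)       # weights sum to 1 per pixel
        return w[:, 0:1] * resp_rgb + w[:, 1:2] * resp_t   # weighted late fusion

if __name__ == "__main__":
    r = torch.randn(1, 1, 25, 25)  # RGB response map
    t = torch.randn(1, 1, 25, 25)  # thermal response map
    print(LateFusionWeights()(r, t).shape)  # torch.Size([1, 1, 25, 25])
```

Downweighting the unreliable modality per pixel is what lets motion or thermal cues take over when the RGB appearance cue degrades.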