Breaking Shallow Limits: Task-Driven Pixel Fusion for Gap-free RGBT Tracking
- URL: http://arxiv.org/abs/2503.11247v1
- Date: Fri, 14 Mar 2025 09:56:13 GMT
- Title: Breaking Shallow Limits: Task-Driven Pixel Fusion for Gap-free RGBT Tracking
- Authors: Andong Lu, Yuanzhi Guo, Wanyu Wang, Chenglong Li, Jin Tang, Bin Luo
- Abstract summary: Current RGBT tracking methods often overlook the impact of fusion location on mitigating the modality gap. We propose a novel Task-driven Pixel-level Fusion network, named TPF. In particular, we design a lightweight Pixel-level Fusion Adapter (PFA) that exploits Mamba's linear complexity to ensure real-time, low-latency RGBT tracking.
- Score: 21.18680957184296
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Current RGBT tracking methods often overlook the impact of fusion location on mitigating the modality gap, which is a key factor in effective tracking. Our analysis reveals that shallower fusion yields a smaller distribution gap. However, the limited discriminative power of shallow networks makes it hard to distinguish task-relevant information from noise, limiting the potential of pixel-level fusion. To break these shallow limits, we propose a novel Task-driven Pixel-level Fusion network, named TPF, which unveils the power of pixel-level fusion in RGBT tracking through a progressive learning framework. In particular, we design a lightweight Pixel-level Fusion Adapter (PFA) that exploits Mamba's linear complexity to ensure real-time, low-latency RGBT tracking. To enhance the fusion capabilities of the PFA, our task-driven progressive learning framework first uses adaptive multi-expert distillation to inherit fusion knowledge from state-of-the-art image fusion models, establishing a robust initialization, and then employs a decoupled representation learning scheme to achieve task-relevant information fusion. Moreover, to overcome appearance variations between the initial template and search frames, we present a nearest-neighbor dynamic template updating scheme, which selects the most reliable frame closest to the current search frame as the dynamic template. Extensive experiments demonstrate that TPF significantly outperforms most existing advanced trackers on four public RGBT tracking datasets. The code will be released upon acceptance.
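The abstract does not detail how the nearest-neighbor dynamic template is chosen, so the following is only a minimal sketch of the idea as stated: among confidently tracked past frames, pick the one whose embedding is closest to the current search frame. The cosine metric, the threshold, and all names are assumptions, not taken from the TPF paper.

```python
import torch
import torch.nn.functional as F

def select_dynamic_template(search_feat, candidate_feats, reliabilities, tau=0.6):
    """Pick the most reliable past frame whose embedding is closest to the
    current search frame.

    search_feat:     (D,)   embedding of the current search frame
    candidate_feats: (N, D) embeddings of previously tracked frames
    reliabilities:   (N,)   per-frame confidence (e.g., tracker score) -- assumed
    tau:             minimum confidence for a frame to count -- assumed
    """
    sims = F.cosine_similarity(candidate_feats, search_feat.unsqueeze(0), dim=1)
    reliable = reliabilities >= tau            # keep only confident frames
    if not reliable.any():
        return None                            # caller falls back to the initial template
    sims = sims.masked_fill(~reliable, float("-inf"))
    return int(torch.argmax(sims))             # index of the nearest reliable frame
```

In the paper's terms, the selected frame would then serve as the dynamic template alongside the initial one; how reliability is scored is not specified in the abstract.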
Related papers
- FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability.
We propose Self-supervised Transfer (PST) and a Frequency-Decoupled Fusion module (FreDF).
PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models.
FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
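One plausible reading of the frequency decoupling FreDF describes is a low-pass/high-pass split; the sketch below uses a fixed Gaussian blur purely for illustration, whereas the actual module is presumably learned.

```python
import torch
import torch.nn.functional as F

def frequency_split(img, kernel_size=9, sigma=2.0):
    """Split (B, C, H, W) images into low-frequency structure and
    high-frequency edges. The fixed Gaussian stands in for whatever
    learned decomposition FreDF really uses; kernel size and sigma
    are arbitrary."""
    coords = torch.arange(kernel_size) - kernel_size // 2
    g = torch.exp(-coords.float() ** 2 / (2 * sigma ** 2))
    g = g / g.sum()                             # normalized 1-D Gaussian
    kernel = (g[:, None] * g[None, :]).repeat(img.shape[1], 1, 1, 1).to(img)
    low = F.conv2d(img, kernel, padding=kernel_size // 2, groups=img.shape[1])
    high = img - low                            # residual carries edge detail
    return low, high
```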
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - Multi-Scale Cross-Fusion and Edge-Supervision Network for Image Splicing Localization [13.776343759641343]
We propose a multi-scale cross-fusion and edge-supervision network for ISL. Our framework consists of three key steps: multi-scale feature cross-fusion, edge mask prediction, and edge-supervision localization. Our proposed method is superior to state-of-the-art schemes.
arXiv Detail & Related papers (2024-12-17T03:10:04Z) - Unsupervised Hyperspectral and Multispectral Image Fusion via Self-Supervised Modality Decoupling [16.923899489263945]
Hyperspectral and Multispectral Image Fusion (HMIF) aims to fuse low-resolution hyperspectral images (LR-HSIs) and high-resolution multispectral images (HR-MSIs).
HMIF methods typically fuse the two modalities directly, without effective supervision.
We propose an end-to-end self-supervised Modality-Decoupled Spatial-Spectral Fusion (MossFuse) framework.
arXiv Detail & Related papers (2024-12-06T06:22:43Z) - GraFT: Gradual Fusion Transformer for Multimodal Re-Identification [0.8999666725996975]
We introduce the Gradual Fusion Transformer (GraFT) for multimodal ReID.
GraFT employs learnable fusion tokens that guide self-attention across encoders, adeptly capturing both modality-specific and object-specific features.
We demonstrate these enhancements through extensive ablation studies and show that GraFT consistently surpasses established multimodal ReID benchmarks.
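As a rough illustration of the learnable-fusion-token idea (not GraFT's actual architecture), one can prepend shared tokens to the concatenated modality sequences so that self-attention pools cross-modal evidence into them; all hyperparameters below are invented.

```python
import torch
import torch.nn as nn

class FusionTokenEncoder(nn.Module):
    """Toy fusion-token encoder: learnable tokens attend jointly over
    RGB and auxiliary-modality token sequences."""
    def __init__(self, dim=256, n_fusion_tokens=4, n_heads=8, n_layers=2):
        super().__init__()
        self.fusion_tokens = nn.Parameter(torch.randn(1, n_fusion_tokens, dim))
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, rgb_tokens, aux_tokens):  # each: (B, N, dim)
        b = rgb_tokens.size(0)
        tokens = torch.cat(
            [self.fusion_tokens.expand(b, -1, -1), rgb_tokens, aux_tokens], dim=1)
        out = self.encoder(tokens)
        return out[:, : self.fusion_tokens.size(1)]  # fused representation
```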
arXiv Detail & Related papers (2023-10-25T00:15:40Z) - GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization [52.55628139825667]
Federated Learning (FL) has emerged as a promising distributed machine learning framework to preserve clients' privacy.
Recent studies find that an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge.
We propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers.
arXiv Detail & Related papers (2023-08-09T04:34:21Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - A Task-guided, Implicitly-searched and Meta-initialized Deep Model for Image Fusion [69.10255211811007]
We present a Task-guided, Implicitly-searched and Meta-initialized (TIM) deep model to address the image fusion problem in a challenging real-world scenario.
Specifically, we propose a constrained strategy to incorporate information from downstream tasks to guide the unsupervised learning process of image fusion.
Within this framework, we then design an implicit search scheme to automatically discover compact architectures for our fusion model with high efficiency.
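The "constrained strategy" suggests a joint objective in which the downstream task's gradient steers the unsupervised fusion; a minimal sketch under that assumption (the fidelity term and weighting are placeholders, not TIM's formulation):

```python
import torch

def task_guided_fusion_loss(fused, src_a, src_b, task_loss_fn, alpha=0.5):
    """Unsupervised fusion fidelity plus a downstream task loss.
    The max-intensity target and alpha are illustrative choices."""
    fidelity = torch.mean((fused - torch.maximum(src_a, src_b)) ** 2)
    return fidelity + alpha * task_loss_fn(fused)
```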
arXiv Detail & Related papers (2023-05-25T08:54:08Z) - Searching a Compact Architecture for Robust Multi-Exposure Image Fusion [55.37210629454589]
Two major stumbling blocks hinder development: pixel misalignment and inefficient inference.
This study introduces an architecture search-based paradigm incorporating self-alignment and detail repletion modules for robust multi-exposure image fusion.
The proposed method outperforms various competitive schemes, achieving a noteworthy 3.19% improvement in PSNR for general scenarios and an impressive 23.5% enhancement in misaligned scenarios.
arXiv Detail & Related papers (2023-05-20T17:01:52Z) - CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion [138.40422469153145]
We propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network.
We show that CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion.
arXiv Detail & Related papers (2022-11-26T02:40:28Z) - Exploring Fusion Strategies for Accurate RGBT Visual Object Tracking [1.015785232738621]
We address the problem of multi-modal object tracking in video.
We explore various options of fusing the complementary information conveyed by the visible (RGB) and thermal infrared (TIR) modalities.
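The fusion locations usually compared in such studies are pixel level (merge inputs before the backbone) and feature level (merge backbone outputs); a schematic sketch of both, with all shapes and modules invented for illustration:

```python
import torch
import torch.nn as nn

class PixelFusion(nn.Module):
    """Pixel-level: merge RGB and TIR before any backbone sees them.
    Assumes TIR replicated to 3 channels, a common convention."""
    def __init__(self):
        super().__init__()
        self.mix = nn.Conv2d(6, 3, kernel_size=1)

    def forward(self, rgb, tir):
        return self.mix(torch.cat([rgb, tir], dim=1))

class FeatureFusion(nn.Module):
    """Feature-level: separate backbones per modality, merge deep features."""
    def __init__(self, backbone_rgb, backbone_tir, dim=256):
        super().__init__()
        self.backbone_rgb, self.backbone_tir = backbone_rgb, backbone_tir
        self.mix = nn.Conv2d(2 * dim, dim, kernel_size=1)

    def forward(self, rgb, tir):
        feats = torch.cat([self.backbone_rgb(rgb), self.backbone_tir(tir)], dim=1)
        return self.mix(feats)
```

TPF's analysis above argues that the pixel-level variant minimizes the modality distribution gap, provided a task-driven scheme compensates for the weak discriminative power at that depth.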
arXiv Detail & Related papers (2022-01-21T12:37:43Z)