Decoupling Amplitude and Phase Attention in Frequency Domain for RGB-Event based Visual Object Tracking
- URL: http://arxiv.org/abs/2601.01022v1
- Date: Sat, 03 Jan 2026 01:10:17 GMT
- Title: Decoupling Amplitude and Phase Attention in Frequency Domain for RGB-Event based Visual Object Tracking
- Authors: Shiao Wang, Xiao Wang, Haonan Zhao, Jiarui Xu, Bo Jiang, Lin Zhu, Xin Zhao, Yonghong Tian, Jin Tang,
- Abstract summary: Existing RGB-Event tracking approaches fail to fully exploit the unique advantages of event cameras.<n>We propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality.<n>Experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method.
- Score: 51.31378940976401
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing RGB-Event visual object tracking approaches primarily rely on conventional feature-level fusion, failing to fully exploit the unique advantages of event cameras. In particular, the high dynamic range and motion-sensitive nature of event cameras are often overlooked, while low-information regions are processed uniformly, leading to unnecessary computational overhead for the backbone network. To address these issues, we propose a novel tracking framework that performs early fusion in the frequency domain, enabling effective aggregation of high-frequency information from the event modality. Specifically, RGB and event modalities are transformed from the spatial domain to the frequency domain via the Fast Fourier Transform, with their amplitude and phase components decoupled. High-frequency event information is selectively fused into RGB modality through amplitude and phase attention, enhancing feature representation while substantially reducing backbone computation. In addition, a motion-guided spatial sparsification module leverages the motion-sensitive nature of event cameras to capture the relationship between target motion cues and spatial probability distribution, filtering out low-information regions and enhancing target-relevant features. Finally, a sparse set of target-relevant features is fed into the backbone network for learning, and the tracking head predicts the final target position. Extensive experiments on three widely used RGB-Event tracking benchmark datasets, including FE108, FELT, and COESOT, demonstrate the high performance and efficiency of our method. The source code of this paper will be released on https://github.com/Event-AHU/OpenEvTracking
Related papers
- Adversarial Attack for RGB-Event based Visual Object Tracking [39.35874495297647]
We propose a cross-modal adversarial attack algorithm for RGB-Event visual tracking.<n>We evaluate the proposed approach against attacks on three widely used RGB-Event Tracking datasets.
arXiv Detail & Related papers (2025-04-19T23:35:19Z) - FlexEvent: Towards Flexible Event-Frame Object Detection at Varying Operational Frequencies [59.10607296658509]
Event cameras offer unparalleled advantages for real-time perception in dynamic environments.<n>Event detectors are limited by fixed-frequency paradigms and fail to fully exploit the high-temporal resolution and adaptability of event data.<n>We propose FlexEvent, a novel framework that enables detection at varying frequencies.
arXiv Detail & Related papers (2024-12-09T17:57:14Z) - Frequency-Adaptive Low-Latency Object Detection Using Events and Frames [23.786369609995013]
Fusing Events and RGB images for object detection leverages the robustness of Event cameras in adverse environments.<n>Two critical mismatches: low-latency Events textitvs.high-latency RGB frames, and temporally sparse labels in training textitvs.continuous flow in inference.<n>We propose the textbfFrequency-textbfAdaptive Low-Latency textbfObject textbfDetector (FAOD)
arXiv Detail & Related papers (2024-12-05T13:23:06Z) - Dynamic Subframe Splitting and Spatio-Temporal Motion Entangled Sparse Attention for RGB-E Tracking [32.86991031493605]
Event-based bionic camera captures dynamic scenes with high temporal resolution and high dynamic range.
We propose a dynamic event subframe splitting strategy to split the event stream into more fine-grained event clusters.
Based on this, we design an event-based sparse attention mechanism to enhance the interaction of event features in temporal and spatial dimensions.
arXiv Detail & Related papers (2024-09-26T06:12:08Z) - TENet: Targetness Entanglement Incorporating with Multi-Scale Pooling and Mutually-Guided Fusion for RGB-E Object Tracking [30.89375068036783]
Existing approaches perform event feature extraction for RGB-E tracking using traditional appearance models.
We propose an Event backbone (Pooler) to obtain a high-quality feature representation that is cognisant of the intrinsic characteristics of the event data.
Our method significantly outperforms state-of-the-art trackers on two widely used RGB-E tracking datasets.
arXiv Detail & Related papers (2024-05-08T12:19:08Z) - Implicit Event-RGBD Neural SLAM [54.74363487009845]
Implicit neural SLAM has achieved remarkable progress recently.
Existing methods face significant challenges in non-ideal scenarios.
We propose EN-SLAM, the first event-RGBD implicit neural SLAM framework.
arXiv Detail & Related papers (2023-11-18T08:48:58Z) - SpikeMOT: Event-based Multi-Object Tracking with Sparse Motion Features [52.213656737672935]
SpikeMOT is an event-based multi-object tracker.
SpikeMOT uses spiking neural networks to extract sparsetemporal features from event streams associated with objects.
arXiv Detail & Related papers (2023-09-29T05:13:43Z) - Frequency Perception Network for Camouflaged Object Detection [51.26386921922031]
We propose a novel learnable and separable frequency perception mechanism driven by the semantic hierarchy in the frequency domain.<n>Our entire network adopts a two-stage model, including a frequency-guided coarse localization stage and a detail-preserving fine localization stage.<n>Compared with the currently existing models, our proposed method achieves competitive performance in three popular benchmark datasets.
arXiv Detail & Related papers (2023-08-17T11:30:46Z) - Dual Memory Aggregation Network for Event-Based Object Detection with
Learnable Representation [79.02808071245634]
Event-based cameras are bio-inspired sensors that capture brightness change of every pixel in an asynchronous manner.
Event streams are divided into grids in the x-y-t coordinates for both positive and negative polarity, producing a set of pillars as 3D tensor representation.
Long memory is encoded in the hidden state of adaptive convLSTMs while short memory is modeled by computing spatial-temporal correlation between event pillars.
arXiv Detail & Related papers (2023-03-17T12:12:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.