TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events
- URL: http://arxiv.org/abs/2603.04989v1
- Date: Thu, 05 Mar 2026 09:32:24 GMT
- Title: TAPFormer: Robust Arbitrary Point Tracking via Transient Asynchronous Fusion of Frames and Events
- Authors: Jiaxiong Liu, Zhen Tan, Jinpu Zhang, Yi Zhou, Hui Shen, Xieyuanli Chen, Dewen Hu
- Abstract summary: We introduce TAPFormer, a framework that performs temporally consistent asynchronous fusion of frames and events for point tracking. The key innovation is a Transient Asynchronous Fusion mechanism, which explicitly models the temporal evolution between discrete frames. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold.
- Score: 37.273066799679135
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Tracking any point (TAP) is a fundamental yet challenging task in computer vision, requiring high precision and long-term motion reasoning. Recent attempts to combine RGB frames and event streams have shown promise, yet they typically rely on synchronous or non-adaptive fusion, leading to temporal misalignment and severe degradation when one modality fails. We introduce TAPFormer, a transformer-based framework that performs asynchronous, temporally consistent fusion of frames and events for robust and high-frequency arbitrary point tracking. Our key innovation is a Transient Asynchronous Fusion (TAF) mechanism, which explicitly models the temporal evolution between discrete frames through continuous event updates, bridging the gap between low-rate frames and high-rate events. In addition, a Cross-modal Locally Weighted Fusion (CLWF) module adaptively adjusts spatial attention according to modality reliability, yielding stable and discriminative features even under blur or low light. To evaluate our approach under realistic conditions, we construct a new real-world frame-event TAP dataset covering diverse illumination and motion conditions. Our method outperforms existing point trackers, achieving a 28.2% improvement in average pixel error within threshold. Moreover, on standard point tracking benchmarks, our tracker consistently achieves the best performance. Project website: tapformer.github.io
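The abstract names two mechanisms, TAF and CLWF, without giving implementation details. As a rough illustration only, the sketch below shows one plausible reading: a recurrent state carried between frames by high-rate event features, and a learned reliability weight over the two modalities. All module names, shapes, and update rules here are assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class TransientAsynchronousFusion(nn.Module):
    """Hypothetical TAF: propagate a point-feature state between frames
    using high-rate event features (here, a GRU over event slices)."""
    def __init__(self, dim=128):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)

    def forward(self, state, event_slices):
        # event_slices: (num_slices, batch, dim) features from event chunks
        # that arrive between two consecutive RGB frames.
        for feat in event_slices:
            state = self.gru(feat, state)   # continuous update per slice
        return state

class CrossModalLocallyWeightedFusion(nn.Module):
    """Hypothetical CLWF: weight each modality by a learned reliability
    score before fusing, so a blurred frame or sparse events get down-weighted."""
    def __init__(self, dim=128):
        super().__init__()
        self.reliability = nn.Sequential(nn.Linear(2 * dim, 2), nn.Softmax(dim=-1))

    def forward(self, frame_feat, event_feat):
        w = self.reliability(torch.cat([frame_feat, event_feat], dim=-1))
        return w[..., :1] * frame_feat + w[..., 1:] * event_feat

# Usage: evolve the track state with events, then fuse at each frame time.
taf, clwf = TransientAsynchronousFusion(), CrossModalLocallyWeightedFusion()
state = torch.zeros(4, 128)        # 4 tracked points
events = torch.randn(10, 4, 128)   # 10 event slices between two frames
frame = torch.randn(4, 128)        # frame feature at the next frame time
fused = clwf(frame, taf(state, events))
```

The point of the sketch is the control flow: the state evolves at event rate, while fusion with frame features happens only at frame times.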
Related papers
- SwiTrack: Tri-State Switch for Cross-Modal Object Tracking [74.15663758681849]
Cross-modal object tracking (CMOT) is an emerging task that maintains target consistency while the video stream switches between different modalities. We propose SwiTrack, a novel state-switching framework that redefines CMOT through the deployment of three specialized streams.
arXiv Detail & Related papers (2025-11-20T10:52:54Z)
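The summary only says that SwiTrack routes tracking through three specialized streams. A purely illustrative toy of such state switching, with the three streams and the gating rule invented here as assumptions:

```python
import torch
import torch.nn as nn

class TriStateSwitch(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.streams = nn.ModuleList([nn.Linear(dim, dim) for _ in range(3)])
        self.gate = nn.Linear(dim, 3)  # scores for RGB / other-modality / transition

    def forward(self, feat):
        idx = self.gate(feat).argmax(dim=-1)         # pick a state per sample
        out = torch.stack([s(feat) for s in self.streams], dim=1)
        return out[torch.arange(feat.size(0)), idx]  # route through that stream

x = torch.randn(2, 64)
print(TriStateSwitch()(x).shape)  # torch.Size([2, 64])
```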
- CETUS: Causal Event-Driven Temporal Modeling With Unified Variable-Rate Scheduling [18.82030002020162]
Event cameras capture asynchronous pixel-level brightness changes with microsecond temporal resolution. Existing methods often convert event streams into intermediate representations such as frames, voxel grids, or point clouds. We propose the Variable-Rate Spatial Event Mamba, a novel architecture that directly processes raw event streams without intermediate representations.
arXiv Detail & Related papers (2025-09-17T07:55:37Z)
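A minimal sketch of the contrast this summary draws: consuming raw (t, x, y, polarity) event tuples sequentially instead of first binning them into frames or voxel grids. The recurrent cell here merely stands in for the paper's Mamba blocks, an assumption on our part:

```python
import torch
import torch.nn as nn

events = torch.rand(1000, 4)   # toy stream: (t, x, y, polarity) per event
embed = nn.Linear(4, 32)
cell = nn.GRUCell(32, 32)      # placeholder for a Mamba/SSM block

state = torch.zeros(1, 32)
for ev in embed(events):       # event-by-event, no intermediate grid
    state = cell(ev.unsqueeze(0), state)
```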
- What You Have is What You Track: Adaptive and Robust Multimodal Tracking [72.92244578461869]
We present the first comprehensive study on tracker performance with temporally incomplete multimodal data. Our model achieves SOTA performance across 9 benchmarks, excelling in both conventional complete and missing-modality settings.
arXiv Detail & Related papers (2025-07-08T11:40:21Z)
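A toy illustration of the missing-modality setting described above: fuse whatever modalities are present at each step and renormalize the weights. This is inferred from the one-sentence summary only, not the paper's architecture:

```python
import torch

def fuse_available(feats, present, weights):
    # feats: (num_modalities, dim); present: bool mask; weights: (num_modalities,)
    w = weights * present.float()
    w = w / w.sum().clamp(min=1e-6)  # renormalize over available modalities
    return (w.unsqueeze(-1) * feats).sum(dim=0)

feats = torch.randn(2, 64)           # e.g. [rgb_feat, event_feat]
print(fuse_available(feats, torch.tensor([True, False]), torch.ones(2)).shape)
```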
- Adaptive Deadline and Batch Layered Synchronized Federated Learning [66.93447103966439]
Federated learning (FL) enables collaborative model training across distributed edge devices while preserving data privacy, and typically operates in a round-based synchronous manner. We propose ADEL-FL, a novel framework that jointly optimizes per-round deadlines and user-specific batch sizes for layer-wise aggregation.
arXiv Detail & Related papers (2025-05-29T19:59:18Z)
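A back-of-envelope sketch of the knob this summary describes: in a synchronous FL round, a user's batch size is capped by what it can compute before the round deadline. The timing values and the linear cost model are assumptions for illustration, not ADEL-FL's actual optimization:

```python
def max_batch_size(deadline_s, comm_s, time_per_sample_s, steps):
    """Largest batch a user can run `steps` local steps on within the deadline."""
    compute_budget = deadline_s - comm_s
    return max(int(compute_budget / (steps * time_per_sample_s)), 1)

print(max_batch_size(deadline_s=30.0, comm_s=5.0, time_per_sample_s=0.02, steps=50))  # 25
```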
- Fully Spiking Neural Networks for Unified Frame-Event Object Tracking [17.626181371045575]
We propose SpikeFET, the first fully spiking frame-event tracking framework. The network achieves synergistic integration of convolutional local feature extraction and Transformer-based global modeling within the spiking paradigm. We show that the proposed framework achieves superior tracking accuracy over existing methods while significantly reducing power consumption.
arXiv Detail & Related papers (2025-05-27T07:53:50Z)
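For readers unfamiliar with spiking networks: the basic unit behind fully spiking models like the one summarized above is the leaky integrate-and-fire (LIF) neuron, sketched minimally below. SpikeFET's actual architecture (spiking convolutions plus Transformer blocks) is not reproduced here:

```python
import torch

def lif_step(v, x, tau=2.0, threshold=1.0):
    v = v + (x - v) / tau              # leaky integration of the input current
    spike = (v >= threshold).float()   # emit a binary spike at threshold
    return v * (1.0 - spike), spike    # hard reset after spiking

v = torch.zeros(8)
for x in torch.rand(20, 8):            # 20 timesteps of input current
    v, s = lif_step(v, x)
```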
- Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation [47.75348821902489]
Current optical flow methods exploit the stable appearance of frame (or RGB) data to establish robust correspondences across time. Event cameras, on the other hand, provide high-temporal-resolution motion cues and excel in challenging scenarios. This study introduces a novel approach that uses a spatially dense modality to guide the aggregation of the temporally dense event modality.
arXiv Detail & Related papers (2025-01-01T13:40:09Z)
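One way to read "spatially dense guides temporally dense" is cross-attention with frame tokens as queries and event-time tokens as keys/values, sketched below. The layer sizes and single-head attention are illustrative assumptions, not the paper's design:

```python
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=1, batch_first=True)
frame_feat = torch.randn(1, 100, 64)   # 100 spatial tokens from one RGB frame
event_feats = torch.randn(1, 500, 64)  # 500 event tokens across the time window
aggregated, _ = attn(frame_feat, event_feats, event_feats)  # frame-guided pooling
```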
- MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking [58.719310295870024]
This paper presents an event-based framework for tracking any point. To resolve ambiguities caused by event sparsity, a motion-guidance module incorporates kinematic vectors into the local matching process. The method improves the $Survival_{50}$ metric by 17.9% over an event-only tracking-any-point baseline.
arXiv Detail & Related papers (2024-12-02T09:13:29Z)
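A toy version of motion-guided local matching as the summary describes it: offset the search center by a kinematic (velocity) estimate before correlating, so sparse events still get a well-placed window. The window size and the matching rule are assumptions, not MATE's module:

```python
import torch

def motion_guided_match(feat_map, query, pos, velocity, radius=4):
    # feat_map: (C, H, W) event features; query: (C,) point descriptor;
    # pos, velocity: (2,) in (x, y) pixels.
    cx, cy = (pos + velocity).round().long().tolist()  # kinematic prediction
    C, H, W = feat_map.shape
    x0, x1 = max(cx - radius, 0), min(cx + radius + 1, W)
    y0, y1 = max(cy - radius, 0), min(cy + radius + 1, H)
    scores = (feat_map[:, y0:y1, x0:x1] * query[:, None, None]).sum(0)
    iy, ix = divmod(int(scores.argmax()), scores.size(1))
    return torch.tensor([x0 + ix, y0 + iy])            # refined point location

feat = torch.randn(32, 64, 64)
print(motion_guided_match(feat, torch.randn(32),
                          torch.tensor([30.0, 30.0]), torch.tensor([2.0, -1.0])))
```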
- Tracking Any Point with Frame-Event Fusion Network at High Frame Rate [16.749590397918574]
We propose an image-event fusion point tracker, FE-TAP.
It combines the contextual information from image frames with the high temporal resolution of events.
FE-TAP achieves high-frame-rate, robust point tracking under various challenging conditions.
arXiv Detail & Related papers (2024-09-18T13:07:19Z)
- TimeLens: Event-based Video Frame Interpolation [54.28139783383213]
We introduce Time Lens, a novel method that leverages the advantages of both synthesis-based and flow-based approaches.
We show an improvement of up to 5.21 dB in PSNR over state-of-the-art frame-based and event-based methods.
arXiv Detail & Related papers (2021-06-14T10:33:47Z)
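A toy illustration of combining the two approach families the Time Lens summary mentions: blend a synthesis-branch result with a flow/warping-branch result per pixel. The learned blending mask is an assumption standing in for the paper's network:

```python
import torch

def blend(synthesized, warped, mask):
    # mask in [0, 1]: 1 -> trust synthesis (e.g. at occlusions), 0 -> trust warping
    return mask * synthesized + (1.0 - mask) * warped

synth = torch.rand(3, 64, 64)  # result of a synthesis branch
warp = torch.rand(3, 64, 64)   # result of a flow/warping branch
out = blend(synth, warp, torch.rand(1, 64, 64))
```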