SOTFormer: A Minimal Transformer for Unified Object Tracking and Trajectory Prediction
- URL: http://arxiv.org/abs/2511.11824v1
- Date: Fri, 14 Nov 2025 19:25:05 GMT
- Title: SOTFormer: A Minimal Transformer for Unified Object Tracking and Trajectory Prediction
- Authors: Zhongping Dong, Pengyang Yu, Shuangjian Li, Liming Chen, Mohand Tahar Kechadi
- Abstract summary: We introduce SOTFormer, a minimal constant-memory temporal transformer. It unifies object detection, tracking, and short-horizon trajectory prediction within a single end-to-end framework. On the Mini-LaSOT (20%) benchmark, SOTFormer attains 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM).
- Score: 3.08657139423562
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate single-object tracking and short-term motion forecasting remain challenging under occlusion, scale variation, and temporal drift, which disrupt the temporal coherence required for real-time perception. We introduce SOTFormer, a minimal constant-memory temporal transformer that unifies object detection, tracking, and short-horizon trajectory prediction within a single end-to-end framework. Unlike prior models with recurrent or stacked temporal encoders, SOTFormer achieves stable identity propagation through a ground-truth-primed memory and a burn-in anchor loss that explicitly stabilizes initialization. A single lightweight temporal-attention layer refines embeddings across frames, enabling real-time inference with fixed GPU memory. On the Mini-LaSOT (20%) benchmark, SOTFormer attains 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM), outperforming transformer baselines such as TrackFormer and MOTRv2 under fast motion, scale change, and occlusion.
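As a rough illustration of the design described in the abstract, here is a minimal sketch (not the authors' code) of a single temporal-attention layer refining the current frame's target embedding against a fixed-size, ground-truth-primed memory, plus an up-weighted burn-in loss. The names `TemporalRefiner` and `burn_in_anchor_loss`, the FIFO memory update, and the specific weighting are assumptions; the abstract states these components only qualitatively.

```python
# Sketch of the abstract's ideas under stated assumptions: one temporal-attention
# layer, a fixed-size ground-truth-primed memory (constant GPU footprint), and an
# up-weighted "burn-in" loss. Names and the FIFO update rule are assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalRefiner(nn.Module):
    def __init__(self, dim: int = 256, mem_size: int = 8, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Fixed-size memory buffer: footprint is constant in video length.
        self.register_buffer("memory", torch.zeros(1, mem_size, dim))

    @torch.no_grad()
    def prime(self, gt_embedding: torch.Tensor) -> None:
        # "Ground-truth-primed memory": seed every slot with the first-frame
        # target embedding before tracking begins (one plausible reading).
        self.memory.copy_(gt_embedding.view(1, 1, -1).expand_as(self.memory))

    def forward(self, frame_embedding: torch.Tensor) -> torch.Tensor:
        # frame_embedding: (1, 1, dim) token for the current frame's target.
        refined, _ = self.attn(frame_embedding, self.memory, self.memory)
        refined = self.norm(frame_embedding + refined)
        # FIFO update: drop the oldest slot, append the new embedding, so the
        # memory never grows (assumed update rule).
        self.memory = torch.cat([self.memory[:, 1:], refined.detach()], dim=1)
        return refined


def burn_in_anchor_loss(pred_box, gt_box, frame_idx, burn_in=5, weight=2.0):
    # Assumed form of the "burn-in anchor loss": up-weight the first few
    # frames so identity is anchored to the ground truth at initialization.
    w = weight if frame_idx < burn_in else 1.0
    return w * F.l1_loss(pred_box, gt_box)
```

In this reading, VRAM stays fixed because the memory tensor has constant shape (1, mem_size, dim) regardless of sequence length; only the FIFO contents change from frame to frame.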
Related papers
- Event-based Visual Deformation Measurement [76.25283405575108]
Visual Deformation Measurement aims to recover dense deformation fields by tracking surface motion from camera observations. Traditional image-based methods rely on minimal inter-frame motion to constrain the correspondence search space. We propose an event-frame fusion framework that exploits events for temporally dense motion cues and frames for spatially dense, precise estimation.
arXiv Detail & Related papers (2026-02-16T01:04:48Z) - Model Optimization for Multi-Camera 3D Detection and Tracking [13.756560739163362]
Outside-in multi-camera perception is increasingly important in indoor environments. We evaluate Sparse4D, a query-based 3D detection and tracking framework. We study reduced input frame rates, post-training quantization, transfer to the WILDTRACK benchmark, and Transformer Engine mixed-precision fine-tuning.
arXiv Detail & Related papers (2026-01-31T01:51:30Z) - FFP-300K: Scaling First-Frame Propagation for Generalizable Video Editing [97.35186681023025]
We introduce FFP-300K, a new large-scale dataset of high-fidelity video pairs at 720p resolution and 81 frames in length. We propose a novel framework designed for true guidance-free FFP that resolves the tension between maintaining first-frame appearance and preserving source video motion.
arXiv Detail & Related papers (2026-01-05T01:46:22Z) - Fully Spiking Neural Networks for Unified Frame-Event Object Tracking [17.626181371045575]
We propose the first fully Spiking Frame-Event Tracking framework, called SpikeFET. This network achieves synergistic integration of convolutional local feature extraction and Transformer-based global modeling within the spiking paradigm. We show that the proposed framework achieves superior tracking accuracy over existing methods while significantly reducing power consumption.
arXiv Detail & Related papers (2025-05-27T07:53:50Z) - Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one. We present a lightweight and fast model with streaming memory for dense point tracking and online video processing.
arXiv Detail & Related papers (2025-03-09T06:16:49Z) - MATE: Motion-Augmented Temporal Consistency for Event-based Point Tracking [58.719310295870024]
This paper presents an event-based framework for tracking any point. To resolve ambiguities caused by event sparsity, a motion-guidance module incorporates kinematic vectors into the local matching process. The method improves the $Survival_50$ metric by 17.9% over the event-only tracking-any-point baseline.
arXiv Detail & Related papers (2024-12-02T09:13:29Z) - Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking. DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget. Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z) - Spatio-Temporal Bi-directional Cross-frame Memory for Distractor Filtering Point Cloud Single Object Tracking [2.487142846438629]
3D single object tracking within LiDAR point clouds is a pivotal task in computer vision.
Existing methods, which depend solely on appearance matching via networks or utilize information from successive frames, encounter significant challenges.
We design an innovative cross-frame bi-temporal motion tracker, named STMD-Tracker, to mitigate these challenges.
arXiv Detail & Related papers (2024-03-23T13:15:44Z) - Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers [55.46413719810273]
Rich spatio-temporal information is crucial to model the complicated target appearance variations in visual tracking.
Our method improves the tracker's performance on six popular tracking benchmarks.
arXiv Detail & Related papers (2024-03-15T02:39:26Z) - ProContEXT: Exploring Progressive Context Transformer for Tracking [20.35886416084831]
Existing Visual Object Tracking (VOT) methods take only the target area in the first frame as a template.
This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames.
We revamp the framework with the Progressive Context Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories.
arXiv Detail & Related papers (2022-10-27T14:47:19Z)