Related papers: NOOUGAT: Towards Unified Online and Offline Multi-Object Tracking

URL: http://arxiv.org/abs/2509.02111v1
Date: Tue, 02 Sep 2025 09:08:24 GMT
Title: NOOUGAT: Towards Unified Online and Offline Multi-Object Tracking
Authors: Benjamin Missaoui, Orcun Cetintas, Guillem Brasó, Tim Meinhardt, Laura Leal-Taixé
Abstract summary: NOOUGAT is the first tracker to operate with arbitrary temporal horizons. It improves online AssA by +2.3 on DanceTrack, +9.2 on SportsMOT, and +5.0 on MOT20, with even greater gains in offline mode.
Score: 31.46043749958963
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The long-standing division between online and offline Multi-Object Tracking (MOT) has led to fragmented solutions that fail to address the flexible temporal requirements of real-world deployment scenarios. Current online trackers rely on frame-by-frame hand-crafted association strategies and struggle with long-term occlusions, whereas offline approaches can cover larger time gaps, but still rely on heuristic stitching for arbitrarily long sequences. In this paper, we introduce NOOUGAT, the first tracker designed to operate with arbitrary temporal horizons. NOOUGAT leverages a unified Graph Neural Network (GNN) framework that processes non-overlapping subclips, and fuses them through a novel Autoregressive Long-term Tracking (ALT) layer. The subclip size controls the trade-off between latency and temporal context, enabling a wide range of deployment scenarios, from frame-by-frame to batch processing. NOOUGAT achieves state-of-the-art performance across both tracking regimes, improving online AssA by +2.3 on DanceTrack, +9.2 on SportsMOT, and +5.0 on MOT20, with even greater gains in offline mode.
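
Illustration (not from the paper): the abstract says that the subclip size sets the trade-off between latency and temporal context. A minimal sketch of just the partitioning step, assuming frames are indexed from 0 and that the last subclip may be shorter, could look like the following. The function name `split_into_subclips` and its parameters are hypothetical; the GNN and the ALT fusion layer are not modelled here.

```python
from typing import List


def split_into_subclips(num_frames: int, subclip_size: int) -> List[range]:
    """Partition frame indices 0..num_frames-1 into non-overlapping subclips.

    Hypothetical helper for illustration only; it is not the NOOUGAT code.
    """
    if subclip_size < 1:
        raise ValueError("subclip_size must be at least 1")
    return [
        range(start, min(start + subclip_size, num_frames))
        for start in range(0, num_frames, subclip_size)
    ]


# subclip_size=1 mimics frame-by-frame (online) processing.
online = split_into_subclips(num_frames=6, subclip_size=1)
# subclip_size=num_frames mimics batch (offline) processing.
offline = split_into_subclips(num_frames=6, subclip_size=6)
# Anything in between trades latency for temporal context.
mixed = split_into_subclips(num_frames=7, subclip_size=3)
print([list(r) for r in mixed])  # [[0, 1, 2], [3, 4, 5], [6]]
```

A tracker built this way would wait for `subclip_size` frames before emitting results, which is why the size acts as a latency knob.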

Related papers

TeFlow: Enabling Multi-frame Supervision for Self-Supervised Feed-forward Scene Flow Estimation [14.239684633948746]
Multi-frame supervision has the potential to provide more stable guidance by incorporating motion cues from past frames. We present TeFlow, enabling multi-frame supervision for feed-forward models by mining temporally consistent supervision. Our method performs on par with leading optimization-based methods, yet runs 150 times faster.
arXiv Detail & Related papers (2026-02-22T05:50:16Z)
Offline-Poly: A Polyhedral Framework For Offline 3D Multi-Object Tracking [11.527022085205012]
Offline 3D MOT is a critical component of the 4D auto-labeling process. We propose Offline-Poly, a general offline 3D MOT method based on a tracking-centric design.
arXiv Detail & Related papers (2026-02-14T13:34:21Z)
Track-On2: Enhancing Online Point Tracking with Memory [57.820749134569574]
We extend our prior model Track-On into Track-On2, a simple and efficient transformer-based model for online long-term tracking. Track-On2 improves both performance and efficiency through architectural refinements, more effective use of memory, and improved synthetic training strategies.
arXiv Detail & Related papers (2025-09-23T15:00:18Z)
On-the-Fly Adaptive Distillation of Transformer to Dual-State Linear Attention [53.22963042513293]
Large language models (LLMs) excel at capturing global token dependencies via self-attention but face prohibitive compute and memory costs on lengthy inputs. We first propose dual-state linear attention (DSLA), a novel design that maintains two hidden states, one for preserving historical context and one for tracking recency, thereby mitigating the short-range bias typical of linear-attention architectures. We introduce DSLA-Serve, an online adaptive distillation framework that progressively replaces Transformer layers with DSLA layers at inference time, guided by a sensitivity-based layer ordering.
arXiv Detail & Related papers (2025-06-11T01:25:06Z)
CAMELTrack: Context-Aware Multi-cue ExpLoitation for Online Multi-Object Tracking [68.24998698508344]
We introduce CAMEL, a novel association module for Context-Aware Multi-Cue ExpLoitation. Unlike end-to-end detection-by-tracking approaches, our method remains lightweight and fast to train while being able to leverage external off-the-shelf models. Our proposed online tracking pipeline, CAMELTrack, achieves state-of-the-art performance on multiple tracking benchmarks.
arXiv Detail & Related papers (2025-05-02T13:26:23Z)
Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video. Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one. We present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing.
arXiv Detail & Related papers (2025-03-09T06:16:49Z)
Track-On: Transformer-based Online Point Tracking with Memory [34.744546679670734]
We introduce Track-On, a simple transformer-based model designed for online long-term point tracking. Unlike prior methods that depend on full temporal modeling, our model processes video frames causally without access to future frames. At inference time, it employs patch classification and refinement to identify correspondences and track points with high accuracy.
arXiv Detail & Related papers (2025-01-30T17:04:11Z)
Exploiting Multimodal Spatial-temporal Patterns for Video Object Tracking [53.33637391723555]
We propose a unified multimodal spatial-temporal tracking approach named STTrack. In contrast to previous paradigms, we introduced a temporal state generator (TSG) that continuously generates a sequence of tokens containing multimodal temporal information. These temporal information tokens are used to guide the localization of the target in the next time state, establish long-range contextual relationships between video frames, and capture the temporal trajectory of the target.
arXiv Detail & Related papers (2024-12-20T09:10:17Z)
Explicit Visual Prompts for Visual Object Tracking [23.561539973210248]
EVPTrack is a visual tracking framework that exploits explicit visual prompts between consecutive frames. We show that our framework can achieve competitive performance in real time by exploiting both explicit and multi-scale information.
arXiv Detail & Related papers (2024-01-06T07:12:07Z)
ODTrack: Online Dense Temporal Token Learning for Visual Tracking [22.628561792412686]
ODTrack is a video-level tracking pipeline that densely associates contextual relationships of video frames in an online token propagation manner. It achieves a new SOTA performance on seven benchmarks, while running at real-time speed.
arXiv Detail & Related papers (2024-01-03T11:44:09Z)
DVIS: Decoupled Video Instance Segmentation Framework [15.571072365208872]
Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex and long videos in the real world, primarily due to two factors. We propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement.
arXiv Detail & Related papers (2023-06-06T05:24:15Z)
Temporal Aggregation and Propagation Graph Neural Networks for Dynamic Representation [67.26422477327179]
Temporal graphs exhibit dynamic interactions between nodes over continuous time. We propose a novel method of temporal graph convolution with the whole neighborhood. Our proposed TAP-GNN outperforms existing temporal graph methods by a large margin in terms of both predictive performance and online inference latency.
arXiv Detail & Related papers (2023-04-15T08:17:18Z)
Tracking by Associating Clips [110.08925274049409]
In this paper, we investigate an alternative by treating object association as clip-wise matching. Our new perspective views a single long video sequence as multiple short clips, and then the tracking is performed both within and between the clips. The benefits of this new approach are twofold. First, our method is robust to tracking error accumulation or propagation, as the video chunking allows bypassing the interrupted frames. Second, the multiple frame information is aggregated during the clip-wise matching, resulting in a more accurate long-range track association than the current frame-wise matching.
arXiv Detail & Related papers (2022-12-20T10:33:17Z)
IDEA-Net: Dynamic 3D Point Cloud Interpolation via Deep Embedding Alignment [58.8330387551499]
We formulate the problem as estimation of point-wise trajectories (i.e., smooth curves). We propose IDEA-Net, an end-to-end deep learning framework, which disentangles the problem under the assistance of the explicitly learned temporal consistency. We demonstrate the effectiveness of our method on various point cloud sequences and observe large improvement over state-of-the-art methods both quantitatively and visually.
arXiv Detail & Related papers (2022-03-22T10:14:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.