Related papers: Is a Pure Transformer Effective for Separated and Online Multi-Object Tracking?

Is a Pure Transformer Effective for Separated and Online Multi-Object Tracking?

URL: http://arxiv.org/abs/2405.14119v2
Date: Tue, 25 Mar 2025 06:46:45 GMT
Title: Is a Pure Transformer Effective for Separated and Online Multi-Object Tracking?
Authors: Chongwei Liu, Haojie Li, Zhihui Wang, Rui Xu,
Abstract summary: Multi-Object Tracking (MOT) has demonstrated success in short-term association within the separated tracking-by-detection online paradigm.<n>In this paper, we review the concept of trajectory graphs and propose a novel perspective by representing them as directed acyclic graphs.<n>We introduce a concise Pure Transformer (PuTR) to validate the effectiveness of Transformer in unifying short- and long-term tracking for separated online MOT.
Score: 36.5272157173876
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Multi-Object Tracking (MOT) have demonstrated significant success in short-term association within the separated tracking-by-detection online paradigm. However, long-term tracking remains challenging. While graph-based approaches address this by modeling trajectories as global graphs, these methods are unsuitable for real-time applications due to their non-online nature. In this paper, we review the concept of trajectory graphs and propose a novel perspective by representing them as directed acyclic graphs. This representation can be described using frame-ordered object sequences and binary adjacency matrices. We observe that this structure naturally aligns with Transformer attention mechanisms, enabling us to model the association problem using a classic Transformer architecture. Based on this insight, we introduce a concise Pure Transformer (PuTR) to validate the effectiveness of Transformer in unifying short- and long-term tracking for separated online MOT. Extensive experiments on four diverse datasets (SportsMOT, DanceTrack, MOT17, and MOT20) demonstrate that PuTR effectively establishes a solid baseline compared to existing foundational online methods while exhibiting superior domain adaptation capabilities. Furthermore, the separated nature enables efficient training and inference, making it suitable for practical applications. Implementation code and trained models are available at https://github.com/chongweiliu/PuTR .

Related papers

CAMELTrack: Context-Aware Multi-cue ExpLoitation for Online Multi-Object Tracking [68.24998698508344]
We introduce CAMEL, a novel association module for Context-Aware Multi-Cue ExpLoitation.<n>Unlike end-to-end detection-by-tracking approaches, our method remains lightweight and fast to train while being able to leverage external off-the-shelf models.<n>Our proposed online tracking pipeline, CAMELTrack, achieves state-of-the-art performance on multiple tracking benchmarks.
arXiv Detail & Related papers (2025-05-02T13:26:23Z)
ACTrack: Adding Spatio-Temporal Condition for Visual Object Tracking [0.5371337604556311]
Efficiently modeling-temporal relations of objects is a key challenge in visual object tracking (VOT) Existing methods track by appearance-based similarity or long-term relation modeling, resulting in rich temporal contexts between consecutive frames being easily overlooked. In this paper we present ACTrack, a new framework with additive pre-temporal tracking framework with large memory conditions. It preserves the quality and capabilities of the pre-trained backbone by freezing its parameters, and makes a trainable lightweight additive net to model temporal relations in tracking. We design an additive siamese convolutional network to ensure the integrity of spatial features and temporal sequence
arXiv Detail & Related papers (2024-02-27T07:34:08Z)
3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking [15.330384668966806]
State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned model-based algorithms such as Kalman Filter. We propose 3DMOTFormer, a learned geometry-based 3D MOT framework building upon the transformer architecture. Our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test split, respectively.
arXiv Detail & Related papers (2023-08-12T19:19:58Z)
Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches. This is the first time that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream. At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank. To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z)
Graph Decision Transformer [83.76329715043205]
Graph Decision Transformer (GDT) is a novel offline reinforcement learning approach. GDT models the input sequence into a causal graph to capture potential dependencies between fundamentally different concepts. Our experiments show that GDT matches or surpasses the performance of state-of-the-art offline RL methods on image-based Atari and OpenAI Gym.
arXiv Detail & Related papers (2023-03-07T09:10:34Z)
ProContEXT: Exploring Progressive Context Transformer for Tracking [20.35886416084831]
Existing Visual Object Tracking (VOT) only takes the target area in the first frame as a template. This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames. We revamped the framework with Progressive Context. Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories.
arXiv Detail & Related papers (2022-10-27T14:47:19Z)
Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects. The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD) It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches. Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking [74.82415271960315]
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video. TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy. The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z)
Transformer Tracking [76.96796612225295]
Correlation acts as a critical role in the tracking field, especially in popular Siamese-based trackers. This work presents a novel attention-based feature fusion network, which effectively combines the template and search region features solely using attention. Experiments show that our TransT achieves very promising results on six challenging datasets.
arXiv Detail & Related papers (2021-03-29T09:06:55Z)
Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking [47.205979159070445]
We bridge the individual video frames and explore the temporal contexts across them via a transformer architecture for robust object tracking. Different from classic usage of the transformer in natural language processing tasks, we separate its encoder and decoder into two parallel branches. Our method sets several new state-of-the-art records on prevalent tracking benchmarks.
arXiv Detail & Related papers (2021-03-22T09:20:05Z)

This list is automatically generated from the titles and abstracts of the papers in this site.