TrackFormer: Multi-Object Tracking with Transformers
- URL: http://arxiv.org/abs/2101.02702v1
- Date: Thu, 7 Jan 2021 18:59:29 GMT
- Title: TrackFormer: Multi-Object Tracking with Transformers
- Authors: Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, Christoph Feichtenhofer
- Abstract summary: TrackFormer is an end-to-end multi-object tracking and segmentation model based on an encoder-decoder Transformer architecture.
New track queries are spawned by the DETR object detector and embed the position of their corresponding object over time.
TrackFormer achieves a seamless data association between frames in a new tracking-by-attention paradigm.
- Score: 92.25832593088421
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present TrackFormer, an end-to-end multi-object tracking and segmentation
model based on an encoder-decoder Transformer architecture. Our approach
introduces track query embeddings which follow objects through a video sequence
in an autoregressive fashion. New track queries are spawned by the DETR object
detector and embed the position of their corresponding object over time. The
Transformer decoder adjusts track query embeddings from frame to frame, thereby
following the changing object positions. TrackFormer achieves a seamless data
association between frames in a new tracking-by-attention paradigm by self- and
encoder-decoder attention mechanisms which simultaneously reason about
location, occlusion, and object identity. TrackFormer yields state-of-the-art
performance on the tasks of multi-object tracking (MOT17) and segmentation
(MOTS20). We hope our unified way of performing detection and tracking will
foster future research in multi-object tracking and video understanding. Code
will be made publicly available.
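The autoregressive handling of track queries described in the abstract (carry queries over from the previous frame, let the decoder update them, spawn new queries for newly detected objects, drop queries whose object is gone) can be illustrated with a minimal sketch. This is not the authors' implementation: the Transformer decoder and the DETR detector are replaced by a hypothetical `toy_decoder` that matches 1-D positions by distance, and the names `TrackQuery`, `toy_decoder`, `track`, and `max_jump` are assumptions made only for this illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class TrackQuery:
    track_id: int
    position: float  # toy 1-D stand-in for a learned query embedding


def toy_decoder(queries, detections, max_jump=2.0):
    """Hypothetical stand-in for the Transformer decoder (not attention).

    Moves each track query to the nearest unclaimed detection within
    max_jump and returns the updated queries plus unclaimed detections.
    """
    unclaimed = list(detections)
    updated = []
    for q in queries:
        if not unclaimed:
            continue  # nothing left to follow, so this track ends
        nearest = min(unclaimed, key=lambda d: abs(d - q.position))
        if abs(nearest - q.position) <= max_jump:
            unclaimed.remove(nearest)
            updated.append(TrackQuery(q.track_id, nearest))
        # otherwise the query is dropped (object left or is occluded)
    return updated, unclaimed


def track(frames):
    """Run the frame-by-frame loop; return one {track_id: position} per frame."""
    ids = count(1)
    queries = []  # track queries carried over from the previous frame
    history = []
    for detections in frames:
        # 1) update existing track queries against the current frame
        queries, unclaimed = toy_decoder(queries, detections)
        # 2) spawn a new track query for every object not yet tracked
        queries += [TrackQuery(next(ids), d) for d in unclaimed]
        history.append({q.track_id: q.position for q in queries})
    return history


frames = [[0.0, 10.0], [1.0, 10.5], [2.0, 20.0]]
for t, tracks in enumerate(track(frames)):
    print(t, tracks)
```

In this toy run, track 1 is followed through all three frames, track 2 ends in the last frame because no detection lies within `max_jump`, and the detection at 20.0 spawns track 3. In the model described above, identity is carried by the query embedding itself, so no separate matching step of this kind is needed.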
Related papers
- HSTrack: Bootstrap End-to-End Multi-Camera 3D Multi-object Tracking with Hybrid Supervision [34.7347336548199]
In camera-based 3D multi-object tracking (MOT), the prevailing methods follow the tracking-by-query-propagation paradigm.
We present HSTrack, a novel plug-and-play method designed to co-facilitate multi-task learning for detection and tracking.
arXiv Detail & Related papers (2024-11-11T08:18:49Z)
- Lost and Found: Overcoming Detector Failures in Online Multi-Object Tracking [15.533652456081374]
Multi-object tracking (MOT) endeavors to precisely estimate identities and positions of multiple objects over time.
Modern detectors may occasionally miss some objects in certain frames, causing trackers to cease tracking prematurely.
We propose BUSCA, meaning 'to search', a versatile framework compatible with any online tracking-by-detection (TbD) system.
arXiv Detail & Related papers (2024-07-14T10:45:12Z)
- Unified Sequence-to-Sequence Learning for Single- and Multi-Modal Visual Object Tracking [64.28025685503376]
SeqTrack casts visual tracking as a sequence generation task, forecasting object bounding boxes in an autoregressive manner.
SeqTrackv2 integrates a unified interface for auxiliary modalities and a set of task-prompt tokens to specify the task.
This sequence learning paradigm not only simplifies the tracking framework, but also showcases superior performance across 14 challenging benchmarks.
arXiv Detail & Related papers (2023-04-27T17:56:29Z)
- DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes [74.64897845999677]
We introduce a new cross-view multi-object tracking dataset for DIVerse Open scenes with densely tracked pedestrians.
Our DIVOTrack has fifteen distinct scenarios and 953 cross-view tracks, surpassing all cross-view multi-object tracking datasets currently available.
Furthermore, we provide a novel baseline cross-view tracking method with a unified joint detection and cross-view tracking framework named CrossMOT.
arXiv Detail & Related papers (2023-02-15T14:10:42Z)
- Tracking by Associating Clips [110.08925274049409]
In this paper, we investigate an alternative by treating object association as clip-wise matching.
Our new perspective views a single long video sequence as multiple short clips, and then the tracking is performed both within and between the clips.
The benefits of this new approach are twofold. First, our method is robust to tracking error accumulation or propagation, as the video chunking allows bypassing the interrupted frames.
Second, the multiple frame information is aggregated during the clip-wise matching, resulting in a more accurate long-range track association than the current frame-wise matching.
arXiv Detail & Related papers (2022-12-20T10:33:17Z)
- End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
- Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking [47.205979159070445]
We bridge the individual video frames and explore the temporal contexts across them via a transformer architecture for robust object tracking.
Different from classic usage of the transformer in natural language processing tasks, we separate its encoder and decoder into two parallel branches.
Our method sets several new state-of-the-art records on prevalent tracking benchmarks.
arXiv Detail & Related papers (2021-03-22T09:20:05Z)
- Track to Detect and Segment: An Online Multi-Object Tracker [81.15608245513208]
TraDeS is an online joint detection and tracking model, exploiting tracking clues to assist detection end-to-end.
TraDeS infers the object tracking offset from a cost volume; this offset is used to propagate previous object features.
arXiv Detail & Related papers (2021-03-16T02:34:06Z)