Motion-Aware Transformer for Multi-Object Tracking
- URL: http://arxiv.org/abs/2509.21715v1
- Date: Fri, 26 Sep 2025 00:25:30 GMT
- Title: Motion-Aware Transformer for Multi-Object Tracking
- Authors: Xu Yang, Gady Agam
- Abstract summary: We introduce the Motion-Aware Transformer (MATR), which explicitly predicts object movements across frames to update track queries in advance. Experiments on DanceTrack, SportsMOT, and BDD100k show that MATR delivers significant gains across standard metrics.
- Score: 6.335488846185043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-object tracking (MOT) in videos remains challenging due to complex object motions and crowded scenes. Recent DETR-based frameworks offer end-to-end solutions but typically process detection and tracking queries jointly within a single Transformer Decoder layer, leading to conflicts and degraded association accuracy. We introduce the Motion-Aware Transformer (MATR), which explicitly predicts object movements across frames to update track queries in advance. By reducing query collisions, MATR enables more consistent training and improves both detection and association. Extensive experiments on DanceTrack, SportsMOT, and BDD100k show that MATR delivers significant gains across standard metrics. On DanceTrack, MATR improves HOTA by more than 9 points over MOTR without additional data and reaches a new state-of-the-art score of 71.3 with supplementary data. MATR also achieves state-of-the-art results on SportsMOT (72.2 HOTA) and BDD100k (54.7 mTETA, 41.6 mHOTA) without relying on external datasets. These results demonstrate that explicitly modeling motion within end-to-end Transformers offers a simple yet highly effective approach to advancing multi-object tracking.
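The core idea of the abstract, updating each track query's reference box with a predicted per-frame displacement before the next decoding step, can be sketched as follows. This is a minimal illustration of the concept, not MATR's actual implementation: the `motion_head` (a single random linear map here) and the plain-array representation of track queries are assumptions for demonstration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def motion_head(queries: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Predict a (dx, dy, dw, dh) offset for each track query embedding.
    A real model would use a learned MLP; a linear map suffices to sketch it."""
    return queries @ weight  # shape: (num_queries, 4)

def update_track_queries(ref_boxes: np.ndarray, queries: np.ndarray,
                         weight: np.ndarray) -> np.ndarray:
    """Shift each query's reference box by its predicted motion, so the
    decoder attends near where the object is expected in the next frame."""
    offsets = motion_head(queries, weight)
    return ref_boxes + offsets

num_queries, dim = 3, 8
queries = rng.standard_normal((num_queries, dim))   # track query embeddings
weight = rng.standard_normal((dim, 4)) * 0.01       # small motion offsets
ref_boxes = np.array([[0.2, 0.3, 0.1, 0.2],         # (cx, cy, w, h), normalized
                      [0.5, 0.5, 0.2, 0.3],
                      [0.7, 0.1, 0.1, 0.1]])

updated = update_track_queries(ref_boxes, queries, weight)
```

Shifting queries ahead of decoding is what reduces collisions between detection and track queries: a track query already sits near its object's predicted location instead of competing with fresh detection queries for the same region.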
Related papers
- Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos [58.156141601478794]
Multi-object tracking in UAV-captured videos (UAVT) aims to track multiple objects while maintaining consistent identities across frames of a given video.
Existing methods typically model motion and appearance cues separately, overlooking their interplay and yielding suboptimal tracking performance.
We propose AMOT, which jointly exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module.
arXiv Detail & Related papers (2025-08-03T12:06:47Z)
- Contrastive Learning for Multi-Object Tracking with Transformers [79.61791059432558]
We show how DETR can be turned into a MOT model by employing an instance-level contrastive loss.
Our training scheme learns object appearances while preserving detection capabilities and with little overhead.
Its performance surpasses the previous state-of-the-art by +2.6 mMOTA on the challenging BDD100K dataset.
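An instance-level contrastive loss of the kind described above can be sketched with a standard InfoNCE objective: embeddings of the same object across frames are pulled together, embeddings of different objects pushed apart. This is a generic sketch of the technique, not the paper's exact formulation; the function name and temperature value are illustrative assumptions.

```python
import numpy as np

def info_nce(anchor: np.ndarray, bank: np.ndarray,
             temperature: float = 0.1) -> float:
    """InfoNCE loss for one anchor embedding against a bank of embeddings,
    where bank[0] is the positive (same identity in another frame) and the
    remaining rows are negatives (other identities)."""
    anchor = anchor / np.linalg.norm(anchor)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    logits = bank @ anchor / temperature          # cosine similarities, scaled
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[0])                   # -log p(positive)

# Toy usage: the positive matches the anchor, the negative is orthogonal,
# so the loss should be close to zero.
anchor = np.array([1.0, 0.0])
bank = np.array([[1.0, 0.0],   # positive: same identity
                 [0.0, 1.0]])  # negative: different identity
loss = info_nce(anchor, bank)
```

The appeal noted in the entry, learning appearance "with little overhead," comes from the fact that such a loss only adds an embedding head and a similarity computation on top of the existing detector.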
arXiv Detail & Related papers (2023-11-14T10:07:52Z)
- Rt-Track: Robust Tricks for Multi-Pedestrian Tracking [4.271127739716044]
We propose a novel direction consistency method for smooth trajectory prediction (STP-DC) to increase the modeling of motion information.
We also propose a hyper-grain feature embedding network (HG-FEN) to enhance the modeling of appearance models.
To achieve state-of-the-art performance in MOT, we propose a robust tracker named Rt-track, incorporating various tricks and techniques.
arXiv Detail & Related papers (2023-03-16T22:08:29Z)
- Strong-TransCenter: Improved Multi-Object Tracking based on Transformers with Dense Representations [0.6144680854063939]
Transformer networks have been a focus of research in many fields in recent years, surpassing state-of-the-art performance in various computer vision tasks.
In Multiple Object Tracking (MOT), however, leveraging the power of Transformers remains relatively unexplored.
Among the pioneering efforts in this domain, TransCenter, a Transformer-based MOT architecture with dense object queries, demonstrated exceptional tracking capability while maintaining reasonable runtime.
We propose a post-processing mechanism grounded in the track-by-detection paradigm, aiming to refine the track displacement estimation.
arXiv Detail & Related papers (2022-10-24T19:47:58Z)
- Global Tracking Transformers [76.58184022651596]
We present a novel transformer-based architecture for global multi-object tracking.
The core component is a global tracking transformer that operates on objects from all frames in the sequence.
Our framework seamlessly integrates into state-of-the-art large-vocabulary detectors to track any objects.
arXiv Detail & Related papers (2022-03-24T17:58:04Z)
- VariabilityTrack: Multi-Object Tracking with Variable Speed Object Movement [1.6385815610837167]
Multi-object tracking (MOT) aims at estimating bounding boxes and identities of objects in videos.
We propose a variable speed Kalman filter algorithm based on environmental feedback and improve the matching process.
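A variable-speed Kalman filter of the kind described here can be sketched with a constant-velocity motion model whose process noise grows with the estimated speed, so fast-moving objects get looser motion predictions. This is a generic sketch of the idea under stated assumptions (state layout `[x, y, vx, vy]`, a simple speed-scaled noise term), not the paper's actual algorithm.

```python
import numpy as np

def kalman_predict(state: np.ndarray, cov: np.ndarray,
                   dt: float = 1.0, base_q: float = 1e-2):
    """One Kalman prediction step for a constant-velocity model.
    state: [x, y, vx, vy]; cov: 4x4 state covariance.
    Process noise Q is scaled by the current speed estimate, so
    uncertainty grows faster for fast-moving objects."""
    F = np.array([[1.0, 0.0, dt,  0.0],   # x  += vx * dt
                  [0.0, 1.0, 0.0, dt ],   # y  += vy * dt
                  [0.0, 0.0, 1.0, 0.0],   # vx unchanged
                  [0.0, 0.0, 0.0, 1.0]])  # vy unchanged
    speed = np.linalg.norm(state[2:])
    Q = base_q * (1.0 + speed) * np.eye(4)
    return F @ state, F @ cov @ F.T + Q

state = np.array([0.0, 0.0, 1.0, 2.0])   # at origin, moving (1, 2) per frame
cov = np.eye(4)
pred_state, pred_cov = kalman_predict(state, cov)
```

In a full tracker, the predicted covariance would then widen or tighten the gating region used when matching detections to tracks, which is where the "improved matching process" mentioned above would come in.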
arXiv Detail & Related papers (2022-03-12T12:39:41Z)
- TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers [96.981282736404]
We present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures.
Our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP.
Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off with 83.7% mAP while running at around 30 FPS.
arXiv Detail & Related papers (2022-01-13T16:17:34Z)
- TrackFormer: Multi-Object Tracking with Transformers [92.25832593088421]
TrackFormer is an end-to-end multi-object tracking and segmentation model based on an encoder-decoder Transformer architecture.
New track queries are spawned by the DETR object detector and embed the position of their corresponding object over time.
TrackFormer achieves a seamless data association between frames in a new tracking-by-attention paradigm.
arXiv Detail & Related papers (2021-01-07T18:59:29Z)
- Simultaneous Detection and Tracking with Motion Modelling for Multiple Object Tracking [94.24393546459424]
We introduce Deep Motion Modeling Network (DMM-Net) that can estimate multiple objects' motion parameters to perform joint detection and association.
DMM-Net achieves a PR-MOTA score of 12.80 at over 120 fps on the popular UA-DETRAC challenge, outperforming prior methods while running orders of magnitude faster.
We also contribute a synthetic large-scale public dataset Omni-MOT for vehicle tracking that provides precise ground-truth annotations.
arXiv Detail & Related papers (2020-08-20T08:05:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.