MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking
- URL: http://arxiv.org/abs/2305.14298v1
- Date: Tue, 23 May 2023 17:40:13 GMT
- Title: MOTRv3: Release-Fetch Supervision for End-to-End Multi-Object Tracking
- Authors: En Yu, Tiancai Wang, Zhuoling Li, Yuang Zhang, Xiangyu Zhang, Wenbing Tao
- Abstract summary: We propose MOTRv3, which balances the label assignment process using the developed release-fetch supervision strategy.
In addition, two further strategies, pseudo-label distillation and track group denoising, are designed to further improve the supervision for detection and association.
- Score: 27.493264998858955
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although end-to-end multi-object trackers like MOTR enjoy the merits of simplicity, they suffer seriously from the conflict between detection and association, resulting in unsatisfactory convergence dynamics. While MOTRv2 partly addresses this problem, it demands an additional detection network for assistance. In this work, we are the first to reveal that this conflict arises from the unfair label assignment between detect queries and track queries during training, where detect queries recognize targets and track queries associate them. Based on this observation, we propose MOTRv3, which balances the label assignment process using the developed release-fetch supervision strategy. In this strategy, labels are first released for detection and gradually fetched back for association. In addition, two further strategies, pseudo-label distillation and track group denoising, are designed to further improve the supervision for detection and association. Without the assistance of an extra detection network during inference, MOTRv3 achieves impressive performance across diverse benchmarks, e.g., MOT17 and DanceTrack.
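The release-fetch idea described above can be sketched as a label-assignment schedule: early in training all ground-truth labels go to detect queries ("release"), and as training progresses the labels for already-tracked targets are gradually handed back to track queries ("fetch"). The following is a minimal illustrative sketch, assuming a linear schedule; the function names, epoch thresholds, and the hash-based partial split are hypothetical and are not the paper's actual implementation.

```python
def fetch_ratio(epoch: int, release_epochs: int, fetch_epochs: int) -> float:
    """Fraction of tracked-target labels assigned back to track queries."""
    if epoch < release_epochs:           # release phase: detection gets everything
        return 0.0
    progress = (epoch - release_epochs) / max(fetch_epochs, 1)
    return min(progress, 1.0)            # fetch phase: linearly hand labels back


def assign_labels(targets, tracked_ids, epoch, release_epochs=4, fetch_epochs=8):
    """Split ground-truth targets between detect queries and track queries."""
    r = fetch_ratio(epoch, release_epochs, fetch_epochs)
    to_detect, to_track = [], []
    for t in targets:
        if t in tracked_ids and r >= 1.0:
            to_track.append(t)           # fully fetched: track queries supervise
        elif t in tracked_ids and r > 0.0:
            # partial fetch: deterministic split by id, for illustration only
            (to_track if hash(t) % 100 < r * 100 else to_detect).append(t)
        else:
            to_detect.append(t)          # new targets always go to detection
    return to_detect, to_track
```

Under this sketch, at epoch 0 every target supervises detection, while late in training all previously tracked targets supervise association, which is the balancing behavior the abstract attributes to release-fetch supervision.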
Related papers
- ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association [15.161640917854363]
We introduce ADA-Track, a novel end-to-end framework for 3D MOT from multi-view cameras.
We introduce a learnable data association module based on edge-augmented cross-attention.
We integrate this association module into the decoder layer of a DETR-based 3D detector.
arXiv Detail & Related papers (2024-05-14T19:02:33Z) - Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z) - Bridging the Gap Between End-to-end and Non-End-to-end Multi-Object Tracking [27.74953961900086]
Existing end-to-end Multi-Object Tracking (e2e-MOT) methods have not surpassed non-end-to-end tracking-by-detection methods.
We present Co-MOT, a simple and effective method to facilitate e2e-MOT by a novel coopetition label assignment with a shadow concept.
arXiv Detail & Related papers (2023-05-22T05:18:34Z) - MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors [14.69168925956635]
MOTRv2 is a pipeline to bootstrap end-to-end multi-object tracking with a pretrained object detector.
It ranks 1st (73.4% HOTA on DanceTrack) in the 1st Multiple People Tracking in Group Dance Challenge.
It reaches state-of-the-art performance on the BDD100K dataset.
arXiv Detail & Related papers (2022-11-17T18:57:12Z) - ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z) - Distractor-Aware Fast Tracking via Dynamic Convolutions and MOT Philosophy [63.91005999481061]
A practical long-term tracker typically contains three key properties, i.e. an efficient model design, an effective global re-detection strategy and a robust distractor awareness mechanism.
We propose a two-task tracking framework (named DMTrack) to achieve distractor-aware fast tracking via dynamic convolutions (d-convs) and multiple-object tracking (MOT) philosophy.
Our tracker achieves state-of-the-art performance on the LaSOT, OxUvA, TLP, VOT2018LT and VOT2019LT benchmarks and runs in real time (3x faster).
arXiv Detail & Related papers (2021-04-25T00:59:53Z) - Object Detection Made Simpler by Eliminating Heuristic NMS [70.93004137521946]
We show a simple NMS-free, end-to-end object detection framework.
We attain on par or even improved detection accuracy compared with the original one-stage detector.
arXiv Detail & Related papers (2021-01-28T02:38:29Z) - Rethinking the competition between detection and ReID in Multi-Object Tracking [44.59367033562385]
One-shot models, which jointly learn detection and identification embeddings, have drawn great attention in multi-object tracking (MOT).
In this paper, we propose a novel reciprocal network (REN) with a self-relation and cross-relation design to better learn task-dependent representations.
We also introduce a scale-aware attention network (SAAN) that prevents semantic level misalignment to improve the association capability of ID embeddings.
arXiv Detail & Related papers (2020-10-23T02:44:59Z) - Dense Scene Multiple Object Tracking with Box-Plane Matching [73.54369833671772]
Multiple Object Tracking (MOT) is an important task in computer vision.
We propose the Box-Plane Matching (BPM) method to improve MOT performance in dense scenes.
With the effectiveness of the three modules, our team achieves the 1st place on the Track-1 leaderboard in the ACM MM Grand Challenge HiEve 2020.
arXiv Detail & Related papers (2020-07-30T16:39:22Z) - EHSOD: CAM-Guided End-to-end Hybrid-Supervised Object Detection with Cascade Refinement [53.69674636044927]
We present EHSOD, an end-to-end hybrid-supervised object detection system.
It can be trained in one shot on both fully and weakly-annotated data.
It achieves comparable results on multiple object detection benchmarks with only 30% fully-annotated data.
arXiv Detail & Related papers (2020-02-18T08:04:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.