Related papers: SynCL: A Synergistic Training Strategy with Instance-Aware Contrastive Learning for End-to-End Multi-Camera 3D Tracking

SynCL: A Synergistic Training Strategy with Instance-Aware Contrastive Learning for End-to-End Multi-Camera 3D Tracking

URL: http://arxiv.org/abs/2411.06780v2
Date: Sat, 25 Jan 2025 08:52:03 GMT
Title: SynCL: A Synergistic Training Strategy with Instance-Aware Contrastive Learning for End-to-End Multi-Camera 3D Tracking
Authors: Shubo Lin, Yutong Kou, Zirui Wu, Shaoru Wang, Bing Li, Weiming Hu, Jin Gao,
Abstract summary: SynCL is a novel plug-and-play synergistic training strategy designed to co-facilitate multi-task learning for detection and tracking.<n>We show that SynCL consistently delivers improvements when integrated with the training stage of various query-based 3D MOT trackers.<n>Without additional inference costs, SynCL improves the state-of-the-art PF-Track method by $+3.9%$ AMOTA and $+2.0%$ NDS on the nuScenes dataset.
Score: 34.90147791481045
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While existing query-based 3D end-to-end visual trackers integrate detection and tracking via the tracking-by-attention paradigm, these two chicken-and-egg tasks encounter optimization difficulties when sharing the same parameters. Our findings reveal that these difficulties arise due to two inherent constraints on the self-attention mechanism, i.e., over-deduplication for object queries and self-centric attention for track queries. In contrast, removing self-attention mechanism not only minimally impacts regression predictions of the tracker, but also tends to generate more latent candidate boxes. Based on these analyses, we present SynCL, a novel plug-and-play synergistic training strategy designed to co-facilitate multi-task learning for detection and tracking. Specifically, we propose a Task-specific Hybrid Matching module for a weight-shared cross-attention-based decoder that matches the targets of track queries with multiple object queries to exploit promising candidates overlooked by the self-attention mechanism. To dynamically select optimal candidates for the one-to-many matching, we also design a Cost-based Query Filtering module controlled by model training status. Moreover, we introduce Instance-aware Contrastive Learning to break through the barrier of self-centric attention for track queries, effectively bridging the gap between detection and tracking. Extensive experiments demonstrate that SynCL consistently delivers improvements when integrated with the training stage of various query-based 3D MOT trackers. Without additional inference costs, SynCL improves the state-of-the-art PF-Track method by $+3.9\%$ AMOTA and $+2.0\%$ NDS on the nuScenes dataset.

Related papers

CAMELTrack: Context-Aware Multi-cue ExpLoitation for Online Multi-Object Tracking [68.24998698508344]
We introduce CAMEL, a novel association module for Context-Aware Multi-Cue ExpLoitation.<n>Unlike end-to-end detection-by-tracking approaches, our method remains lightweight and fast to train while being able to leverage external off-the-shelf models.<n>Our proposed online tracking pipeline, CAMELTrack, achieves state-of-the-art performance on multiple tracking benchmarks.
arXiv Detail & Related papers (2025-05-02T13:26:23Z)
Multi-object Tracking by Detection and Query: an efficient end-to-end manner [23.926668750263488]
Multi-object tracking is advancing through two dominant paradigms: traditional tracking by detection and newly emerging tracking by query. We propose the tracking-by-detection-and-query paradigm, which is achieved by a Learnable Associator. Compared to tracking-by-query models, LAID achieves competitive tracking accuracy with notably higher training efficiency.
arXiv Detail & Related papers (2024-11-09T14:38:08Z)
Lost and Found: Overcoming Detector Failures in Online Multi-Object Tracking [15.533652456081374]
Multi-object tracking (MOT) endeavors to precisely estimate identities and positions of multiple objects over time. Modern detectors may occasionally miss some objects in certain frames, causing trackers to cease tracking prematurely. We propose BUSCA, meaning to search', a versatile framework compatible with any online TbD system.
arXiv Detail & Related papers (2024-07-14T10:45:12Z)
ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association [15.161640917854363]
We introduce ADA-Track, a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention. We integrate this association module into the decoder layer of a DETR-based 3D detector.
arXiv Detail & Related papers (2024-05-14T19:02:33Z)
Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets. Our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z)
Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed textbfMMTrack, which casts VL tracking as a token generation task. Our proposed framework serializes language description and bounding box into a sequence of discrete tokens. In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
S$^3$Track: Self-supervised Tracking with Soft Assignment Flow [45.77333923477176]
We study self-supervised multiple object tracking without using any video-level association labels. We propose differentiable soft object assignment for object association. We evaluate our proposed model on the KITTI, nuScenes, and Argoverse datasets.
arXiv Detail & Related papers (2023-05-17T06:25:40Z)
Unified Sequence-to-Sequence Learning for Single- and Multi-Modal Visual Object Tracking [64.28025685503376]
SeqTrack casts visual tracking as a sequence generation task, forecasting object bounding boxes in an autoregressive manner. SeqTrackv2 integrates a unified interface for auxiliary modalities and a set of task-prompt tokens to specify the task. This sequence learning paradigm not only simplifies the tracking framework, but also showcases superior performance across 14 challenging benchmarks.
arXiv Detail & Related papers (2023-04-27T17:56:29Z)
You Only Need Two Detectors to Achieve Multi-Modal 3D Multi-Object Tracking [9.20064374262956]
The proposed framework can achieve robust tracking by using only a 2D detector and a 3D detector. It is proven more accurate than many of the state-of-the-art TBD-based multi-modal tracking methods.
arXiv Detail & Related papers (2023-04-18T02:45:18Z)
DIVOTrack: A Novel Dataset and Baseline Method for Cross-View Multi-Object Tracking in DIVerse Open Scenes [74.64897845999677]
We introduce a new cross-view multi-object tracking dataset for DIVerse Open scenes with dense tracking pedestrians. Our DIVOTrack has fifteen distinct scenarios and 953 cross-view tracks, surpassing all cross-view multi-object tracking datasets currently available. Furthermore, we provide a novel baseline cross-view tracking method with a unified joint detection and cross-view tracking framework named CrossMOT.
arXiv Detail & Related papers (2023-02-15T14:10:42Z)
3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds [95.54285993019843]
We propose a method for joint detection and tracking of multiple objects in 3D point clouds. Our model exploits temporal information employing multiple frames to detect objects and track them in a single network.
arXiv Detail & Related papers (2022-11-01T20:59:38Z)
End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time. Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
InterTrack: Interaction Transformer for 3D Multi-Object Tracking [9.283656931246645]
3D multi-object tracking (MOT) is a key problem for autonomous vehicles. Our proposed solution, InterTrack, generates discriminative object representations for data association. We validate our approach on the nuScenes 3D MOT benchmark, where we observe significant improvements.
arXiv Detail & Related papers (2022-08-17T03:24:36Z)
Transformer-based assignment decision network for multiple object tracking [2.2920634931825803]
We introduce Transformer-based Assignment Decision Network (TADN) that tackles data association without the need of explicit optimization during inference.<n>Our proposed approach demonstrates strong performance in most evaluation metrics despite its simple nature as a tracker.
arXiv Detail & Related papers (2022-08-06T19:47:32Z)
Unified Transformer Tracker for Object Tracking [58.65901124158068]
We present the Unified Transformer Tracker (UTT) to address tracking problems in different scenarios with one paradigm. A track transformer is developed in our UTT to track the target in both Single Object Tracking (SOT) and Multiple Object Tracking (MOT)
arXiv Detail & Related papers (2022-03-29T01:38:49Z)
Distractor-Aware Fast Tracking via Dynamic Convolutions and MOT Philosophy [63.91005999481061]
A practical long-term tracker typically contains three key properties, i.e. an efficient model design, an effective global re-detection strategy and a robust distractor awareness mechanism. We propose a two-task tracking frame work (named DMTrack) to achieve distractor-aware fast tracking via Dynamic convolutions (d-convs) and Multiple object tracking (MOT) philosophy. Our tracker achieves state-of-the-art performance on the LaSOT, OxUvA, TLP, VOT2018LT and VOT 2019LT benchmarks and runs in real-time (3x faster
arXiv Detail & Related papers (2021-04-25T00:59:53Z)
Track to Detect and Segment: An Online Multi-Object Tracker [81.15608245513208]
TraDeS is an online joint detection and tracking model, exploiting tracking clues to assist detection end-to-end. TraDeS infers object tracking offset by a cost volume, which is used to propagate previous object features.
arXiv Detail & Related papers (2021-03-16T02:34:06Z)
DEFT: Detection Embeddings for Tracking [3.326320568999945]
We propose an efficient joint detection and tracking model named DEFT. Our approach relies on an appearance-based object matching network jointly-learned with an underlying object detection network. DEFT has comparable accuracy and speed to the top methods on 2D online tracking leaderboards.
arXiv Detail & Related papers (2021-02-03T20:00:44Z)
TrackFormer: Multi-Object Tracking with Transformers [92.25832593088421]
TrackFormer is an end-to-end multi-object tracking and segmentation model based on an encoder-decoder Transformer architecture. New track queries are spawned by the DETR object detector and embed the position of their corresponding object over time. TrackFormer achieves a seamless data association between frames in a new tracking-by-attention paradigm.
arXiv Detail & Related papers (2021-01-07T18:59:29Z)
Chained-Tracker: Chaining Paired Attentive Regression Results for End-to-End Joint Multiple-Object Detection and Tracking [102.31092931373232]
We propose a simple online model named Chained-Tracker (CTracker), which naturally integrates all the three subtasks into an end-to-end solution. The two major novelties: chained structure and paired attentive regression, make CTracker simple, fast and effective.
arXiv Detail & Related papers (2020-07-29T02:38:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.