DiffusionTrack: Diffusion Model For Multi-Object Tracking
- URL: http://arxiv.org/abs/2308.09905v2
- Date: Wed, 21 Feb 2024 14:18:26 GMT
- Title: DiffusionTrack: Diffusion Model For Multi-Object Tracking
- Authors: Run Luo, Zikai Song, Lintao Ma, Jinlin Wei, Wei Yang, Min Yang
- Abstract summary: Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames.
Recent MOT approaches can be categorized into two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods.
We propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process.
- Score: 15.025051933538043
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-object tracking (MOT) is a challenging vision task that aims to detect
individual objects within a single frame and associate them across multiple
frames. Recent MOT approaches can be categorized into two-stage
tracking-by-detection (TBD) methods and one-stage joint detection and tracking
(JDT) methods. Despite the success of these approaches, they also suffer from
common problems, such as harmful global or local inconsistency, poor trade-off
between robustness and model complexity, and lack of flexibility in different
scenes within the same video. In this paper we propose a simple but robust
framework that formulates object detection and association jointly as a
consistent denoising diffusion process from paired noise boxes to paired
ground-truth boxes. This novel progressive denoising diffusion strategy
substantially augments the tracker's effectiveness, enabling it to discriminate
between various objects. During the training stage, paired object boxes diffuse
from paired ground-truth boxes to random distribution, and the model learns
detection and tracking simultaneously by reversing this noising process. In
inference, the model refines a set of paired randomly generated boxes to the
detection and tracking results in a flexible one-step or multi-step denoising
diffusion process. Extensive experiments on three widely used MOT benchmarks,
including MOT17, MOT20, and DanceTrack, demonstrate that our approach achieves
competitive performance compared to the current state-of-the-art methods.
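The noising and denoising steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine noise schedule, and the normalized (cx, cy, w, h) box format are assumptions made for illustration, and the "model" is replaced by an oracle that already knows the clean value.

```python
import math
import random

def cosine_alpha_bar(t, num_steps, s=0.008):
    """Cumulative signal level at step t under a cosine schedule (assumed choice)."""
    f = lambda u: math.cos((u / num_steps + s) / (1 + s) * math.pi / 2) ** 2
    return f(t) / f(0)

def diffuse_paired_boxes(pairs, t, num_steps, rng):
    """Forward-noise paired boxes: x_t = sqrt(a) * x_0 + sqrt(1 - a) * eps.

    pairs: list of (box_prev, box_curr); each box is a hypothetical
    (cx, cy, w, h) tuple describing the same object in two adjacent frames.
    """
    a = cosine_alpha_bar(t, num_steps)
    noisy = []
    for pair in pairs:
        noisy.append(tuple(
            tuple(math.sqrt(a) * v + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
                  for v in box)
            for box in pair
        ))
    return noisy

def ddim_step(x_t, x0_pred, t, t_next, num_steps):
    """Deterministic DDIM-style update (eta = 0) from step t > 0 to t_next.

    x0_pred stands in for the network's prediction of the clean coordinate.
    """
    a_t = cosine_alpha_bar(t, num_steps)
    a_next = cosine_alpha_bar(t_next, num_steps)
    eps = (x_t - math.sqrt(a_t) * x0_pred) / math.sqrt(1.0 - a_t)
    return math.sqrt(a_next) * x0_pred + math.sqrt(1.0 - a_next) * eps

if __name__ == "__main__":
    rng = random.Random(0)
    gt_pairs = [((0.50, 0.50, 0.20, 0.40), (0.52, 0.50, 0.20, 0.40))]
    noisy = diffuse_paired_boxes(gt_pairs, t=500, num_steps=1000, rng=rng)
    # One-step denoising of a single coordinate with an oracle prediction.
    restored = ddim_step(noisy[0][0][0], 0.50, t=500, t_next=0, num_steps=1000)
    print(noisy, restored)
```

Under these assumptions, jumping directly from step t to step 0 corresponds to one-step denoising, while chaining several `ddim_step` calls with decreasing steps corresponds to the multi-step variant. In the actual method, a trained network would supply the clean-box prediction for both boxes of each pair.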
Related papers
- DINTR: Tracking via Diffusion-based Interpolation [12.130669304428565]
This work proposes a novel diffusion-based methodology to formulate the tracking task.
Our INterpolation TrackeR (DINTR) presents a promising new paradigm and achieves a superior multiplicity on seven benchmarks across five indicator representations.
arXiv Detail & Related papers (2024-10-14T00:41:58Z)
- ConsistencyTrack: A Robust Multi-Object Tracker with a Generation Strategy of Consistency Model [20.259334882471574]
Multi-object tracking (MOT) is a critical technology in computer vision, designed to detect multiple targets in video sequences and assign each target a unique ID per frame.
Existing MOT methods excel at accurately tracking multiple objects in real-time across various scenarios.
We propose ConsistencyTrack, a novel joint detection and tracking (JDT) framework that formulates detection and association as a denoising diffusion process on bounding boxes.
arXiv Detail & Related papers (2024-08-28T05:53:30Z)
- Cross-Modal Learning for Anomaly Detection in Complex Industrial Process: Methodology and Benchmark [19.376814754500625]
Anomaly detection in complex industrial processes plays a pivotal role in ensuring efficient, stable, and secure operation.
This paper proposes a cross-modal Transformer to facilitate anomaly detection by exploring the correlation between visual features (video) and process variables (current) in the context of the fused magnesium smelting process.
We present a pioneering cross-modal benchmark of the fused magnesium smelting process, featuring synchronously acquired video and current data for over 2.2 million samples.
arXiv Detail & Related papers (2024-06-13T11:40:06Z)
- Diffusion-Based Particle-DETR for BEV Perception [94.88305708174796]
Bird-Eye-View (BEV) is one of the most widely-used scene representations for visual perception in Autonomous Vehicles (AVs).
Recent diffusion-based methods offer a promising approach to uncertainty modeling for visual perception but fail to effectively detect small objects in the large coverage of the BEV.
Here, we address this problem by combining the diffusion paradigm with current state-of-the-art 3D object detectors in BEV.
arXiv Detail & Related papers (2023-12-18T09:52:14Z)
- Multiple Object Tracking based on Occlusion-Aware Embedding Consistency Learning [46.726678333518066]
The method comprises an Occlusion Prediction Module (OPM) and an Occlusion-Aware Association Module (OAAM).
OPM predicts occlusion information for each true detection, facilitating the selection of valid samples for consistency learning of the track's visual embedding.
OAAM generates two separate embeddings for each track, guaranteeing consistency in both unoccluded and occluded detections.
arXiv Detail & Related papers (2023-11-05T06:08:58Z)
- DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions [52.63323657077447]
We propose DNMOT, an end-to-end trainable DeNoising Transformer for multiple object tracking.
Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture.
We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
arXiv Detail & Related papers (2023-09-09T04:40:01Z)
- Diffusion-based 3D Object Detection with Random Boxes [58.43022365393569]
Existing anchor-based 3D detection methods rely on empirical anchor settings, which makes the algorithms lack elegance.
Our proposed Diff3Det migrates the diffusion model to proposal generation for 3D object detection by considering the detection boxes as generative targets.
In the inference stage, the model progressively refines a set of random boxes to the prediction results.
arXiv Detail & Related papers (2023-09-05T08:49:53Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- CamoDiffusion: Camouflaged Object Detection via Conditional Diffusion Models [72.93652777646233]
Camouflaged Object Detection (COD) is a challenging task in computer vision due to the high similarity between camouflaged objects and their surroundings.
We propose a new paradigm that treats COD as a conditional mask-generation task leveraging diffusion models.
Our method, dubbed CamoDiffusion, employs the denoising process of diffusion models to iteratively reduce the noise of the mask.
arXiv Detail & Related papers (2023-05-29T07:49:44Z)
- Multimodal Object Detection via Bayesian Fusion [59.31437166291557]
We study multimodal object detection with RGB and thermal cameras, since the latter can provide much stronger object signatures under poor illumination.
Our key contribution is a non-learned late-fusion method that fuses together bounding box detections from different modalities.
We apply our approach to benchmarks containing both aligned (KAIST) and unaligned (FLIR) multimodal sensor data.
arXiv Detail & Related papers (2021-04-07T04:03:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.