DecoderTracker: Decoder-Only Method for Multiple-Object Tracking
- URL: http://arxiv.org/abs/2310.17170v4
- Date: Fri, 24 May 2024 03:53:26 GMT
- Title: DecoderTracker: Decoder-Only Method for Multiple-Object Tracking
- Authors: Liao Pan, Yang Feng, Wu Di, Liu Bo, Zhang Xingle
- Abstract summary: This paper constructs a lightweight decoder-only model, DecoderTracker, for end-to-end multi-object tracking.
Specifically, we developed an efficient image feature extraction network to replace the encoder structure.
On the DanceTrack dataset, without any bells and whistles, DecoderTracker's tracking performance slightly surpasses that of MOTR, with approximately twice the inference speed.
- Score: 10.819349280398363
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Decoder-only models, such as GPT, have demonstrated superior performance in many areas compared to traditional encoder-decoder transformer models. Over the years, end-to-end models based on the traditional transformer structure, like MOTR, have achieved remarkable performance in multi-object tracking. However, the significant computational resource consumption of these models leads to slow inference and long training times. To address these issues, this paper constructs a lightweight decoder-only model, DecoderTracker, for end-to-end multi-object tracking. Specifically, drawing on real-time detection models, we developed an efficient image feature extraction network to replace the encoder structure. Beyond these minor network innovations, we analyze the potential reasons for the slow training of MOTR-like models and propose an effective training strategy to mitigate the issue of prolonged training times. On the DanceTrack dataset, without any bells and whistles, DecoderTracker's tracking performance slightly surpasses that of MOTR, with approximately twice the inference speed. Furthermore, DecoderTracker requires significantly less training time than MOTR.
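To make the decoder-only design concrete, here is a minimal PyTorch sketch of the general pattern: a lightweight convolutional backbone stands in for the transformer encoder, and a decoder cross-attends to its flattened features with detection queries plus carried-over track queries. The module names, dimensions, and heads are illustrative assumptions, not DecoderTracker's actual architecture.

```python
# A minimal sketch of a decoder-only tracker; all sizes are assumptions.
import torch
import torch.nn as nn

class DecoderOnlyTracker(nn.Module):
    def __init__(self, d_model=256, num_queries=100, num_layers=6):
        super().__init__()
        # Lightweight conv backbone standing in for the paper's efficient
        # feature-extraction network that replaces the transformer encoder.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        self.detect_queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.box_head = nn.Linear(d_model, 4)     # (cx, cy, w, h)
        self.score_head = nn.Linear(d_model, 1)   # objectness score

    def forward(self, frame, track_queries=None):
        b = frame.shape[0]
        feats = self.backbone(frame)               # (B, C, H', W')
        memory = feats.flatten(2).transpose(1, 2)  # (B, H'W', C)
        queries = self.detect_queries.expand(b, -1, -1)
        if track_queries is not None:              # carry identities forward
            queries = torch.cat([track_queries, queries], dim=1)
        hs = self.decoder(queries, memory)         # cross-attend to image feats
        return self.box_head(hs), self.score_head(hs), hs
```

In an online loop, the output embeddings of confidently detected objects would be fed back as track_queries for the next frame, which is how MOTR-style trackers propagate identities.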
Related papers
- DELTAv2: Accelerating Dense 3D Tracking [79.63990337419514]
We propose a novel algorithm for accelerating dense long-term 3D point tracking in videos.
We introduce a coarse-to-fine strategy that begins tracking with a small subset of points and progressively expands the set of tracked trajectories.
The newly added trajectories are initialized using a learnable module, which is trained end-to-end alongside the tracking network.
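As a rough illustration of the coarse-to-fine idea, the sketch below expands the tracked set by initializing new trajectories from already-tracked neighbors; the paper's learnable initialization module is replaced here by a nearest-neighbor heuristic, purely as a stand-in.

```python
# Illustrative coarse-to-fine expansion: new trajectories are seeded from
# the nearest already-tracked point (the paper uses a learned module).
import numpy as np

def expand_tracks(all_points, tracked_idx, tracks, new_idx):
    """all_points: (N, 2) first-frame positions; tracks: (K, T, 2)
    trajectories of the K points listed in tracked_idx."""
    new_tracks = np.empty((len(new_idx), tracks.shape[1], 2))
    for i, p in enumerate(new_idx):
        # Nearest already-tracked point in the first frame.
        d = np.linalg.norm(all_points[tracked_idx] - all_points[p], axis=1)
        j = int(d.argmin())
        # Initialize the new trajectory by offsetting the neighbor's track.
        offset = all_points[p] - all_points[tracked_idx[j]]
        new_tracks[i] = tracks[j] + offset
    return new_tracks
```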
arXiv Detail & Related papers (2025-08-02T03:15:47Z) - Online Dense Point Tracking with Streaming Memory [54.22820729477756]
Dense point tracking is a challenging task requiring the continuous tracking of every point in the initial frame throughout a substantial portion of a video.
Recent point tracking algorithms usually depend on sliding windows for indirect information propagation from the first frame to the current one.
We present a lightweight and fast model with Streaming memory for dense POint Tracking and online video processing.
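A minimal sketch of the streaming-memory pattern, under assumed shapes and module choices: a bounded FIFO of past per-point features serves as attention context for each incoming frame, so the state stays constant-size during online processing.

```python
# Streaming memory as a fixed-size FIFO of past features (a sketch, not
# the paper's exact design).
from collections import deque
import torch
import torch.nn as nn

class StreamingMemoryTracker(nn.Module):
    def __init__(self, d_model=128, memory_size=8):
        super().__init__()
        self.memory = deque(maxlen=memory_size)  # FIFO of (N, C) features
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    @torch.no_grad()
    def step(self, point_feats):                 # (N, C) per-point features
        if self.memory:
            mem = torch.cat(list(self.memory), dim=0).unsqueeze(0)  # (1, M, C)
            q = point_feats.unsqueeze(0)
            point_feats, _ = self.attn(q, mem, mem)  # read from the past
            point_feats = point_feats.squeeze(0)
        self.memory.append(point_feats)          # constant-size state
        return point_feats                       # refined per-point features
```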
arXiv Detail & Related papers (2025-03-09T06:16:49Z) - OneTrack-M: A multitask approach to transformer-based MOT models [0.0]
Multi-Object Tracking (MOT) is a critical problem in computer vision.
OneTrack-M is a transformer-based MOT model designed to enhance tracking computational efficiency and accuracy.
arXiv Detail & Related papers (2025-02-06T20:02:06Z) - FastTrackTr: Towards Fast Multi-Object Tracking with Transformers [33.60043236101685]
Transformer-based multi-object tracking (MOT) models often suffer from slow inference speeds due to their structure or other issues.
This paper employs an efficient method of information transfer between frames on top of DETR, constructing a fast and novel JDT-type MOT framework: FastTrackTr.
Thanks to the superiority of this information transfer method, our approach not only reduces the number of queries required during tracking but also avoids the excessive introduction of network structures.
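One hedged way to picture DETR-level information transfer between frames (the exact FastTrackTr mechanism may differ): append the previous frame's decoder outputs to the current frame's memory so detection queries can read track history.

```python
# Cross-frame information transfer on a DETR-style decoder; names and
# sizes are assumptions, not FastTrackTr's published design.
import torch
import torch.nn as nn

class CrossFrameDecoder(nn.Module):
    def __init__(self, d_model=256, num_queries=60):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))

    def forward(self, img_memory, prev_out=None):
        # img_memory: (B, HW, C) image tokens; prev_out: (B, Q, C) or None.
        if prev_out is not None:
            # Let queries attend to last frame's outputs as extra memory.
            img_memory = torch.cat([img_memory, prev_out], dim=1)
        q = self.queries.expand(img_memory.shape[0], -1, -1)
        out = self.decoder(q, img_memory)  # (B, Q, C); feed back next frame
        return out
```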
arXiv Detail & Related papers (2024-11-24T12:34:02Z) - MCTrack: A Unified 3D Multi-Object Tracking Framework for Autonomous Driving [10.399817864597347]
This paper introduces MCTrack, a new 3D multi-object tracking method that achieves state-of-the-art (SOTA) performance across the KITTI, nuScenes, and Waymo datasets.
arXiv Detail & Related papers (2024-09-23T11:26:01Z) - Learning Motion Blur Robust Vision Transformers with Dynamic Early Exit for Real-Time UAV Tracking [14.382072224997074]
Single-stream architectures utilizing pre-trained ViT backbones offer improved performance, efficiency, and robustness.
We boost the efficiency of this framework by tailoring it into an adaptive framework that dynamically exits Transformer blocks for real-time UAV tracking.
We also improve the effectiveness of ViTs in handling motion blur, a common issue in UAV tracking caused by the fast movements of either the UAV, the tracked objects, or both.
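A simple early-exit loop of the kind such adaptive frameworks use, sketched with an assumed per-layer confidence head and threshold:

```python
# Dynamic early exit over ViT blocks (an assumption-laden sketch, not the
# paper's exact gating): stop once a confidence head clears a threshold.
import torch
import torch.nn as nn

class EarlyExitViT(nn.Module):
    def __init__(self, d_model=192, depth=12, threshold=0.9):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(depth)
        )
        self.exit_heads = nn.ModuleList(nn.Linear(d_model, 1) for _ in range(depth))
        self.threshold = threshold

    def forward(self, tokens):                     # (B, N, C) patch tokens
        for block, head in zip(self.blocks, self.exit_heads):
            tokens = block(tokens)
            conf = torch.sigmoid(head(tokens.mean(dim=1)))  # per-sample score
            if bool((conf > self.threshold).all()):         # exit when confident
                break
        return tokens
```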
arXiv Detail & Related papers (2024-07-07T14:10:04Z) - LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection [63.780355815743135]
We present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection.
The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder.
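The described stack is simple enough to sketch directly; the layer counts and widths below are hypothetical, not LW-DETR's published configuration.

```python
# ViT encoder -> projector -> shallow DETR decoder, with assumed sizes.
import torch
import torch.nn as nn

class LWDetrSketch(nn.Module):
    def __init__(self, d_vit=192, d_model=256, num_queries=100):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_vit, nhead=3, batch_first=True)
        self.vit = nn.TransformerEncoder(enc_layer, num_layers=6)      # ViT body
        self.projector = nn.Linear(d_vit, d_model)                     # channel mapper
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=3)  # shallow
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, patch_tokens):               # (B, N, d_vit) patch embeddings
        memory = self.projector(self.vit(patch_tokens))
        q = self.queries.expand(patch_tokens.shape[0], -1, -1)
        return self.box_head(self.decoder(q, memory))
```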
arXiv Detail & Related papers (2024-06-05T17:07:24Z) - 4D ASR: Joint Beam Search Integrating CTC, Attention, Transducer, and Mask Predict Decoders [53.297697898510194]
We propose a joint modeling scheme where four decoders share the same encoder -- we refer to this as 4D modeling.
To efficiently train the 4D model, we introduce a two-stage training strategy that stabilizes multitask learning.
In addition, we propose three novel one-pass beam search algorithms by combining three decoders.
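A reduced sketch of the shared-encoder idea with just two of the four decoders (CTC and attention), using assumed sizes; the causal mask and joint beam search are omitted for brevity.

```python
# One shared encoder feeding multiple decoders (simplified to two).
import torch
import torch.nn as nn

class MultiDecoderASR(nn.Module):
    def __init__(self, d_model=256, vocab=1000):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)  # shared
        self.ctc_head = nn.Linear(d_model, vocab + 1)   # +1 for the CTC blank
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.att_decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.embed = nn.Embedding(vocab, d_model)
        self.att_head = nn.Linear(d_model, vocab)

    def forward(self, feats, prev_tokens):
        enc = self.encoder(feats)                   # one encoder pass ...
        ctc_logits = self.ctc_head(enc)             # ... feeds every decoder
        att = self.att_decoder(self.embed(prev_tokens), enc)
        return ctc_logits, self.att_head(att)
```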
arXiv Detail & Related papers (2024-06-05T05:18:20Z) - Exploring Dynamic Transformer for Efficient Object Tracking [58.120191254379854]
We propose DyTrack, a dynamic transformer framework for efficient tracking.
DyTrack automatically learns to configure proper reasoning routes for various inputs, gaining better utilization of the available computational budget.
Experiments on multiple benchmarks demonstrate that DyTrack achieves promising speed-precision trade-offs with only a single model.
arXiv Detail & Related papers (2024-03-26T12:31:58Z) - Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
A hybrid CTC/RNNT architecture utilizes a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation.
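A minimal sketch of such an activation cache, with illustrative shapes: each chunk attends to a bounded buffer of past activations, so a non-autoregressive layer can be evaluated chunk-by-chunk at constant per-step cost.

```python
# Activation caching for streaming inference (sizes are assumptions).
import torch
import torch.nn as nn

class CachedAttentionLayer(nn.Module):
    def __init__(self, d_model=256, cache_size=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.cache_size = cache_size
        self.cache = None                          # (1, <=cache_size, C)

    @torch.no_grad()
    def step(self, chunk):                         # (1, T_chunk, C)
        kv = chunk if self.cache is None else torch.cat([self.cache, chunk], dim=1)
        out, _ = self.attn(chunk, kv, kv)          # attend to cached left context
        self.cache = kv[:, -self.cache_size:]      # keep a bounded cache
        return out
```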
arXiv Detail & Related papers (2023-12-27T21:04:26Z) - NASH: A Simple Unified Framework of Structured Pruning for Accelerating Encoder-Decoder Language Models [29.468888611690346]
We propose a simple and effective framework, NASH, that narrows the encoder and shortens the decoder networks of encoder-decoder models.
Our findings highlight two insights: (1) the number of decoder layers is the dominant factor of inference speed, and (2) low sparsity in the pruned encoder network enhances generation quality.
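Read together, the two insights suggest a "keep the encoder, shorten the decoder" recipe. A naive sketch of the depth side via uniform layer subsampling follows; NASH's actual structured pruning is more involved, the encoder narrowing is not shown, and the keep-ratio is an assumption.

```python
# Uniformly subsample decoder layers, since decoder depth is the dominant
# factor in inference speed (a sketch, not NASH's method).
import torch.nn as nn

def shorten_decoder(decoder_layers, keep_ratio=0.25):
    n = max(1, round(len(decoder_layers) * keep_ratio))
    step = len(decoder_layers) / n
    # Evenly spaced layer indices, deduplicated and kept in order.
    idx = sorted({min(int(i * step), len(decoder_layers) - 1) for i in range(n)})
    return nn.ModuleList(decoder_layers[i] for i in idx)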
arXiv Detail & Related papers (2023-10-16T04:27:36Z) - DeNoising-MOT: Towards Multiple Object Tracking with Severe Occlusions [52.63323657077447]
We propose DNMOT, an end-to-end trainable DeNoising Transformer for multiple object tracking.
Specifically, we augment the trajectory with noises during training and make our model learn the denoising process in an encoder-decoder architecture.
We conduct extensive experiments on the MOT17, MOT20, and DanceTrack datasets, and the experimental results show that our method outperforms previous state-of-the-art methods by a clear margin.
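The denoising idea can be sketched as a training-time augmentation (a generic version, not DNMOT's exact noise model): jitter ground-truth track boxes and train the decoder to recover the clean trajectories.

```python
# Noise augmentation for trajectory denoising training (scale is assumed).
import torch

def add_trajectory_noise(boxes, scale=0.1):
    """boxes: (T, N, 4) as (cx, cy, w, h), normalized to [0, 1]."""
    # Scale the jitter by each box's width/height so noise is size-relative.
    noise = torch.randn_like(boxes) * scale * boxes[..., 2:].repeat(1, 1, 2)
    return (boxes + noise).clamp(0.0, 1.0)  # noisy targets fed to the decoder
```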
arXiv Detail & Related papers (2023-09-09T04:40:01Z) - Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking [3.781471919731034]
Multi-object tracking (MOT) at low frame rates can reduce computational, storage and power overhead to better meet the constraints of edge devices.
We propose to explore collaborative tracking learning (ColTrack) for frame-rate-insensitive MOT in a query-based end-to-end manner.
arXiv Detail & Related papers (2023-08-11T02:25:58Z) - Sparse Message Passing Network with Feature Integration for Online Multiple Object Tracking [6.510588721127479]
Experimental results demonstrate that a simple online MPN with these two contributions can perform better than many state-of-the-art methods.
Our association method generalizes well and can also improve the results of private-detection-based methods.
arXiv Detail & Related papers (2022-12-06T14:10:57Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - Efficient Few-Shot Object Detection via Knowledge Inheritance [62.36414544915032]
Few-shot object detection (FSOD) aims at learning a generic detector that can adapt to unseen tasks with scarce training samples.
We present an efficient pretrain-transfer framework (PTF) baseline with no computational increment.
We also propose an adaptive length re-scaling (ALR) strategy to alleviate the vector length inconsistency between the predicted novel weights and the pretrained base weights.
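ALR as described reduces to a small norm-matching operation; here is a sketch under the assumption that the reference statistic is the mean base-weight norm.

```python
# Adaptive length re-scaling: match predicted novel-class weight norms to
# the pretrained base-class weights (reference statistic is assumed).
import torch

def rescale_novel_weights(novel_w, base_w, eps=1e-8):
    """novel_w: (Kn, D) predicted weights; base_w: (Kb, D) pretrained weights."""
    target = base_w.norm(dim=1).mean()       # reference vector length
    cur = novel_w.norm(dim=1, keepdim=True)
    return novel_w * (target / (cur + eps))  # align norms, keep directions
```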
arXiv Detail & Related papers (2022-03-23T06:24:31Z) - Efficient Visual Tracking with Exemplar Transformers [98.62550635320514]
We introduce the Exemplar Transformer, an efficient transformer for real-time visual object tracking.
E.T.Track, our visual tracker that incorporates Exemplar Transformer layers, runs at 47 fps on a CPU.
This is up to 8 times faster than other transformer-based models.
arXiv Detail & Related papers (2021-12-17T18:57:54Z) - Faster Meta Update Strategy for Noise-Robust Deep Learning [62.08964100618873]
We introduce a novel Faster Meta Update Strategy (FaMUS) to replace the most expensive step in the meta gradient with a faster layer-wise approximation.
We show our method saves two-thirds of the training time while maintaining comparable, or achieving even better, generalization performance.
arXiv Detail & Related papers (2021-04-30T16:19:07Z) - Learning Spatio-Temporal Transformer for Visual Tracking [108.11680070733598]
We present a new tracking architecture with an encoder-decoder transformer as the key component.
The whole method is end-to-end and does not need any postprocessing steps such as cosine windowing or bounding box smoothing.
The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks, while running real-time speed, being 6x faster than Siam R-CNN.
arXiv Detail & Related papers (2021-03-31T15:19:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.