TF-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking
- URL: http://arxiv.org/abs/2407.04327v2
- Date: Mon, 15 Jul 2024 08:13:39 GMT
- Title: TF-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking
- Authors: Thuc Nguyen-Quang, Minh-Triet Tran,
- Abstract summary: Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences.
We propose a novel memory-based approach that selectively stores critical features based on object motion and overlapping awareness.
Our approach significantly improves over MOTRv2 in the DanceTrack test set, demonstrating a gain of 2.0% AssA score and 2.1% in IDF1 score.
- Score: 6.91631684487121
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences. The emergence of data sets that emphasize robust reidentification, such as DanceTrack, has highlighted the need for effective solutions. While memory-based approaches have shown promise, they often suffer from high computational complexity and memory usage due to storing feature at every single frame. In this paper, we propose a novel memory-based approach that selectively stores critical features based on object motion and overlapping awareness, aiming to enhance efficiency while minimizing redundancy. As a result, our method not only store longer temporal information with limited number of stored features in the memory, but also diversify states of a particular object to enhance the association performance. Our approach significantly improves over MOTRv2 in the DanceTrack test set, demonstrating a gain of 2.0% AssA score and 2.1% in IDF1 score.
Related papers
- MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD [27.472705540825316]
This paper is on long-term video understanding where the goal is to recognise human actions over long temporal windows (up to minutes long)
We propose an alternative to attention-based schemes which is based on a low-rank approximation of the memory obtained using Singular Value Decomposition.
Our scheme has two advantages: (a) it reduces complexity by more than an order of magnitude, and (b) it is amenable to an efficient implementation for the calculation of the memory bases.
arXiv Detail & Related papers (2024-06-11T12:03:57Z) - MAMBA: Multi-level Aggregation via Memory Bank for Video Object
Detection [35.16197118579414]
We propose a multi-level aggregation architecture via memory bank called MAMBA.
Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods.
Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy.
arXiv Detail & Related papers (2024-01-18T12:13:06Z) - Transformer Network for Multi-Person Tracking and Re-Identification in
Unconstrained Environment [0.6798775532273751]
Multi-object tracking (MOT) has profound applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics.
We put forward an integrated MOT method that marries object detection and identity linkage within a singular, end-to-end trainable framework.
Our system leverages a robust memory-temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator.
arXiv Detail & Related papers (2023-12-19T08:15:22Z) - Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z) - SpikeMOT: Event-based Multi-Object Tracking with Sparse Motion Features [52.213656737672935]
SpikeMOT is an event-based multi-object tracker.
SpikeMOT uses spiking neural networks to extract sparsetemporal features from event streams associated with objects.
arXiv Detail & Related papers (2023-09-29T05:13:43Z) - Robust and Efficient Memory Network for Video Object Segmentation [6.7995672846437305]
This paper proposes a Robust and Efficient Memory Network, or REMN, for studying semi-supervised video object segmentation (VOS)
We introduce a local attention mechanism that tackles the background distraction by enhancing the features of foreground objects with the previous mask.
Experiments demonstrate that our REMN achieves state-of-the-art results on DAVIS 2017, with a $mathcalJ&F$ score of 86.3% and on YouTube-VOS 2018, with a $mathcalG$ over mean of 85.5%.
arXiv Detail & Related papers (2023-04-24T06:19:21Z) - Modeling Continuous Motion for 3D Point Cloud Object Tracking [54.48716096286417]
This paper presents a novel approach that views each tracklet as a continuous stream.
At each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank.
To enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed.
arXiv Detail & Related papers (2023-03-14T02:58:27Z) - MeMOT: Multi-Object Tracking with Memory [97.48960039220823]
Our model, called MeMOT, consists of three main modules that are all Transformer-based.
MeMOT observes very competitive performance on widely adopted MOT datasets.
arXiv Detail & Related papers (2022-03-31T02:33:20Z) - Learning Dynamic Compact Memory Embedding for Deformable Visual Object
Tracking [82.34356879078955]
We propose a compact memory embedding to enhance the discrimination of the segmentation-based deformable visual tracking method.
Our method outperforms the excellent segmentation-based trackers, i.e., D3S and SiamMask on DAVIS 2017 benchmark.
arXiv Detail & Related papers (2021-11-23T03:07:12Z) - Rethinking Space-Time Networks with Improved Memory Coverage for
Efficient Video Object Segmentation [68.45737688496654]
We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
arXiv Detail & Related papers (2021-06-09T16:50:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.