MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking
- URL: http://arxiv.org/abs/2307.15700v3
- Date: Wed, 21 Feb 2024 16:52:39 GMT
- Title: MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking
- Authors: Ruopeng Gao, Limin Wang
- Abstract summary: We propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking.
On DanceTrack, MeMOTR surpasses the previous state-of-the-art method by 7.9% and 13.0% on the HOTA and AssA metrics, respectively.
Our model also outperforms other Transformer-based methods on association performance on MOT17 and generalizes well on BDD100K.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As a video task, Multiple Object Tracking (MOT) is expected to capture
temporal information of targets effectively. Unfortunately, most existing
methods only explicitly exploit the object features between adjacent frames,
while lacking the capacity to model long-term temporal information. In this
paper, we propose MeMOTR, a long-term memory-augmented Transformer for
multi-object tracking. Our method is able to make the same object's track
embedding more stable and distinguishable by leveraging long-term memory
injection with a customized memory-attention layer. This significantly improves
the target association ability of our model. Experimental results on DanceTrack
show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and
13.0% on HOTA and AssA metrics, respectively. Furthermore, our model also
outperforms other Transformer-based methods on association performance on MOT17
and generalizes well on BDD100K. Code is available at
https://github.com/MCG-NJU/MeMOTR.
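
To make the abstract's mechanism concrete, below is a minimal PyTorch-style sketch of the two ingredients it names: a long-term memory that is updated slowly over time (assumed here to be an exponential moving average, a common choice) and a memory-attention layer that injects that memory into each track embedding. All class, parameter, and variable names are illustrative, not the repository's actual API.

```python
import torch
import torch.nn as nn

class LongTermMemorySketch(nn.Module):
    """Illustrative sketch, not the authors' code: a slowly updated long-term
    memory is injected into each track's embedding via an attention layer."""

    def __init__(self, dim: int = 256, update_rate: float = 0.01):
        super().__init__()
        self.update_rate = update_rate  # assumed EMA coefficient
        self.memory_attn = nn.MultiheadAttention(dim, num_heads=8,
                                                 batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, track_embed: torch.Tensor, memory: torch.Tensor):
        # track_embed, memory: (num_tracks, dim)
        fused = self.norm(track_embed + memory)  # long-term memory injection
        # tracks attend over each other's memory-conditioned embeddings
        out, _ = self.memory_attn(fused.unsqueeze(0), fused.unsqueeze(0),
                                  fused.unsqueeze(0))
        track_embed = track_embed + out.squeeze(0)
        # a small update rate keeps the memory stable across frames
        memory = (1 - self.update_rate) * memory \
                 + self.update_rate * track_embed
        return track_embed, memory.detach()
```

Under this reading, the small update rate is what would keep the same identity's embedding stable and distinguishable across long gaps, which is the property the abstract credits for the improved association.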
Related papers
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
- Contrastive Learning for Multi-Object Tracking with Transformers
We show how DETR can be turned into a MOT model by employing an instance-level contrastive loss.
Our training scheme learns object appearances while preserving detection capabilities and with little overhead.
Its performance surpasses the previous state-of-the-art by +2.6 mMOTA on the challenging BDD100K dataset.
arXiv Detail & Related papers (2023-11-14T10:07:52Z)
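
The contrastive-learning entry above turns DETR into a tracker with an instance-level contrastive loss. The sketch below shows one plausible InfoNCE-style formulation (my assumption, not necessarily the paper's exact objective), where the same instance's embeddings in two frames are positives and all other instances are negatives.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(emb_a: torch.Tensor,
                              emb_b: torch.Tensor,
                              temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style sketch: row i of emb_a and emb_b is the same instance
    seen in two different frames; every other row is a negative."""
    a = F.normalize(emb_a, dim=-1)        # (N, D) embeddings from frame t
    b = F.normalize(emb_b, dim=-1)        # (N, D) embeddings from frame t+k
    logits = a @ b.t() / temperature      # (N, N) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric cross-entropy over both matching directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```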
- Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking
Multi-object tracking (MOT) at low frame rates can reduce computational, storage and power overhead to better meet the constraints of edge devices.
We propose to explore collaborative tracking learning (ColTrack) for frame-rate-insensitive MOT in a query-based end-to-end manner.
arXiv Detail & Related papers (2023-08-11T02:25:58Z)
- MotionTrack: End-to-End Transformer-based Multi-Object Tracking with LiDAR-Camera Fusion
We propose an end-to-end transformer-based MOT algorithm (MotionTrack) with multi-modality sensor inputs to track objects of multiple classes.
MotionTrack and its variants achieve better results (AMOTA of 0.55) on the nuScenes dataset than other classical baseline models.
arXiv Detail & Related papers (2023-06-29T15:00:12Z)
- DFR-FastMOT: Detection Failure Resistant Tracker for Fast Multi-Object Tracking Based on Sensor Fusion
Persistent multi-object tracking (MOT) allows autonomous vehicles to navigate safely in highly dynamic environments.
One of the well-known challenges in MOT is object occlusion, where an object becomes unobservable in subsequent frames.
We propose DFR-FastMOT, a lightweight MOT method that fuses data from camera and LiDAR sensors.
Our framework processes about 7,763 frames in 1.48 seconds, which is seven times faster than recent benchmarks.
arXiv Detail & Related papers (2023-02-28T17:57:06Z)
- MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition
We build a memory-augmented vision transformer that has a temporal support 30x longer than existing models.
MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets.
arXiv Detail & Related papers (2022-01-20T18:59:54Z)
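
The MeMViT entry above extends temporal support by caching activations from earlier clips. Here is a rough sketch of that memory-augmented attention idea; it is simplified (the actual model also compresses the cache with learned pooling) and all names are hypothetical.

```python
import torch
import torch.nn as nn

class CachedMemoryAttention(nn.Module):
    """Sketch of attending over activations cached from earlier clips,
    extending temporal support without reprocessing past frames."""

    def __init__(self, dim: int = 256, heads: int = 8, max_mem: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.max_mem = max_mem              # number of past clips to keep
        self.memory: list[torch.Tensor] = []

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) for the current clip
        context = torch.cat(self.memory + [x], dim=1) if self.memory else x
        out, _ = self.attn(x, context, context)  # queries see past + present
        self.memory.append(x.detach())           # cache without gradients
        self.memory = self.memory[-self.max_mem:]
        return out
```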
- MOTR: End-to-End Multiple-Object Tracking with TRansformer
We present MOTR, the first fully end-to-end multiple-object tracking framework.
It learns to model the long-range temporal variation of the objects.
Results show that MOTR achieves state-of-the-art performance.
arXiv Detail & Related papers (2021-05-07T13:27:01Z)
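
Because MeMOTR builds on this query-based design, a toy version of MOTR-style track-query propagation may help; the decoder, confidence head, and threshold below are placeholders, not the official implementation.

```python
import torch
import torch.nn as nn

class QueryPropagationSketch(nn.Module):
    """Toy MOTR-style loop: detect queries discover new objects, and
    confident outputs survive as track queries that carry identity."""

    def __init__(self, dim: int = 256, num_det_queries: int = 100):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.det_queries = nn.Parameter(torch.randn(num_det_queries, dim))
        self.score_head = nn.Linear(dim, 1)  # hypothetical confidence head

    def forward(self, frames: list, thresh: float = 0.5) -> torch.Tensor:
        track_queries = torch.zeros(0, self.det_queries.size(1))
        for feats in frames:                 # each feats: (1, tokens, dim)
            queries = torch.cat([track_queries, self.det_queries], dim=0)
            hidden = self.decoder(queries.unsqueeze(0), feats).squeeze(0)
            scores = self.score_head(hidden).sigmoid().squeeze(-1)
            # high-confidence outputs become the next frame's track queries
            track_queries = hidden[scores > thresh].detach()
        return track_queries
```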
- TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking
We propose a solution named TransMOT to efficiently model the spatial and temporal interactions among objects in a video.
TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy.
The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20.
arXiv Detail & Related papers (2021-04-01T01:49:05Z)
- Simultaneous Detection and Tracking with Motion Modelling for Multiple Object Tracking
We introduce Deep Motion Modeling Network (DMM-Net) that can estimate multiple objects' motion parameters to perform joint detection and association.
DMM-Net achieves a PR-MOTA score of 12.80 at over 120 FPS on the popular UA-DETRAC challenge, offering better performance at orders-of-magnitude higher speed.
We also contribute Omni-MOT, a large-scale synthetic public dataset for vehicle tracking with precise ground-truth annotations.
arXiv Detail & Related papers (2020-08-20T08:05:33Z)
- DMV: Visual Object Tracking via Part-level Dense Memory and Voting-based Retrieval
We propose DMV, a novel memory-based tracker built on part-level dense memory and voting-based retrieval.
We also propose a novel voting mechanism for memory reading that filters out unreliable information in the memory.
arXiv Detail & Related papers (2020-03-20T10:05:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.