MAMBA: Multi-level Aggregation via Memory Bank for Video Object
Detection
- URL: http://arxiv.org/abs/2401.09923v2
- Date: Thu, 1 Feb 2024 18:43:06 GMT
- Title: MAMBA: Multi-level Aggregation via Memory Bank for Video Object
Detection
- Authors: Guanxiong Sun, Yang Hua, Guosheng Hu, Neil Robertson
- Abstract summary: We propose a multi-level aggregation architecture via memory bank called MAMBA.
Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods.
Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy.
- Score: 35.16197118579414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art video object detection methods maintain a memory structure,
either a sliding window or a memory queue, to enhance the current frame using
attention mechanisms. However, we argue that these memory structures are not
efficient or sufficient because of two implied operations: (1) concatenating
all features in memory for enhancement, leading to a heavy computational cost;
(2) frame-wise memory updating, preventing the memory from capturing more
temporal information. In this paper, we propose a multi-level aggregation
architecture via memory bank called MAMBA. Specifically, our memory bank
employs two novel operations to eliminate the disadvantages of existing
methods: (1) light-weight key-set construction which can significantly reduce
the computational cost; (2) fine-grained feature-wise updating strategy which
enables our method to utilize knowledge from the whole video. To better enhance
features from complementary levels, i.e., feature maps and proposals, we
further propose a generalized enhancement operation (GEO) to aggregate
multi-level features in a unified manner. We conduct extensive evaluations on
the challenging ImageNetVID dataset. Compared with existing state-of-the-art
methods, our method achieves superior performance in terms of both speed and
accuracy. More remarkably, MAMBA achieves mAP of 83.7/84.6% at 12.6/9.1 FPS
with ResNet-101. Code is available at
https://github.com/guanxiongsun/vfe.pytorch.
Related papers
- Memory Layers at Scale [67.00854080570979]
This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale.
On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the budget, as well as mixture-of-expert models when matched for both compute and parameters.
We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
arXiv Detail & Related papers (2024-12-12T23:56:57Z) - Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework.
Our approach significantly reduces the memory footprint compared to standard gradient checkpointing.
By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z) - MEMO: Fine-grained Tensor Management For Ultra-long Context LLM Training [24.066283519769968]
Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications.
We propose MEMO, a novel framework for fine-grained activation memory management.
MeMO achieves an average of 1.97x and 1.80x MFU compared to Megatron-LM and DeepSpeed.
arXiv Detail & Related papers (2024-07-16T18:59:49Z) - TF-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking [6.91631684487121]
Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences.
We propose a novel memory-based approach that selectively stores critical features based on object motion and overlapping awareness.
Our approach significantly improves over MOTRv2 in the DanceTrack test set, demonstrating a gain of 2.0% AssA score and 2.1% in IDF1 score.
arXiv Detail & Related papers (2024-07-05T07:55:19Z) - MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD [27.472705540825316]
This paper is on long-term video understanding where the goal is to recognise human actions over long temporal windows (up to minutes long)
We propose an alternative to attention-based schemes which is based on a low-rank approximation of the memory obtained using Singular Value Decomposition.
Our scheme has two advantages: (a) it reduces complexity by more than an order of magnitude, and (b) it is amenable to an efficient implementation for the calculation of the memory bases.
arXiv Detail & Related papers (2024-06-11T12:03:57Z) - Robust and Efficient Memory Network for Video Object Segmentation [6.7995672846437305]
This paper proposes a Robust and Efficient Memory Network, or REMN, for studying semi-supervised video object segmentation (VOS)
We introduce a local attention mechanism that tackles the background distraction by enhancing the features of foreground objects with the previous mask.
Experiments demonstrate that our REMN achieves state-of-the-art results on DAVIS 2017, with a $mathcalJ&F$ score of 86.3% and on YouTube-VOS 2018, with a $mathcalG$ over mean of 85.5%.
arXiv Detail & Related papers (2023-04-24T06:19:21Z) - RMM: Reinforced Memory Management for Class-Incremental Learning [102.20140790771265]
Class-Incremental Learning (CIL) trains classifiers under a strict memory budget.
Existing methods use a static and ad hoc strategy for memory allocation, which is often sub-optimal.
We propose a dynamic memory management strategy that is optimized for the incremental phases and different object classes.
arXiv Detail & Related papers (2023-01-14T00:07:47Z) - Per-Clip Video Object Segmentation [110.08925274049409]
Recently, memory-based approaches show promising results on semisupervised video object segmentation.
We treat video object segmentation as clip-wise mask-wise propagation.
We propose a new method tailored for the per-clip inference.
arXiv Detail & Related papers (2022-08-03T09:02:29Z) - Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z) - Rethinking Space-Time Networks with Improved Memory Coverage for
Efficient Video Object Segmentation [68.45737688496654]
We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
arXiv Detail & Related papers (2021-06-09T16:50:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.