MAMBA: Multi-level Aggregation via Memory Bank for Video Object
Detection
- URL: http://arxiv.org/abs/2401.09923v2
- Date: Thu, 1 Feb 2024 18:43:06 GMT
- Authors: Guanxiong Sun, Yang Hua, Guosheng Hu, Neil Robertson
- Abstract summary: We propose a multi-level aggregation architecture via memory bank called MAMBA.
Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods.
Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy.
- Score: 35.16197118579414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art video object detection methods maintain a memory structure,
either a sliding window or a memory queue, to enhance the current frame using
attention mechanisms. However, we argue that these memory structures are not
efficient or sufficient because of two implied operations: (1) concatenating
all features in memory for enhancement, leading to a heavy computational cost;
(2) frame-wise memory updating, preventing the memory from capturing more
temporal information. In this paper, we propose a multi-level aggregation
architecture via memory bank called MAMBA. Specifically, our memory bank
employs two novel operations to eliminate the disadvantages of existing
methods: (1) light-weight key-set construction which can significantly reduce
the computational cost; (2) fine-grained feature-wise updating strategy which
enables our method to utilize knowledge from the whole video. To better enhance
features from complementary levels, i.e., feature maps and proposals, we
further propose a generalized enhancement operation (GEO) to aggregate
multi-level features in a unified manner. We conduct extensive evaluations on
the challenging ImageNetVID dataset. Compared with existing state-of-the-art
methods, our method achieves superior performance in terms of both speed and
accuracy. More remarkably, MAMBA achieves an mAP of 83.7%/84.6% at 12.6/9.1 FPS
with ResNet-101. Code is available at
https://github.com/guanxiongsun/vfe.pytorch.
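The abstract names three ingredients: a memory bank updated feature by feature, a lightweight key set drawn from it, and one enhancement operation shared across feature levels. The following is a minimal pure-Python sketch of how such pieces could fit together; it is not the authors' implementation. The class `MemoryBank`, the function `enhance`, and all parameters are hypothetical names. Random replacement (for the update) and random sampling (for the key set) are assumptions chosen for illustration, since the abstract does not specify either policy.

```python
import math
import random


class MemoryBank:
    """Fixed-capacity store of feature vectors (hypothetical sketch)."""

    def __init__(self, capacity, key_set_size, seed=0):
        self.capacity = capacity
        self.key_set_size = key_set_size
        self.features = []
        self.rng = random.Random(seed)

    def update(self, new_features):
        """Feature-wise update: vectors are inserted one at a time.

        Assumption: once the bank is full, each new vector overwrites one
        randomly chosen stored vector, so old frames fade out gradually
        instead of being dropped a whole frame at a time.
        """
        for f in new_features:
            if len(self.features) < self.capacity:
                self.features.append(list(f))
            else:
                self.features[self.rng.randrange(self.capacity)] = list(f)

    def key_set(self):
        """Lightweight key set: a small subset of the bank.

        Assumption: the subset is sampled uniformly at random.
        """
        k = min(self.key_set_size, len(self.features))
        return self.rng.sample(self.features, k)


def enhance(queries, keys):
    """Residual scaled dot-product attention of queries over keys.

    The same function works for any feature level (feature-map pixels or
    proposal features), as long as queries and keys share one dimension.
    """
    if not keys:
        return [list(q) for q in queries]
    dim = len(keys[0])
    enhanced = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        aggregated = [
            sum(w * k[i] for w, k in zip(weights, keys)) / total for i in range(dim)
        ]
        enhanced.append([qi + ai for qi, ai in zip(q, aggregated)])
    return enhanced


# Toy usage: 5 frames, 3 feature vectors per frame, 3 dimensions each.
bank = MemoryBank(capacity=8, key_set_size=4)
for frame in range(5):
    feats = [
        [float(frame), 1.0, 0.0],
        [0.0, float(frame), 1.0],
        [1.0, 0.0, float(frame)],
    ]
    enhanced = enhance(feats, bank.key_set())  # attend to the key set only
    bank.update(feats)                         # then store the new features
```

In this sketch the attention cost per query scales with `key_set_size` rather than with the number of stored features, which is the kind of saving the abstract attributes to key-set construction.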
Related papers
- Efficiently Training 7B LLM with 1 Million Sequence Length on 8 GPUs [24.066283519769968]
Large Language Models (LLMs) have been trained using extended context lengths to foster more creative applications.
We propose MEMO, a novel framework for fine-grained activation memory management.
We show that MEMO achieves an average of 2.42x and 2.26x MFU compared to Megatron-LM and DeepSpeed.
arXiv Detail & Related papers (2024-07-16T18:59:49Z)
- B'MOJO: Hybrid State Space Realizations of Foundation Models with Eidetic and Fading Memory [91.81390121042192]
We develop a class of models called B'MOJO to seamlessly combine eidetic and fading memory within a composable module.
B'MOJO's ability to modulate eidetic and fading memory results in better inference on longer sequences tested up to 32K tokens.
arXiv Detail & Related papers (2024-07-08T18:41:01Z)
- TF-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking [6.91631684487121]
Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences.
We propose a novel memory-based approach that selectively stores critical features based on object motion and overlapping awareness.
Our approach significantly improves over MOTRv2 in the DanceTrack test set, demonstrating a gain of 2.0% AssA score and 2.1% in IDF1 score.
arXiv Detail & Related papers (2024-07-05T07:55:19Z)
- MeMSVD: Long-Range Temporal Structure Capturing Using Incremental SVD [27.472705540825316]
This paper is on long-term video understanding, where the goal is to recognise human actions over long temporal windows (up to minutes long).
We propose an alternative to attention-based schemes which is based on a low-rank approximation of the memory obtained using Singular Value Decomposition.
Our scheme has two advantages: (a) it reduces complexity by more than an order of magnitude, and (b) it is amenable to an efficient implementation for the calculation of the memory bases.
arXiv Detail & Related papers (2024-06-11T12:03:57Z)
- Robust and Efficient Memory Network for Video Object Segmentation [6.7995672846437305]
This paper proposes a Robust and Efficient Memory Network, or REMN, for studying semi-supervised video object segmentation (VOS).
We introduce a local attention mechanism that tackles the background distraction by enhancing the features of foreground objects with the previous mask.
Experiments demonstrate that our REMN achieves state-of-the-art results on DAVIS 2017, with a $\mathcal{J}\&\mathcal{F}$ score of 86.3% and on YouTube-VOS 2018, with a $\mathcal{G}$ over mean of 85.5%.
arXiv Detail & Related papers (2023-04-24T06:19:21Z)
- RMM: Reinforced Memory Management for Class-Incremental Learning [102.20140790771265]
Class-Incremental Learning (CIL) trains classifiers under a strict memory budget.
Existing methods use a static and ad hoc strategy for memory allocation, which is often sub-optimal.
We propose a dynamic memory management strategy that is optimized for the incremental phases and different object classes.
arXiv Detail & Related papers (2023-01-14T00:07:47Z)
- Per-Clip Video Object Segmentation [110.08925274049409]
Recently, memory-based approaches have shown promising results on semi-supervised video object segmentation.
We treat video object segmentation as clip-wise mask-wise propagation.
We propose a new method tailored for the per-clip inference.
arXiv Detail & Related papers (2022-08-03T09:02:29Z)
- Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z)
- Multi-Scale Memory-Based Video Deblurring [34.488707652997704]
We design a memory branch to memorize the blurry-sharp feature pairs in the memory bank.
To enrich the memory of our memory bank, we also designed a bidirectional recurrency and multi-scale strategy.
Experimental results demonstrate that our model outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2022-04-06T08:48:56Z)
- Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation [68.45737688496654]
We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
arXiv Detail & Related papers (2021-06-09T16:50:57Z)