Robust and Efficient Memory Network for Video Object Segmentation
- URL: http://arxiv.org/abs/2304.11840v1
- Date: Mon, 24 Apr 2023 06:19:21 GMT
- Title: Robust and Efficient Memory Network for Video Object Segmentation
- Authors: Yadang Chen, Dingwei Zhang, Zhi-xin Yang, Enhua Wu
- Abstract summary: This paper proposes a Robust and Efficient Memory Network, or REMN, for studying semi-supervised video object segmentation (VOS).
We introduce a local attention mechanism that tackles the background distraction by enhancing the features of foreground objects with the previous mask.
Experiments demonstrate that our REMN achieves state-of-the-art results on DAVIS 2017, with a $\mathcal{J\&F}$ score of 86.3% and on YouTube-VOS 2018, with a $\mathcal{G}$ over mean of 85.5%.
- Score: 6.7995672846437305
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes a Robust and Efficient Memory Network, referred to as
REMN, for studying semi-supervised video object segmentation (VOS).
Memory-based methods have recently achieved outstanding VOS performance by
performing non-local pixel-wise matching between the query and memory. However,
these methods have two limitations. 1) Non-local matching could cause
distractor objects in the background to be incorrectly segmented. 2) Memory
features with high temporal redundancy consume significant computing resources.
For limitation 1, we introduce a local attention mechanism that tackles the
background distraction by enhancing the features of foreground objects with the
previous mask. For limitation 2, we first adaptively decide whether to update
the memory features depending on the variation of foreground objects to reduce
temporal redundancy. Second, we employ a dynamic memory bank, which uses a
lightweight and differentiable soft modulation gate to decide how many memory
features need to be removed in the temporal dimension. Experiments demonstrate
that our REMN achieves state-of-the-art results on DAVIS 2017, with a
$\mathcal{J\&F}$ score of 86.3% and on YouTube-VOS 2018, with a $\mathcal{G}$
over mean of 85.5%. Furthermore, our network shows a high inference speed of
25+ FPS and uses relatively few computing resources.
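The adaptive update described for limitation 2 can be illustrated with a small sketch. The abstract only says the decision depends on "the variation of foreground objects"; it does not say how that variation is measured. The version below assumes, purely for illustration, that variation is one minus the IoU of consecutive binary masks and that a fixed threshold triggers the update. The names `foreground_variation`, `should_update_memory`, and the `threshold` value are hypothetical and are not taken from the paper.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = foreground, 0 = background


def foreground_variation(prev_mask: Mask, curr_mask: Mask) -> float:
    """Return 1 - IoU of the two foreground regions.

    Assumption: IoU is used here as a stand-in measure of how much the
    foreground changed; the paper may use a different measure.
    """
    inter = 0
    union = 0
    for prev_row, curr_row in zip(prev_mask, curr_mask):
        for p, c in zip(prev_row, curr_row):
            inter += 1 if (p and c) else 0
            union += 1 if (p or c) else 0
    if union == 0:
        return 0.0  # no foreground in either frame: treat as unchanged
    return 1.0 - inter / union


def should_update_memory(prev_mask: Mask, curr_mask: Mask,
                         threshold: float = 0.2) -> bool:
    """Write the current frame to memory only if the foreground changed enough.

    Assumption: `threshold` is an illustrative value, not one from the paper.
    """
    return foreground_variation(prev_mask, curr_mask) > threshold


if __name__ == "__main__":
    prev = [[0, 1, 1],
            [0, 1, 1]]
    moved = [[1, 1, 0],
             [1, 1, 0]]
    print(should_update_memory(prev, prev))   # unchanged foreground
    print(should_update_memory(prev, moved))  # foreground shifted left
```

Skipping the memory write when consecutive masks barely differ is what reduces temporal redundancy in this sketch: near-duplicate frames never enter the memory bank, so later matching has fewer features to compare against.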
Related papers
- LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matches STM-based methods with 53% less GPU memory and supports 4096p inference on a 32G consumer-grade GPU.
arXiv Detail & Related papers (2024-11-05T05:36:17Z)
- vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z)
- TF-SASM: Training-free Spatial-aware Sparse Memory for Multi-object Tracking [6.91631684487121]
Multi-object tracking (MOT) in computer vision remains a significant challenge, requiring precise localization and continuous tracking of multiple objects in video sequences.
We propose a novel memory-based approach that selectively stores critical features based on object motion and overlapping awareness.
Our approach significantly improves over MOTRv2 in the DanceTrack test set, demonstrating a gain of 2.0% AssA score and 2.1% in IDF1 score.
arXiv Detail & Related papers (2024-07-05T07:55:19Z)
- MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection [35.16197118579414]
We propose a multi-level aggregation architecture via memory bank called MAMBA.
Specifically, our memory bank employs two novel operations to eliminate the disadvantages of existing methods.
Compared with existing state-of-the-art methods, our method achieves superior performance in terms of both speed and accuracy.
arXiv Detail & Related papers (2024-01-18T12:13:06Z)
- Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object Segmentation (RAVOS) is a method that predicts regions of interest (ROIs) for efficient object segmentation and memory storage.
For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation.
For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
arXiv Detail & Related papers (2022-07-21T01:44:40Z)
- Learning Quality-aware Dynamic Memory for Video Object Segmentation [32.06309833058726]
We propose a Quality-aware Dynamic Memory Network (QDMN) to evaluate the segmentation quality of each frame.
Our QDMN achieves new state-of-the-art performance on both DAVIS and YouTube-VOS benchmarks.
arXiv Detail & Related papers (2022-07-16T12:18:04Z)
- Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes the Spatio-temporal Aggregation Module (SAM) more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- Rethinking Space-Time Networks with Improved Memory Coverage for Efficient Video Object Segmentation [68.45737688496654]
We establish correspondences directly between frames without re-encoding the mask features for every object.
With the correspondences, every node in the current query frame is inferred by aggregating features from the past in an associative fashion.
We validated that every memory node now has a chance to contribute, and experimentally showed that such diversified voting is beneficial to both memory efficiency and inference accuracy.
arXiv Detail & Related papers (2021-06-09T16:50:57Z)
- Dual Temporal Memory Network for Efficient Video Object Segmentation [42.05305410986511]
One of the fundamental challenges in Video Object Segmentation (VOS) is how to make the most of temporal information to boost performance.
We present an end-to-end network which stores short- and long-term video sequence information preceding the current frame as the temporal memories.
Our network consists of two temporal sub-networks including a short-term memory sub-network and a long-term memory sub-network.
arXiv Detail & Related papers (2020-03-13T06:07:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.