Recurrent Dynamic Embedding for Video Object Segmentation
- URL: http://arxiv.org/abs/2205.03761v1
- Date: Sun, 8 May 2022 02:24:43 GMT
- Title: Recurrent Dynamic Embedding for Video Object Segmentation
- Authors: Mingxing Li, Li Hu, Zhiwei Xiong, Bang Zhang, Pan Pan, Dong Liu
- Abstract summary: We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes the Spatio-temporal Aggregation Module (SAM) more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
- Score: 54.52527157232795
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Space-time memory (STM) based video object segmentation (VOS) networks
usually keep expanding the memory bank every few frames, which yields excellent
performance. However, 1) the hardware cannot accommodate the ever-growing memory
requirements as the video length increases, and 2) storing large amounts of
information inevitably introduces noise, which hinders retrieval of the most
important information from the memory bank. In this paper, we
propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant
size. Specifically, we explicitly generate and update the RDE with the proposed
Spatio-temporal Aggregation Module (SAM), which exploits cues from historical
information. To avoid the error accumulation caused by the recurrent use of SAM,
we propose an unbiased guidance loss during the training stage, which makes SAM
more robust in long videos. Moreover, the predicted masks stored in the memory
bank are imperfect due to inference errors, which affects the segmentation of
the query frame. To address this problem, we design a novel self-correction
strategy so that the network can repair the embeddings of masks of varying
quality in the memory bank. Extensive experiments show that our
method achieves the best tradeoff between performance and speed. Code is
available at https://github.com/Limingxing00/RDE-VOS-CVPR2022.
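The constant-size bank is the paper's key mechanism; a minimal sketch of the idea follows, assuming a GRU-style gated fusion. The module name, layer shapes, and gating form are illustrative simplifications, not the authors' exact SAM.
```python
# Minimal sketch of a constant-size recurrent memory update (illustrative;
# the real SAM architecture is defined in the paper and the linked repo).
import torch
import torch.nn as nn

class RecurrentMemoryUpdate(nn.Module):
    """Fuses the previous memory embedding with the newest frame embedding,
    so the memory bank stays a fixed size regardless of video length."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)
        self.cand = nn.Conv2d(2 * dim, dim, kernel_size=3, padding=1)

    def forward(self, memory: torch.Tensor, frame_embed: torch.Tensor) -> torch.Tensor:
        x = torch.cat([memory, frame_embed], dim=1)
        z = torch.sigmoid(self.gate(x))      # how much history to keep
        h = torch.tanh(self.cand(x))         # candidate update from the new frame
        return z * memory + (1.0 - z) * h    # same shape in, same shape out

# The memory stays (B, C, H, W) however many frames arrive.
update = RecurrentMemoryUpdate(dim=256)
memory = torch.zeros(1, 256, 30, 54)
for frame_embed in torch.randn(8, 1, 256, 30, 54):
    memory = update(memory, frame_embed)
```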
Related papers
- ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information.
We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity.
We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z)
- RMem: Restricted Memory Banks Improve Video Object Segmentation [26.103189475763998]
Video object segmentation (VOS) benchmarks are evolving toward more challenging scenarios.
We revisit a simple but overlooked strategy: restricting the size of memory banks.
By restricting memory banks to a limited number of essential frames, we achieve a notable improvement in VOS accuracy.
arXiv Detail & Related papers (2024-06-12T17:59:04Z)
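Since this entry's whole point is capping the bank size, here is a minimal sketch of one possible restricted bank; pinning the first annotated frame and FIFO eviction are assumptions for illustration, not RMem's actual selection rule.
```python
# Sketch of a size-restricted memory bank (eviction policy is an assumption;
# RMem's actual frame-selection rule is described in the paper).
from collections import deque

class RestrictedMemoryBank:
    def __init__(self, capacity: int):
        self.first = None                       # pinned annotated reference frame
        self.rest = deque(maxlen=capacity - 1)  # bounded; silently drops oldest

    def add(self, feature):
        if self.first is None:
            self.first = feature
        else:
            self.rest.append(feature)

    def frames(self):
        pinned = [self.first] if self.first is not None else []
        return pinned + list(self.rest)
```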
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It capably understands arbitrary-length video with a constant number of video streaming tokens that are encoded and adaptively selected.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
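A plain cross-attention read over a small memory shows the operation MAVOS builds on; the function below is a generic stand-in, not the paper's modulated design.
```python
# Generic cross-attention memory read (a simplified stand-in for MAVOS's
# modulated cross-attention, which the paper defines precisely).
import torch

def memory_read(query: torch.Tensor, mem_keys: torch.Tensor,
                mem_values: torch.Tensor) -> torch.Tensor:
    # query: (B, Nq, C); mem_keys, mem_values: (B, Nm, C)
    scale = query.shape[-1] ** 0.5
    attn = torch.softmax(query @ mem_keys.transpose(1, 2) / scale, dim=-1)
    return attn @ mem_values  # (B, Nq, C): query pixels read from memory
```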
- READMem: Robust Embedding Association for a Diverse Memory in Unconstrained Video Object Segmentation [24.813416082160224]
We present READMem, a modular framework for sVOS methods to handle unconstrained videos.
We propose a robust association of the embeddings stored in the memory with query embeddings during the update process.
Our approach achieves competitive results on the Long-time Video dataset (LV1) while not hindering performance on short sequences.
arXiv Detail & Related papers (2023-05-22T08:31:16Z)
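One way to picture a "diverse memory" update is a greedy similarity gate: only store embeddings that are dissimilar to everything already kept. The cosine rule and threshold below are assumptions, not READMem's exact association criterion.
```python
# Greedy diversity gate (an illustrative assumption; READMem's actual
# embedding-association criterion is specified in the paper).
import torch

def maybe_add_diverse(memory: list, new_embed: torch.Tensor,
                      sim_thresh: float = 0.9) -> list:
    for m in memory:
        sim = torch.cosine_similarity(m.flatten(), new_embed.flatten(), dim=0)
        if sim > sim_thresh:
            return memory            # too redundant with stored content; skip
    return memory + [new_embed]      # adds diversity; keep it
```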
- Robust and Efficient Memory Network for Video Object Segmentation [6.7995672846437305]
This paper proposes a Robust and Efficient Memory Network (REMN) for semi-supervised video object segmentation (VOS).
We introduce a local attention mechanism that tackles background distraction by enhancing the features of foreground objects with the previous mask.
Experiments demonstrate that our REMN achieves state-of-the-art results on DAVIS 2017, with a $\mathcal{J}\&\mathcal{F}$ score of 86.3%, and on YouTube-VOS 2018, with a $\mathcal{G}$ overall mean of 85.5%.
arXiv Detail & Related papers (2023-04-24T06:19:21Z)
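The previous-mask guidance admits a compact illustration: soft-gate the features with the upsampled prior mask. This single gating step is a simplification; REMN's local attention is more elaborate.
```python
# Mask-guided feature gating (a simplification of REMN's local attention,
# which the paper develops in full).
import torch
import torch.nn.functional as F

def mask_guided_gating(features: torch.Tensor, prev_mask: torch.Tensor) -> torch.Tensor:
    # features: (B, C, H, W); prev_mask: (B, 1, h, w) with values in [0, 1]
    gate = F.interpolate(prev_mask, size=features.shape[-2:],
                         mode="bilinear", align_corners=False)
    return features * (1.0 + gate)   # boost likely-foreground locations
```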
- Learning Quality-aware Dynamic Memory for Video Object Segmentation [32.06309833058726]
We propose a Quality-aware Dynamic Memory Network (QDMN) to evaluate the segmentation quality of each frame.
Our QDMN achieves new state-of-the-art performance on both DAVIS and YouTube-VOS benchmarks.
arXiv Detail & Related papers (2022-07-16T12:18:04Z)
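Quality-aware memorization reduces to a gate on the write path: only frames whose predicted segmentation quality clears a bar get stored. The threshold and plain-list bank below are illustrative assumptions; QDMN learns the score with a dedicated assessment head.
```python
# Quality-gated memory write (threshold and list-based bank are assumptions
# for illustration; QDMN predicts the quality score with a learned head).
def quality_gated_write(memory_bank: list, frame_feature, quality_score: float,
                        threshold: float = 0.8) -> list:
    if quality_score >= threshold:   # keep only reliably segmented frames
        memory_bank.append(frame_feature)
    return memory_bank
```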
- XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model [137.50614198301733]
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores.
We develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores.
XMem greatly exceeds state-of-the-art performance on long-video datasets.
arXiv Detail & Related papers (2022-07-14T17:59:37Z)
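The Atkinson-Shiffrin analogy maps onto three stores with different update rates; the sketch below captures that layering, while the consolidation rule (pop-oldest plus compression) is a simplified assumption rather than XMem's actual algorithm.
```python
# Three-store memory in the spirit of XMem's Atkinson-Shiffrin design
# (the consolidation rule here is a simplified assumption).
class MultiStoreMemory:
    def __init__(self, working_capacity: int = 5):
        self.sensory = None     # overwritten every frame
        self.working = []       # recent high-resolution features, bounded
        self.long_term = []     # compact prototypes, grows slowly
        self.working_capacity = working_capacity

    def update(self, feature, compress):
        self.sensory = feature
        self.working.append(feature)
        if len(self.working) > self.working_capacity:
            oldest = self.working.pop(0)
            self.long_term.append(compress(oldest))  # consolidate into long-term store
```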
- Memformer: A Memory-Augmented Transformer for Sequence Modeling [55.780849185884996]
We present Memformer, an efficient neural network for sequence modeling.
Our model achieves linear time complexity and constant memory space complexity when processing long sequences.
arXiv Detail & Related papers (2020-10-14T09:03:36Z)
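Constant memory for arbitrary sequence length typically means a fixed set of slots read and written with attention; the two-attention layout below is an illustrative assumption about that mechanism, not Memformer's exact architecture.
```python
# Fixed-slot memory read/write with attention (an illustrative take on
# constant-memory sequence modeling; not Memformer's exact architecture).
import torch
import torch.nn as nn

class SlotMemory(nn.Module):
    def __init__(self, num_slots: int, dim: int):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.write = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # learned initial memory state; expand to (B, S, C) at sequence start
        self.init_slots = nn.Parameter(torch.randn(num_slots, dim))

    def forward(self, tokens: torch.Tensor, slots: torch.Tensor):
        # tokens: (B, T, C); slots: (B, S, C) -- S never grows with T
        readout, _ = self.read(tokens, slots, slots)      # tokens query memory
        new_slots, _ = self.write(slots, tokens, tokens)  # memory absorbs tokens
        return readout, new_slots
```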
- Video Object Segmentation with Episodic Graph Memory Networks [198.74780033475724]
A graph memory network is developed to address the novel idea of "learning to update the segmentation model".
We exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations via edges.
The proposed graph memory network yields a neat yet principled framework, which generalizes well to both one-shot and zero-shot video object segmentation tasks.
arXiv Detail & Related papers (2020-07-14T13:19:19Z)
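Frames-as-nodes with attention edges can be sketched as a few rounds of message passing over a fully connected graph; the update rule below is a generic simplification of the episodic graph memory idea.
```python
# Message passing over a fully connected frame graph (a generic
# simplification of the episodic graph memory idea).
import torch

def graph_memory_readout(node_feats: torch.Tensor, iterations: int = 3) -> torch.Tensor:
    # node_feats: (N, C), one node per stored frame
    h = node_feats
    scale = h.shape[-1] ** 0.5
    for _ in range(iterations):
        edges = torch.softmax(h @ h.t() / scale, dim=-1)  # pairwise attention edges
        h = h + edges @ h                                 # aggregate neighbor messages
    return h
```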