READMem: Robust Embedding Association for a Diverse Memory in
Unconstrained Video Object Segmentation
- URL: http://arxiv.org/abs/2305.12823v2
- Date: Mon, 25 Sep 2023 13:36:44 GMT
- Title: READMem: Robust Embedding Association for a Diverse Memory in
Unconstrained Video Object Segmentation
- Authors: Stéphane Vujasinović, Sebastian Bullinger, Stefan Becker, Norbert
Scherer-Negenborn, Michael Arens and Rainer Stiefelhagen
- Abstract summary: We present READMem, a modular framework for sVOS methods to handle unconstrained videos.
We propose a robust association of the embeddings stored in the memory with query embeddings during the update process.
Our approach achieves competitive results on the Long-time Video dataset (LV1) while not hindering performance on short sequences.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present READMem (Robust Embedding Association for a Diverse Memory), a
modular framework for semi-automatic video object segmentation (sVOS) methods
designed to handle unconstrained videos. Contemporary sVOS works typically
aggregate video frames in an ever-expanding memory, demanding high hardware
resources for long-term applications. To mitigate memory requirements and
prevent near object duplicates (caused by information of adjacent frames),
previous methods introduce a hyper-parameter that controls the frequency of
frames eligible to be stored. This parameter has to be adjusted according to
concrete video properties (such as rapidity of appearance changes and video
length) and does not generalize well. Instead, we integrate the embedding of a
new frame into the memory only if it increases the diversity of the memory
content. Furthermore, we propose a robust association of the embeddings stored
in the memory with query embeddings during the update process. Our approach
avoids the accumulation of redundant data, allowing us, in turn, to restrict
the memory size and prevent extreme memory demands in long videos. We extend
popular sVOS baselines with READMem, which previously showed limited
performance on long videos. Our approach achieves competitive results on the
Long-time Video dataset (LV1) while not hindering performance on short
sequences. Our code is publicly available.
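The diversity-gated update described in the abstract (store a new frame's embedding only if it increases the diversity of the memory) can be sketched as follows. This is an illustrative interpretation, not the authors' implementation: the `gram_det` diversity proxy (determinant of the Gram matrix of normalized embeddings) and the slot-replacement policy are assumptions.

```python
import numpy as np

def gram_det(embeddings):
    """Diversity proxy: determinant of the Gram matrix of the
    L2-normalized embeddings. A larger determinant means the stored
    embeddings are less linearly dependent, i.e. more diverse.
    (Assumed measure, not necessarily the paper's.)"""
    if len(embeddings) == 0:
        return 0.0
    E = np.stack(embeddings)                           # (n, d)
    E = E / np.linalg.norm(E, axis=1, keepdims=True)   # normalize rows
    return float(np.linalg.det(E @ E.T))

def maybe_update_memory(memory, new_emb, max_size):
    """Insert `new_emb` only if it increases memory diversity.

    While the memory is not full, always store. Once full, try
    replacing each stored slot with the new embedding and keep the
    replacement that yields the highest diversity, but only if it
    beats the current memory's diversity. Returns True if the memory
    was modified."""
    if len(memory) < max_size:
        memory.append(new_emb)
        return True
    best_div, best_idx = gram_det(memory), None
    for i in range(len(memory)):
        candidate = memory[:i] + [new_emb] + memory[i + 1:]
        div = gram_det(candidate)
        if div > best_div:
            best_div, best_idx = div, i
    if best_idx is not None:
        memory[best_idx] = new_emb
        return True
    return False
```

Because near-duplicate embeddings leave the Gram determinant essentially unchanged, adjacent redundant frames are rejected without any frame-rate hyper-parameter, which is the behavior the abstract claims.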
Related papers
- LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matched STM-based methods with 53% less GPU memory and supports 4096p inference on a 32G consumer-grade GPU.
arXiv Detail & Related papers (2024-11-05T05:36:17Z)
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It understands videos of arbitrary length using a constant number of video streaming tokens that are encoded and progressively selected.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
- Learning Quality-aware Dynamic Memory for Video Object Segmentation [32.06309833058726]
We propose a Quality-aware Dynamic Memory Network (QDMN) to evaluate the segmentation quality of each frame.
Our QDMN achieves new state-of-the-art performance on both DAVIS and YouTube-VOS benchmarks.
arXiv Detail & Related papers (2022-07-16T12:18:04Z)
- XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model [137.50614198301733]
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores.
We develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores.
XMem greatly exceeds state-of-the-art performance on long-video datasets.
arXiv Detail & Related papers (2022-07-14T17:59:37Z)
- Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size.
We propose an unbiased guidance loss during the training stage, which makes SAM more robust in long videos.
We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z)
- Adaptive Memory Management for Video Object Segmentation [6.282068591820945]
A matching-based network stores every k-th frame in an external memory bank for future inference.
The memory bank grows with the length of the video, which slows down inference and makes arbitrarily long videos impractical to handle.
This paper proposes an adaptive memory bank strategy for matching-based networks for semi-supervised video object segmentation (VOS) that can handle videos of arbitrary length by discarding obsolete features.
arXiv Detail & Related papers (2022-04-13T19:59:07Z)
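The adaptive memory bank strategy in the last entry (fixed capacity, discarding obsolete features) can be illustrated with a usage-based eviction policy. This is a hypothetical sketch: the class name, the hit-counting notion of "obsolete", and the dot-product matching are assumptions, not the paper's actual criterion.

```python
import numpy as np

class AdaptiveMemoryBank:
    """Fixed-capacity memory bank with usage-based eviction.

    Illustrative sketch only: slots that are rarely matched by
    queries are treated as obsolete and evicted first; the paper's
    actual obsolescence criterion may differ."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.features = []  # stored feature vectors
        self.hits = []      # how often each slot was the best match

    def match(self, query):
        """Return the index of the most similar stored feature,
        counting the hit against that slot."""
        sims = [float(np.dot(query, f)) for f in self.features]
        idx = int(np.argmax(sims))
        self.hits[idx] += 1
        return idx

    def add(self, feature):
        """Store a feature, evicting the least-used slot when full,
        so the bank never grows with video length."""
        if len(self.features) >= self.capacity:
            victim = self.hits.index(min(self.hits))
            self.features.pop(victim)
            self.hits.pop(victim)
        self.features.append(feature)
        self.hits.append(0)
```

The key property shared with the paper's motivation is that memory consumption stays constant regardless of sequence length, at the cost of discarding features deemed obsolete.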
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.