Memory-Efficient Continual Learning Object Segmentation for Long Video
- URL: http://arxiv.org/abs/2309.15274v2
- Date: Wed, 14 Feb 2024 17:11:55 GMT
- Title: Memory-Efficient Continual Learning Object Segmentation for Long Video
- Authors: Amir Nazemi, Mohammad Javad Shafiee, Zahra Gharaee, Paul Fieguth
- Abstract summary: We propose two novel techniques to reduce the memory requirement of Online VOS methods while improving modeling accuracy and generalization on long videos.
Motivated by the success of continual learning techniques in preserving previously learned knowledge, we propose Gated-Regularizer Continual Learning (GRCL) and Reconstruction-based Memory Selection Continual Learning (RMSCL).
Experimental results show that the proposed methods are able to improve the performance of Online VOS models by more than 8%, with improved robustness on long-video datasets.
- Score: 7.9190306016374485
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent state-of-the-art semi-supervised Video Object Segmentation (VOS)
methods have shown significant improvements in target object segmentation
accuracy when information from preceding frames is used in segmenting the
current frame. In particular, such memory-based approaches can help a model to
more effectively handle appearance changes (representation drift) or
occlusions. Ideally, for maximum performance, Online VOS methods would need all
or most of the preceding frames (or their extracted information) to be stored
in memory and be used for online learning in later frames. Such a solution is
not feasible for long videos, as the required memory size grows without bound,
and such methods can fail when memory is limited and a target object
experiences repeated representation drifts throughout a video. We propose two
novel techniques to reduce the memory requirement of Online VOS methods while
improving modeling accuracy and generalization on long videos. Motivated by the
success of continual learning techniques in preserving previously-learned
knowledge, here we propose Gated-Regularizer Continual Learning (GRCL), which
improves the performance of any Online VOS method subject to limited memory, and
Reconstruction-based Memory Selection Continual Learning (RMSCL), which
empowers Online VOS methods to efficiently benefit from stored information in
memory. We also analyze the performance of a hybrid combination of the two
proposed methods. Experimental results show that the proposed methods are able
to improve the performance of Online VOS models by more than 8%, with improved
robustness on long-video datasets while maintaining comparable performance on
short-video datasets such as DAVIS16, DAVIS17, and YouTube-VOS18.
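To make the two ideas concrete, a minimal sketch follows (hypothetical, not the authors' implementation: the function names, the quadratic form of the gated penalty, and the plain least-squares criterion standing in for the paper's reconstruction-based selection are all assumptions):

```python
import torch

def gated_regularizer_loss(model, anchor_params, gates, task_loss, lam=1.0):
    """GRCL-style objective (sketch): `gates` holds a 0/1 mask per parameter
    tensor, marking weights important for previously learned frames; gated
    weights are pulled back toward their stored values in `anchor_params`,
    so online updates cannot overwrite old knowledge."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (gates[name] * (p - anchor_params[name]) ** 2).sum()
    return task_loss + lam * penalty

def select_memory_by_reconstruction(memory_feats, current_feat, k=8):
    """RMSCL-style selection (sketch): keep the k stored features whose
    least-squares combination best reconstructs the current frame's feature,
    so a bounded memory retains only the most informative entries.
    memory_feats: (N, D) stored features; current_feat: (D,) query."""
    A = memory_feats.T                                   # (D, N) dictionary
    sol = torch.linalg.lstsq(A, current_feat.unsqueeze(-1)).solution
    weights = sol.squeeze(-1).abs()                      # per-entry contribution
    return torch.topk(weights, min(k, weights.numel())).indices
```

Under these assumptions, the hybrid variant analyzed in the paper would apply the gated penalty during online updates while pruning the memory bank with the reconstruction-based selection.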
Related papers
- ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information.
We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity.
We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z)
- LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matches STM-based methods with 53% less GPU memory and supports 4096p inference on a 32GB consumer-grade GPU; a generic linear-attention sketch follows this entry.
arXiv Detail & Related papers (2024-11-05T05:36:17Z)
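The gated linear matching itself is not spelled out in the summary, but generic linear attention shows why memory stays constant: keys and values are folded into a fixed-size state, so matching cost no longer grows with the number of memory frames. A sketch under that assumption (not LiVOS's actual code; the ELU feature map and the omission of gating are simplifications):

```python
import torch

def linear_matching(q, k, v, eps=1e-6):
    """Generic linear-attention matching (sketch). The (D, D) state and the
    (D,) normalizer are fixed-size regardless of how many memory frames were
    folded in. q: (Nq, D) queries; k, v: (Nm, D) memory keys and values."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0   # positive feature map
    q, k = phi(q), phi(k)
    state = k.T @ v                    # (D, D): all memory compressed here
    norm = k.sum(dim=0)                # (D,) running normalizer
    out = (q @ state) / (q @ norm).clamp_min(eps).unsqueeze(1)
    return out                         # (Nq, D) matched features
```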
- VideoPatchCore: An Effective Method to Memorize Normality for Video Anomaly Detection [1.9384004397336387]
Video anomaly detection (VAD) is a crucial task in video analysis and surveillance within computer vision.
We propose an effective memory method for VAD, called VideoPatchCore.
Our approach introduces a structure that prioritizes memory optimization and configures three types of memory tailored to the characteristics of video data.
arXiv Detail & Related papers (2024-09-24T16:38:41Z)
- VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges [42.555895949250704]
VideoLLaMB is a novel framework that uses temporal memory tokens within bridge layers to encode entire video sequences.
Its SceneTilling algorithm segments videos into independent semantic units to preserve semantic integrity.
In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z)
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
- Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models.
We show how MAC can be combined with and improve the performance of popular alternatives such as retrieval-augmented generation.
arXiv Detail & Related papers (2024-03-07T08:34:57Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Per-Clip Video Object Segmentation [110.08925274049409]
Recently, memory-based approaches have shown promising results on semi-supervised video object segmentation.
We treat video object segmentation as clip-wise mask propagation.
We propose a new method tailored for per-clip inference; a toy propagation loop follows this entry.
arXiv Detail & Related papers (2022-08-03T09:02:29Z)
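In the simplest reading, the per-clip idea reduces to segmenting short clips jointly and chaining them through their boundary masks. A toy loop under that reading (`segment_clip` is a hypothetical model call, not the paper's API):

```python
def segment_per_clip(frames, init_mask, segment_clip, clip_len=5):
    """Clip-wise mask propagation (sketch): the video is cut into short
    clips, each clip is segmented jointly, and the last mask of a clip
    seeds the next one, instead of strict frame-by-frame propagation."""
    masks, seed = [], init_mask
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        clip_masks = segment_clip(clip, seed)   # joint inference over one clip
        masks.extend(clip_masks)
        seed = clip_masks[-1]
    return masks
```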
- Learning Quality-aware Dynamic Memory for Video Object Segmentation [32.06309833058726]
We propose a Quality-aware Dynamic Memory Network (QDMN) to evaluate the segmentation quality of each frame.
Our QDMN achieves new state-of-the-art performance on both the DAVIS and YouTube-VOS benchmarks; a toy quality gate follows this entry.
arXiv Detail & Related papers (2022-07-16T12:18:04Z)
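A toy version of quality-aware memorization, assuming a learned quality scorer (`quality_net` and the threshold `tau` are placeholders, not QDMN's actual components):

```python
def maybe_memorize(memory, frame_feat, mask_logits, quality_net, tau=0.8):
    """Quality-aware memory update (sketch): a learned scorer estimates how
    good the predicted mask is, and only frames clearing the threshold are
    stored, keeping poor segmentations out of the memory bank."""
    score = quality_net(frame_feat, mask_logits)   # estimated quality in [0, 1]
    if score >= tau:
        memory.append((frame_feat, mask_logits))
    return memory
```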
- Adaptive Memory Management for Video Object Segmentation [6.282068591820945]
A matching-based network stores every k-th frame in an external memory bank for future inference.
The size of the memory bank gradually increases with the length of the video, which slows down inference and makes it impractical to handle videos of arbitrary length.
This paper proposes an adaptive memory bank strategy for matching-based networks for semi-supervised video object segmentation (VOS) that can handle videos of arbitrary length by discarding obsolete features; a minimal sketch follows this entry.
arXiv Detail & Related papers (2022-04-13T19:59:07Z)
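A minimal fixed-capacity bank in this spirit, using a least-recently-matched eviction rule as a stand-in for the paper's notion of obsolete features (the capacity, stride, and eviction rule are assumptions):

```python
from collections import OrderedDict

class AdaptiveMemoryBank:
    """Bounded feature memory (sketch): stores every k-th frame and, once
    full, evicts the least-recently-matched entry, so memory stays constant
    for arbitrarily long videos."""
    def __init__(self, capacity=20, every_k=5):
        self.capacity, self.every_k = capacity, every_k
        self.bank = OrderedDict()              # frame_idx -> feature tensor

    def maybe_store(self, frame_idx, feature):
        if frame_idx % self.every_k != 0:
            return
        if len(self.bank) >= self.capacity:
            self.bank.popitem(last=False)      # drop the stalest entry
        self.bank[frame_idx] = feature

    def mark_matched(self, frame_idx):
        self.bank.move_to_end(frame_idx)       # refresh recency after a match
```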
- Video Object Segmentation with Episodic Graph Memory Networks [198.74780033475724]
A graph memory network is developed to address the novel idea of "learning to update the segmentation model".
We exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations by edges.
The proposed graph memory network yields a neat yet principled framework, which generalizes well to both one-shot and zero-shot video object segmentation tasks.
arXiv Detail & Related papers (2020-07-14T13:19:19Z)