Memory-Efficient Continual Learning Object Segmentation for Long Video
- URL: http://arxiv.org/abs/2309.15274v2
- Date: Wed, 14 Feb 2024 17:11:55 GMT
- Title: Memory-Efficient Continual Learning Object Segmentation for Long Video
- Authors: Amir Nazemi, Mohammad Javad Shafiee, Zahra Gharaee, Paul Fieguth
- Abstract summary: We propose two novel techniques to reduce the memory requirement of Online VOS methods while improving modeling accuracy and generalization on long videos.
Motivated by the success of continual learning techniques in preserving previously learned knowledge, we propose Gated-Regularizer Continual Learning (GRCL) and Reconstruction-based Memory Selection Continual Learning (RMSCL).
Experimental results show that the proposed methods are able to improve the performance of Online VOS models by more than 8%, with improved robustness on long-video datasets.
- Score: 7.9190306016374485
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent state-of-the-art semi-supervised Video Object Segmentation (VOS)
methods have shown significant improvements in target object segmentation
accuracy when information from preceding frames is used in segmenting the
current frame. In particular, such memory-based approaches can help a model to
more effectively handle appearance changes (representation drift) or
occlusions. Ideally, for maximum performance, Online VOS methods would need all
or most of the preceding frames (or their extracted information) to be stored
in memory and be used for online learning in later frames. Such a solution is
not feasible for long videos, as the required memory size grows without bound,
and such methods can fail when memory is limited and a target object
experiences repeated representation drifts throughout a video. We propose two
novel techniques to reduce the memory requirement of Online VOS methods while
improving modeling accuracy and generalization on long videos. Motivated by the
success of continual learning techniques in preserving previously-learned
knowledge, here we propose Gated-Regularizer Continual Learning (GRCL), which
improves the performance of any Online VOS method subject to limited memory, and
Reconstruction-based Memory Selection Continual Learning (RMSCL), which
empowers Online VOS methods to efficiently benefit from stored information in
memory. We also analyze the performance of a hybrid combination of the two
proposed methods. Experimental results show that the proposed methods are able
to improve the performance of Online VOS models by more than 8%, with improved
robustness on long-video datasets while maintaining comparable performance on
short-video datasets such as DAVIS16, DAVIS17, and YouTube-VOS18.
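To make the two ideas concrete, a minimal sketch follows (hypothetical, not the authors' implementation: the function names, the quadratic form of the gated penalty, and the plain least-squares criterion standing in for the paper's reconstruction-based selection are all assumptions):

```python
import torch

def gated_regularizer_loss(model, anchor_params, gates, task_loss, lam=1.0):
    """GRCL-style objective (sketch): `gates` holds a 0/1 mask per parameter
    tensor, marking weights important for previously learned frames; gated
    weights are pulled back toward their stored values in `anchor_params`,
    so online updates cannot overwrite old knowledge."""
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (gates[name] * (p - anchor_params[name]) ** 2).sum()
    return task_loss + lam * penalty

def select_memory_by_reconstruction(memory_feats, current_feat, k=8):
    """RMSCL-style selection (sketch): keep the k stored features whose
    least-squares combination best reconstructs the current frame's feature,
    so a bounded memory retains only the most informative entries.
    memory_feats: (N, D) stored features; current_feat: (D,) query."""
    A = memory_feats.T                                   # (D, N) dictionary
    sol = torch.linalg.lstsq(A, current_feat.unsqueeze(-1)).solution
    weights = sol.squeeze(-1).abs()                      # per-entry contribution
    return torch.topk(weights, min(k, weights.numel())).indices
```

Under these assumptions, the hybrid variant analyzed in the paper would apply the gated penalty during online updates while pruning the memory bank with the reconstruction-based selection.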
Related papers
- ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information.
We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity.
We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z)
- LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention.
For longer and higher-resolution videos, it matches STM-based methods with 53% less GPU memory and supports 4096p inference on a 32GB consumer-grade GPU; a generic linear-attention sketch follows this entry.
arXiv Detail & Related papers (2024-11-05T05:36:17Z)
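The gated linear matching itself is not spelled out in the summary, but generic linear attention shows why memory stays constant: keys and values are folded into a fixed-size state, so matching cost no longer grows with the number of memory frames. A sketch under that assumption (not LiVOS's actual code; the ELU feature map and the omission of gating are simplifications):

```python
import torch

def linear_matching(q, k, v, eps=1e-6):
    """Generic linear-attention matching (sketch). The (D, D) state and the
    (D,) normalizer are fixed-size regardless of how many memory frames were
    folded in. q: (Nq, D) queries; k, v: (Nm, D) memory keys and values."""
    phi = lambda x: torch.nn.functional.elu(x) + 1.0   # positive feature map
    q, k = phi(q), phi(k)
    state = k.T @ v                    # (D, D): all memory compressed here
    norm = k.sum(dim=0)                # (D,) running normalizer
    out = (q @ state) / (q @ norm).clamp_min(eps).unsqueeze(1)
    return out                         # (Nq, D) matched features
```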
- VideoPatchCore: An Effective Method to Memorize Normality for Video Anomaly Detection [1.9384004397336387]
Video anomaly detection (VAD) is a crucial task in video analysis and surveillance within computer vision.
We propose an effective memory method for VAD, called VideoPatchCore.
Our approach introduces a structure that prioritizes memory optimization and configures three types of memory tailored to the characteristics of video data.
arXiv Detail & Related papers (2024-09-24T16:38:41Z)
- VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges [42.555895949250704]
VideoLLaMB is a novel framework that uses temporal memory tokens within bridge layers to encode entire video sequences.
Its SceneTilling algorithm segments videos into independent semantic units to preserve semantic integrity.
In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z)
- Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion.
Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
arXiv Detail & Related papers (2024-03-26T17:59:58Z)
- Online Adaptation of Language Models with a Memory of Amortized Contexts [82.02369596879817]
Memory of Amortized Contexts (MAC) is an efficient and effective online adaptation framework for large language models.
We show how MAC can be combined with and improve the performance of popular alternatives such as retrieval-augmented generation.
arXiv Detail & Related papers (2024-03-07T08:34:57Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [54.90226700939778]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion.
We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Per-Clip Video Object Segmentation [110.08925274049409]
Recently, memory-based approaches have shown promising results on semi-supervised video object segmentation.
We treat video object segmentation as clip-wise mask propagation.
We propose a new method tailored for per-clip inference; a toy propagation loop follows this entry.
arXiv Detail & Related papers (2022-08-03T09:02:29Z)
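In the simplest reading, the per-clip idea reduces to segmenting short clips jointly and chaining them through their boundary masks. A toy loop under that reading (`segment_clip` is a hypothetical model call, not the paper's API):

```python
def segment_per_clip(frames, init_mask, segment_clip, clip_len=5):
    """Clip-wise mask propagation (sketch): the video is cut into short
    clips, each clip is segmented jointly, and the last mask of a clip
    seeds the next one, instead of strict frame-by-frame propagation."""
    masks, seed = [], init_mask
    for start in range(0, len(frames), clip_len):
        clip = frames[start:start + clip_len]
        clip_masks = segment_clip(clip, seed)   # joint inference over one clip
        masks.extend(clip_masks)
        seed = clip_masks[-1]
    return masks
```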
- Learning Quality-aware Dynamic Memory for Video Object Segmentation [32.06309833058726]
We propose a Quality-aware Dynamic Memory Network (QDMN) to evaluate the segmentation quality of each frame.
Our QDMN achieves new state-of-the-art performance on both the DAVIS and YouTube-VOS benchmarks; a toy quality gate follows this entry.
arXiv Detail & Related papers (2022-07-16T12:18:04Z)
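A toy version of quality-aware memorization, assuming a learned quality scorer (`quality_net` and the threshold `tau` are placeholders, not QDMN's actual components):

```python
def maybe_memorize(memory, frame_feat, mask_logits, quality_net, tau=0.8):
    """Quality-aware memory update (sketch): a learned scorer estimates how
    good the predicted mask is, and only frames clearing the threshold are
    stored, keeping poor segmentations out of the memory bank."""
    score = quality_net(frame_feat, mask_logits)   # estimated quality in [0, 1]
    if score >= tau:
        memory.append((frame_feat, mask_logits))
    return memory
```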
- Adaptive Memory Management for Video Object Segmentation [6.282068591820945]
A matching-based network stores every k-th frame in an external memory bank for future inference.
The size of the memory bank gradually increases with the length of the video, which slows down inference and makes it impractical to handle videos of arbitrary length.
This paper proposes an adaptive memory bank strategy for matching-based networks for semi-supervised video object segmentation (VOS) that can handle videos of arbitrary length by discarding obsolete features; a minimal sketch follows this entry.
arXiv Detail & Related papers (2022-04-13T19:59:07Z)
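A minimal fixed-capacity bank in this spirit, using a least-recently-matched eviction rule as a stand-in for the paper's notion of obsolete features (the capacity, stride, and eviction rule are assumptions):

```python
from collections import OrderedDict

class AdaptiveMemoryBank:
    """Bounded feature memory (sketch): stores every k-th frame and, once
    full, evicts the least-recently-matched entry, so memory stays constant
    for arbitrarily long videos."""
    def __init__(self, capacity=20, every_k=5):
        self.capacity, self.every_k = capacity, every_k
        self.bank = OrderedDict()              # frame_idx -> feature tensor

    def maybe_store(self, frame_idx, feature):
        if frame_idx % self.every_k != 0:
            return
        if len(self.bank) >= self.capacity:
            self.bank.popitem(last=False)      # drop the stalest entry
        self.bank[frame_idx] = feature

    def mark_matched(self, frame_idx):
        self.bank.move_to_end(frame_idx)       # refresh recency after a match
```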
- Video Object Segmentation with Episodic Graph Memory Networks [198.74780033475724]
A graph memory network is developed to address the novel idea of "learning to update the segmentation model".
We exploit an episodic memory network, organized as a fully connected graph, to store frames as nodes and capture cross-frame correlations by edges.
The proposed graph memory network yields a neat yet principled framework, which generalizes well to both one-shot and zero-shot video object segmentation tasks.
arXiv Detail & Related papers (2020-07-14T13:19:19Z)