Related papers: Efficient Video Object Segmentation via Modulated Cross-Attention Memory

Efficient Video Object Segmentation via Modulated Cross-Attention Memory

URL: http://arxiv.org/abs/2403.17937v1
Date: Tue, 26 Mar 2024 17:59:58 GMT
Title: Efficient Video Object Segmentation via Modulated Cross-Attention Memory
Authors: Abdelrahman Shaker, Syed Talal Wasim, Martin Danelljan, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan,
Abstract summary: We propose a transformer-based approach, named MAVOS, to model temporal smoothness without requiring frequent memory expansion. Our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU.
Score: 123.12273176475863
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation. However, these approaches typically struggle on long videos due to increased GPU memory demands, as they frequently expand the memory bank every few frames. We propose a transformer-based approach, named MAVOS, that introduces an optimized and dynamic long-term modulated cross-attention (MCA) memory to model temporal smoothness without requiring frequent memory expansion. The proposed MCA effectively encodes both local and global features at various levels of granularity while efficiently maintaining consistent speed regardless of the video length. Extensive experiments on multiple benchmarks, LVOS, Long-Time Video, and DAVIS 2017, demonstrate the effectiveness of our proposed contributions leading to real-time inference and markedly reduced memory demands without any degradation in segmentation accuracy on long videos. Compared to the best existing transformer-based approach, our MAVOS increases the speed by 7.6x, while significantly reducing the GPU memory by 87% with comparable segmentation performance on short and long video datasets. Notably on the LVOS dataset, our MAVOS achieves a J&F score of 63.3% while operating at 37 frames per second (FPS) on a single V100 GPU. Our code and models will be publicly available at: https://github.com/Amshaker/MAVOS.

Related papers

Flash-VStream: Efficient Real-Time Understanding for Long Video Streams [64.25549822010372]
Flash-VStream is a video language model capable of processing extremely long videos and responding to user queries in real time.<n>Compared to existing models, Flash-VStream achieves significant reductions in inference latency.
arXiv Detail & Related papers (2025-06-30T13:17:49Z)
Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory [5.311777874655448]
Long-Video Memory Network, Long-VMNet, is a novel video understanding method. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Our results on the Rest-ADL dataset demonstrate an 18x -- 75x improvement in inference times for long-form video retrieval and answering questions.
arXiv Detail & Related papers (2025-03-17T20:25:41Z)
Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z)
LiVOS: Light Video Object Segmentation with Gated Linear Matching [116.58237547253935]
LiVOS is a lightweight memory network that employs linear matching via linear attention. For longer and higher-resolution videos, it matched STM-based methods with 53% less GPU memory and supports 4096p inference on a 32G consumer-grade GPU.
arXiv Detail & Related papers (2024-11-05T05:36:17Z)
VideoLLaMB: Long-context Video Understanding with Recurrent Memory Bridges [42.555895949250704]
VideoLLaMB is a novel framework that utilizes temporal memory tokens within bridge layers to allow for the encoding of entire video sequences. SceneTilling algorithm segments videos into independent semantic units to preserve semantic integrity. In terms of efficiency, VideoLLaMB, trained on 16 frames, supports up to 320 frames on a single Nvidia A100 GPU.
arXiv Detail & Related papers (2024-09-02T08:52:58Z)
Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding. It capably understands arbitrary-length video with a constant number of video streaming tokens encoded and propagatedly selected. Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
READMem: Robust Embedding Association for a Diverse Memory in Unconstrained Video Object Segmentation [24.813416082160224]
We present READMem, a modular framework for sVOS methods to handle unconstrained videos. We propose a robust association of the embeddings stored in the memory with query embeddings during the update process. Our approach achieves competitive results on the Long-time Video dataset (LV1) while not hindering performance on short sequences.
arXiv Detail & Related papers (2023-05-22T08:31:16Z)
Region Aware Video Object Segmentation with Deep Motion Modeling [56.95836951559529]
Region Aware Video Object (RAVOS) is a method that predicts regions of interest for efficient object segmentation and memory storage. For efficient segmentation, object features are extracted according to the ROIs, and an object decoder is designed for object-level segmentation. For efficient memory storage, we propose motion path memory to filter out redundant context by memorizing the features within the motion path of objects between two frames.
arXiv Detail & Related papers (2022-07-21T01:44:40Z)
Learning Quality-aware Dynamic Memory for Video Object Segmentation [32.06309833058726]
We propose a Quality-aware Dynamic Memory Network (QDMN) to evaluate the segmentation quality of each frame. Our QDMN achieves new state-of-the-art performance on both DAVIS and YouTube-VOS benchmarks.
arXiv Detail & Related papers (2022-07-16T12:18:04Z)
XMem: Long-Term Video Object Segmentation with an Atkinson-Shiffrin Memory Model [137.50614198301733]
We present XMem, a video object segmentation architecture for long videos with unified feature memory stores. We develop an architecture that incorporates multiple independent yet deeply-connected feature memory stores. XMem greatly exceeds state-of-the-art performance on long-video datasets.
arXiv Detail & Related papers (2022-07-14T17:59:37Z)
MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition [74.35009770905968]
We build a memory-augmented vision transformer that has a temporal support 30x longer than existing models. MeMViT obtains state-of-the-art results on the AVA, EPIC-Kitchens-100 action classification, and action anticipation datasets.
arXiv Detail & Related papers (2022-01-20T18:59:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.