Related papers: Hierarchical Memory for Long Video QA

URL: http://arxiv.org/abs/2407.00603v1
Date: Sun, 30 Jun 2024 06:08:12 GMT
Title: Hierarchical Memory for Long Video QA
Authors: Yiqin Wang, Haoji Zhang, Yansong Tang, Yong Liu, Jiashi Feng, Jifeng Dai, Xiaojie Jin
Abstract summary: This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA). We adopt a hierarchical memory mechanism named STAR Memory, that is capable of processing long videos with limited GPU memory (VRAM). We further utilize the video and audio data of MovieChat-1K training set to fine-tune the pretrained weight released by Flash-VStream, achieving 1st place in the challenge.
Score: 78.72965584414368
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: This paper describes our champion solution to the LOVEU Challenge @ CVPR'24, Track 1 (Long Video VQA). Processing long sequences of visual tokens is computationally expensive and memory-intensive, making long video question-answering a challenging task. The key is to compress visual tokens effectively, reducing memory footprint and decoding latency, while preserving the essential information for accurate question-answering. We adopt a hierarchical memory mechanism named STAR Memory, proposed in Flash-VStream, that is capable of processing long videos with limited GPU memory (VRAM). We further utilize the video and audio data of MovieChat-1K training set to fine-tune the pretrained weight released by Flash-VStream, achieving 1st place in the challenge. Code is available at project homepage https://invinciblewyq.github.io/vstream-page
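As an illustration of the general idea in the abstract (keeping the visual token budget bounded however long the video runs), here is a minimal sketch in Python. It is not the STAR Memory mechanism from Flash-VStream, whose details are not given on this page. It assumes a simpler, swapped-in technique: a fixed-capacity memory that merges the two most similar adjacent frame features whenever the budget is exceeded. The class name BoundedFrameMemory and all of its methods are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class BoundedFrameMemory:
    """Hypothetical fixed-capacity memory of per-frame feature vectors.

    Assumption for illustration only: when the memory is over capacity,
    the most similar pair of temporally adjacent entries is merged into
    their count-weighted mean, so memory use stays constant.
    """

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.features: List[Vector] = []
        self.counts: List[int] = []  # number of frames each entry summarizes

    def add(self, feature: Vector) -> None:
        """Append a new frame feature, then compress back down to capacity."""
        self.features.append(list(feature))
        self.counts.append(1)
        while len(self.features) > self.capacity:
            self._merge_most_similar_neighbors()

    def _merge_most_similar_neighbors(self) -> None:
        # Find the adjacent pair with the highest cosine similarity.
        best = max(
            range(len(self.features) - 1),
            key=lambda i: cosine(self.features[i], self.features[i + 1]),
        )
        ca, cb = self.counts[best], self.counts[best + 1]
        merged = [
            (ca * x + cb * y) / (ca + cb)
            for x, y in zip(self.features[best], self.features[best + 1])
        ]
        # Replace the pair with a single merged entry.
        self.features[best:best + 2] = [merged]
        self.counts[best:best + 2] = [ca + cb]
```

Feeding frames one at a time through add() keeps at most capacity entries in memory, while the counts record how many original frames each entry stands for.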

Related papers

Long-VMNet: Accelerating Long-Form Video Understanding via Fixed Memory [5.311777874655448]
Long-Video Memory Network, Long-VMNet, is a novel video understanding method. Long-VMNet achieves improved efficiency by leveraging a neural sampler that identifies discriminative tokens. Our results on the Rest-ADL dataset demonstrate an 18x to 75x improvement in inference times for long-form video retrieval and question answering.
arXiv Detail & Related papers (2025-03-17T20:25:41Z)
ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information. We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity. We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z)
Extending Video Masked Autoencoders to 128 frames [75.01251612160829]
Video understanding has witnessed significant progress with recent video foundation models demonstrating strong performance owing to self-supervised pre-training objectives; Masked Autoencoders (MAE) being the design of choice. However, the majority of prior works that leverage MAE pre-training have focused on relatively short video representations (16 / 32 frames in length) largely due to hardware memory and compute limitations that scale poorly with video length due to the dense memory-intensive self-attention decoding. We propose an effective strategy for prioritizing tokens which allows training on longer video sequences (128 frames) and gets better performance than, more typical, random
arXiv Detail & Related papers (2024-11-20T20:00:38Z)
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos. We leverage DINOv2 features to remove redundant frames that exhibit high similarity. We perform spatial token reduction across frames based on their temporal dependencies.
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding. It understands arbitrary-length video using a constant number of video tokens that are encoded in a streaming fashion and then selected. Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
MovieChat+: Question-aware Sparse Memory for Long Video Question Answering [36.14140811797466]
We propose MovieChat to overcome the challenges of understanding long videos. We use tokens in Transformers as the carriers of memory in combination with our specially designed memory mechanism. MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos, 2K temporal grounding labels, and 14K manual annotations for validation of the effectiveness of our method.
arXiv Detail & Related papers (2024-04-26T06:17:04Z)
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding [38.504994472886786]
MovieChat achieves state-of-the-art performance in long video understanding, along with the released MovieChat-1K benchmark with 1K long videos and 14K manual annotations.
arXiv Detail & Related papers (2023-07-31T07:15:45Z)
READMem: Robust Embedding Association for a Diverse Memory in Unconstrained Video Object Segmentation [24.813416082160224]
We present READMem, a modular framework for sVOS methods to handle unconstrained videos. We propose a robust association of the embeddings stored in the memory with query embeddings during the update process. Our approach achieves competitive results on the Long-time Video dataset (LV1) while not hindering performance on short sequences.
arXiv Detail & Related papers (2023-05-22T08:31:16Z)
EVEREST: Efficient Masked Video Autoencoder by Removing Redundant Spatiotemporal Tokens [57.354304637367555]
We present EVEREST, a surprisingly efficient MVA approach for video representation learning. It finds tokens containing rich motion features and discards uninformative ones during both pre-training and fine-tuning. Our method significantly reduces the computation and memory requirements of MVA.
arXiv Detail & Related papers (2022-11-19T09:57:01Z)
Recurrent Dynamic Embedding for Video Object Segmentation [54.52527157232795]
We propose a Recurrent Dynamic Embedding (RDE) to build a memory bank of constant size. We propose an unbiased guidance loss during the training stage, which makes SAM (the Spatio-temporal Aggregation Module that generates and updates the RDE) more robust in long videos. We also design a novel self-correction strategy so that the network can repair the embeddings of masks with different qualities in the memory bank.
arXiv Detail & Related papers (2022-05-08T02:24:43Z)

This list is automatically generated from the titles and abstracts of the papers on this site.