Related papers: MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding

URL: http://arxiv.org/abs/2404.05726v2
Date: Wed, 24 Apr 2024 15:38:48 GMT
Title: MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Authors: Bo He, Hengduo Li, Young Kyun Jang, Menglin Jia, Xuefei Cao, Ashish Shah, Abhinav Shrivastava, Ser-Nam Lim,
Abstract summary: This study focuses on designing an efficient and effective model for long-term video understanding. We propose to process videos in an online manner and store past video information in a memory bank. Our model can achieve state-of-the-art performances across multiple datasets.
Score: 66.56100008577134
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model can achieve state-of-the-art performances across multiple datasets. Code available at https://boheumd.github.io/MA-LMM/.

Related papers

FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding [70.56829394569938]
We propose Frame Selection Augmented Generation (FRAG) to process long inputs without long context LMMs. The core of the selection process is done by scoring each frame independently, which does not require long context processing. We show that FRAG consistently improves the performance and achieves state-of-the-art performances for both long video and long document understanding.
arXiv Detail & Related papers (2025-04-24T11:19:18Z)
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video large language models (LM) via long and rich context (LRC) modeling. We develop a new version of InternVideo2.5 with focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos. Experimental results demonstrate this unique designML LRC greatly improves the results of video MLLM in mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z)
AIM: Adaptive Inference of Multi-Modal LLMs via Token Merging and Pruning [19.68349294206012]
We propose a training-free adaptive inference method for multi-modal LLMs. With a minimalist design, our method can be applied to both video and image LLMs. Under a similar computational cost, our method outperforms the state-of-the-art methods in long video understanding.
arXiv Detail & Related papers (2024-12-04T11:47:57Z)
Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding [26.72068455284472]
Video-XL is an extra-long vision language model designed for efficient hour-scale video understanding. Our model achieves promising results on popular long video understanding benchmarks.
arXiv Detail & Related papers (2024-09-22T15:13:31Z)
Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos. In this paper, we approach this problem from the perspective of the language model. By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z)
LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z)
LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding. We encode video representations that incorporate both local and global information. Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
Understanding Long Videos with Multimodal Language Models [44.78900245769057]
Large Language Models (LLMs) have allowed recent approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how extensive world knowledge and strong reasoning skills of underlying LLMs influence this strong performance. Our resulting Multimodal Video Understanding framework demonstrates state-of-the-art performance across multiple video understanding benchmarks.
arXiv Detail & Related papers (2024-03-25T17:59:09Z)
Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding. The emergent capabilities Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning. This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
OneLLM: One Framework to Align All Modalities with Language [90.14915575477197]
We present OneLLM, an MLLM that aligns eight modalities to language using a unified framework. OneLLM is evaluated on 25 diverse benchmarks, encompassing tasks such as multimodal captioning, question answering and reasoning.
arXiv Detail & Related papers (2023-12-06T18:59:19Z)
VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. We evaluate various baseline methods with and without large-scale VidL pre-training. The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.