Related papers: Video Understanding with Large Language Models: A Survey

Video Understanding with Large Language Models: A Survey

URL: http://arxiv.org/abs/2312.17432v4
Date: Wed, 24 Jul 2024 04:44:11 GMT
Title: Video Understanding with Large Language Models: A Survey
Authors: Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, Chenliang Xu,
Abstract summary: Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding. The emergent capabilities Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning. This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
Score: 97.29126722004949
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the burgeoning growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding that harness the power of LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity (general, temporal, and spatiotemporal) reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into three main types: Video Analyzer x LLM, Video Embedder x LLM, and (Analyzer + Embedder) x LLM. Furthermore, we identify five sub-types based on the functions of LLMs in Vid-LLMs: LLM as Summarizer, LLM as Manager, LLM as Text Decoder, LLM as Regressor, and LLM as Hidden Layer. Furthermore, this survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs. Additionally, it explores the expansive applications of Vid-LLMs across various domains, highlighting their remarkable scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are recommended to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.

Related papers

VideoSSR: Video Self-Supervised Reinforcement Learning [62.25888935329454]
Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs)<n>This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data?
arXiv Detail & Related papers (2025-11-09T08:36:40Z)
A Survey on Video Temporal Grounding with Multimodal Large Language Model [107.24431595873808]
Recent advancement in temporal grounding (VTG) has significantly enhanced fine-grained video understanding.<n>With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods.<n>Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce.
arXiv Detail & Related papers (2025-08-07T08:52:11Z)
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video large language models (LM) via long and rich context (LRC) modeling. We develop a new version of InternVideo2.5 with focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos. Experimental results demonstrate this unique designML LRC greatly improves the results of video MLLM in mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z)
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. We introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z)
Survey of different Large Language Model Architectures: Trends, Benchmarks, and Challenges [15.850548556536538]
Large Language Models (LLMs) represent a class of deep learning models adept at understanding natural language. An advanced subset of these models, known as Multimodal Large Language Models (MLLMs), extends LLM capabilities to process and interpret multiple data modalities. This survey provides a comprehensive overview of the recent advancements in LLMs.
arXiv Detail & Related papers (2024-12-04T11:14:06Z)
Harnessing LLMs for Automated Video Content Analysis: An Exploratory Workflow of Short Videos on Depression [8.640838598568605]
We conduct a case study that followed a new workflow of Large Language Models (LLMs)-assisted multimodal content analysis. To test LLM's video annotation capabilities, we analyzed 203s extracted from 25 short videos about depression.
arXiv Detail & Related papers (2024-06-27T21:03:56Z)
MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding [67.56182262082729]
We introduce MMBench-Video, a quantitative benchmark to rigorously evaluate large vision-language models (LVLMs) in video understanding. MMBench-Video incorporates lengthy videos from YouTube and employs free-form questions, mirroring practical use cases. The benchmark is meticulously crafted to probe the models' temporal reasoning skills, with all questions human-annotated according to a carefully constructed ability taxonomy.
arXiv Detail & Related papers (2024-06-20T17:26:01Z)
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES) CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
From Image to Video, what do we need in multimodal LLMs? [19.85928004619801]
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information. We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs. Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models.
arXiv Detail & Related papers (2024-04-18T02:43:37Z)
ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation. How to effectively encode and understand videos in video-based dialogue systems remains to be solved. We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
Understanding Long Videos with Multimodal Language Models [44.78900245769057]
Large Language Models (LLMs) have allowed recent approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how extensive world knowledge and strong reasoning skills of underlying LLMs influence this strong performance. Our resulting Multimodal Video Understanding framework demonstrates state-of-the-art performance across multiple video understanding benchmarks.
arXiv Detail & Related papers (2024-03-25T17:59:09Z)
LLM4VG: Large Language Models Evaluation for Video Grounding [39.40610479454726]
This paper systematically evaluates the performance of different LLMs on video grounding tasks. We propose prompt methods to integrate the instruction of VG and description from different kinds of generators. Our experimental evaluations lead to two conclusions: (i) the existing VidLLMs are still far away from achieving satisfactory video grounding performance, and more time-related video tasks should be included to further fine-tune these models.
arXiv Detail & Related papers (2023-12-21T08:15:02Z)
Towards Vision Enhancing LLMs: Empowering Multimodal Knowledge Storage and Sharing in LLMs [72.49064988035126]
We propose an approach called MKS2, aimed at enhancing multimodal large language models (MLLMs) Specifically, we introduce the Modular Visual Memory, a component integrated into the internal blocks of LLMs, designed to store open-world visual information efficiently. Our experiments demonstrate that MKS2 substantially augments the reasoning capabilities of LLMs in contexts necessitating physical or commonsense knowledge.
arXiv Detail & Related papers (2023-11-27T12:29:20Z)
VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks. We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs. VideoLLM incorporates a carefully designed Modality and Semantic Translator, which convert inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.