A Survey on Video Temporal Grounding with Multimodal Large Language Model
- URL: http://arxiv.org/abs/2508.10922v1
- Date: Thu, 07 Aug 2025 08:52:11 GMT
- Title: A Survey on Video Temporal Grounding with Multimodal Large Language Model
- Authors: Jianlong Wu, Wei Liu, Ye Liu, Meng Liu, Liqiang Nie, Zhouchen Lin, Chang Wen Chen
- Abstract summary: Recent advancements in video temporal grounding (VTG) have significantly enhanced fine-grained video understanding. With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce.
- Score: 107.24431595873808
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding, primarily driven by multimodal large language models (MLLMs). With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. They not only achieve competitive performance but also excel in generalization across zero-shot, multi-task, and multi-domain settings. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce. To fill this gap, this survey systematically examines current research on VTG-MLLMs through a three-dimensional taxonomy: 1) the functional roles of MLLMs, highlighting their architectural significance; 2) training paradigms, analyzing strategies for temporal reasoning and task adaptation; and 3) video feature processing techniques, which determine spatiotemporal representation effectiveness. We further discuss benchmark datasets, evaluation protocols, and summarize empirical findings. Finally, we identify existing limitations and propose promising research directions. For additional resources and details, readers are encouraged to visit our repository at https://github.com/ki-lw/Awesome-MLLMs-for-Video-Temporal-Grounding.
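The survey itself does not prescribe a single interface, but a common VTG-MLLM pattern it describes is to prompt the model with a natural-language query over video tokens and have it emit start/end timestamps as plain text, which are then parsed and scored with temporal IoU against ground truth. The sketch below is a minimal illustration of that parse-and-evaluate step; the prompt wording, function names, and example response are hypothetical and not taken from the paper.

```python
import re
from typing import List, Tuple

# Hypothetical prompt template: a VTG-MLLM is given video tokens plus a query
# and asked to answer with an explicit time span in free text.
PROMPT_TEMPLATE = (
    "Watch the video and answer with the time span of the event.\n"
    "Query: {query}\n"
    "Answer in the form: 'from <start>s to <end>s'."
)

def parse_time_spans(model_output: str) -> List[Tuple[float, float]]:
    """Extract (start, end) second pairs from a free-text MLLM response."""
    pattern = r"from\s+([0-9]+(?:\.[0-9]+)?)s?\s+to\s+([0-9]+(?:\.[0-9]+)?)s?"
    spans = []
    for start, end in re.findall(pattern, model_output, flags=re.IGNORECASE):
        s, e = float(start), float(end)
        if e > s:  # keep only well-ordered spans
            spans.append((s, e))
    return spans

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Temporal IoU, the metric underlying common VTG scores such as R@1 at IoU=0.5."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

if __name__ == "__main__":
    fake_response = "The person opens the fridge from 12.5s to 18.0s."
    spans = parse_time_spans(fake_response)
    print(spans)                                  # [(12.5, 18.0)]
    print(temporal_iou(spans[0], (11.0, 19.0)))   # ~0.69
```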
Related papers
- VidVec: Unlocking Video MLLM Embeddings for Video-Text Retrieval [11.519642157641023]
This paper focuses on leveraging MLLMs for video-text embedding and retrieval. We first conduct a systematic layer-wise analysis, showing that intermediate (pre-trained) MLLM layers already encode substantial task-relevant information. We demonstrate that combining intermediate-layer embeddings with a calibrated MLLM head yields strong zero-shot retrieval performance without any training.
arXiv Detail & Related papers (2026-02-08T19:39:32Z) - TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs [81.78017865436816]
We present TimeLens, a systematic investigation into building MLLMs with strong video temporal grounding ability. We first expose critical quality issues in existing VTG benchmarks and introduce TimeLens-Bench, comprising meticulously re-annotated versions of three popular benchmarks. We also address noisy training data through an automated re-annotation pipeline, yielding TimeLens-100K, a large-scale, high-quality training dataset.
arXiv Detail & Related papers (2025-12-16T18:59:58Z) - Enrich and Detect: Video Temporal Grounding with Multimodal LLMs [60.224522472631776]
We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings.
arXiv Detail & Related papers (2025-10-19T22:12:45Z) - MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding [55.32878803528196]
Video temporal understanding is crucial for multimodal large language models (MLLMs) to reason over events in videos. We propose MUSEG, a novel RL-based method that enhances temporal understanding by introducing timestamp-aware multi-segment grounding. To facilitate effective learning, we design a customized RL training recipe with phased rewards that progressively guides the model toward temporally grounded reasoning.
arXiv Detail & Related papers (2025-05-27T04:50:07Z) - Keeping Yourself is Important in Downstream Tuning Multimodal Large Language Model [63.14883657299359]
Multi-modal Large Language Models (MLLMs) integrate visual and linguistic reasoning to address complex tasks such as image captioning and visual question answering. Tuning MLLMs for downstream tasks encounters two key challenges: Task-Expert, where distribution shifts between pre-training and target datasets constrain target performance, and OpenWorld Stabilization, where catastrophic forgetting erases the model's general knowledge.
arXiv Detail & Related papers (2025-03-06T15:29:13Z) - Weakly Supervised Temporal Action Localization via Dual-Prior Collaborative Learning Guided by Multimodal Large Language Models [33.37379526356273]
We introduce a novel learning paradigm termed MLLM4WTAL. It harnesses the potential of MLLMs to offer temporal action key semantics and complete semantic priors. It achieves this by integrating two distinct modules: Key Semantic Matching (KSM) and Complete Semantic Reconstruction (CSR).
arXiv Detail & Related papers (2024-11-13T09:37:24Z) - Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding. The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning. This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.