Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
- URL: http://arxiv.org/abs/2510.17023v1
- Date: Sun, 19 Oct 2025 22:12:45 GMT
- Title: Enrich and Detect: Video Temporal Grounding with Multimodal LLMs
- Authors: Shraman Pramanick, Effrosyni Mavroudi, Yale Song, Rama Chellappa, Lorenzo Torresani, Triantafyllos Afouras
- Abstract summary: We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings.
- Score: 60.224522472631776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce ED-VTG, a method for fine-grained video temporal grounding utilizing multi-modal large language models. Our approach harnesses the capabilities of multimodal LLMs to jointly process text and video, in order to effectively localize natural language queries in videos through a two-stage process. Rather than being directly grounded, language queries are initially transformed into enriched sentences that incorporate missing details and cues to aid in grounding. In the second stage, these enriched queries are grounded using a lightweight decoder, which specializes in predicting accurate boundaries conditioned on contextualized representations of the enriched queries. To mitigate noise and reduce the impact of hallucinations, our model is trained with a multiple-instance-learning objective that dynamically selects the optimal version of the query for each training sample. We demonstrate state-of-the-art results across various benchmarks in temporal video grounding and paragraph grounding settings. Experiments reveal that our method significantly outperforms all previously proposed LLM-based temporal grounding approaches and is either superior or comparable to specialized models, while maintaining a clear advantage against them in zero-shot evaluation scenarios.
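The abstract describes the multiple-instance-learning objective only at a high level. As a rough illustration (the function name, the tuple return, and the idea of precomputed per-variant losses are assumptions for exposition, not the authors' implementation), the dynamic selection step amounts to taking a per-sample minimum over the candidate query versions, so that a noisy or hallucinated enrichment does not contribute to the update:

```python
def mil_select(variant_losses):
    """MIL-style selection over query variants: for one training sample,
    keep only the variant (the original query or an LLM-enriched rewrite)
    whose grounding loss is lowest; only that variant drives the gradient.

    variant_losses: list of per-variant grounding losses for one sample.
    Returns (index of selected variant, its loss).
    """
    best_idx = min(range(len(variant_losses)), key=variant_losses.__getitem__)
    return best_idx, variant_losses[best_idx]

# Example: losses for the original query and two enriched rewrites
print(mil_select([0.9, 0.4, 1.2]))  # (1, 0.4) -> second variant is kept
```

In a training loop this minimum would be taken over differentiable loss tensors rather than floats, but the selection logic is the same.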
Related papers
- A Survey on Video Temporal Grounding with Multimodal Large Language Model [107.24431595873808]
Recent advances in video temporal grounding (VTG) have significantly enhanced fine-grained video understanding. With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce.
arXiv Detail & Related papers (2025-08-07T08:52:11Z)
- TSPO: Temporal Sampling Policy Optimization for Long-form Video Language Understanding [25.675553077419274]
Multimodal Large Language Models (MLLMs) have demonstrated significant progress in vision tasks, yet they still face challenges when processing long-duration inputs. We propose Temporal Sampling Policy Optimization (TSPO), advancing MLLMs' long-form video-language understanding via reinforcement learning. Our TSPO achieves state-of-the-art performance across multiple long video understanding benchmarks, and shows transferable ability across different cutting-edge Video-MLLMs.
arXiv Detail & Related papers (2025-08-06T12:03:36Z) - Training-free Video Temporal Grounding using Large-scale Pre-trained Models [41.71055776623368]
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query.
Existing video temporal localization models rely on specific datasets for training and have high data collection costs.
We propose a Training-Free Video Temporal Grounding approach that leverages the ability of pre-trained large models.
arXiv Detail & Related papers (2024-08-29T02:25:12Z) - VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - Aligning Language Models with Demonstrated Feedback [58.834937450242975]
Demonstration ITerated Task Optimization (DITTO) directly aligns language model outputs to a user's demonstrated behaviors. We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts.
arXiv Detail & Related papers (2024-06-02T23:13:56Z) - Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods are prone to be domain-specific, thus being costly for practical deployment.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z) - Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding [108.79026216923984]
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query.
This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task.
arXiv Detail & Related papers (2023-12-31T13:53:37Z) - Language-free Training for Zero-shot Video Grounding [50.701372436100684]
Video grounding aims to localize the time interval by understanding the text and video simultaneously.
One of the most challenging issues is the extremely time- and cost-consuming collection of annotations.
We present a simple yet novel training framework for video grounding in the zero-shot setting.
arXiv Detail & Related papers (2022-10-24T06:55:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.