Grounding-Prompter: Prompting LLM with Multimodal Information for
Temporal Sentence Grounding in Long Videos
- URL: http://arxiv.org/abs/2312.17117v1
- Date: Thu, 28 Dec 2023 16:54:21 GMT
- Title: Grounding-Prompter: Prompting LLM with Multimodal Information for
Temporal Sentence Grounding in Long Videos
- Authors: Houlun Chen, Xin Wang, Hong Chen, Zihan Song, Jia Jia, Wenwu Zhu
- Abstract summary: Temporal Sentence Grounding (TSG) aims to localize moments in videos based on given natural language queries.
Existing works are mainly designed for short videos and fail to handle TSG in long videos.
We propose Grounding-Prompter, a method that conducts TSG in long videos by prompting an LLM with multimodal information.
- Score: 42.32528440002539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal Sentence Grounding (TSG), which aims to localize moments in videos
based on given natural language queries, has attracted widespread attention.
Existing works are mainly designed for short videos and fail to handle TSG in
long videos, which poses two challenges: i) complicated contexts in long videos
require temporal reasoning over longer moment sequences, and ii) multiple
modalities, including textual speech with rich information, require special
designs for content understanding in long videos. To tackle these challenges,
in this work we propose Grounding-Prompter, a method capable of conducting TSG
in long videos by prompting an LLM with multimodal information. In detail, we
first transform the TSG task and its multimodal inputs, including speech and
visual content, into a compressed task textualization. Furthermore, to enhance
temporal reasoning under complicated contexts, we propose a Boundary-Perceptive
Prompting strategy with three components: i) a novel Multiscale Denoising
Chain-of-Thought (CoT) that combines global and local semantics with
step-by-step noise filtering, ii) validity principles that constrain the LLM to
generate reasonable predictions in specific formats, and iii) one-shot
In-Context Learning (ICL) that boosts reasoning through imitation, enhancing
the LLM's understanding of the TSG task. Experiments demonstrate the
state-of-the-art performance of our Grounding-Prompter method, revealing the
benefits of prompting an LLM with multimodal information for TSG in long
videos.
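To make the pipeline concrete, here is a minimal sketch, in Python, of how a query, a speech transcript, and visual captions might be compressed into such a textual prompt. This is not the authors' code: the data layout, the windowing policy, and every name below (Caption, textualize, the rule and example strings) are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Caption:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str     # an ASR speech segment or a visual caption

# Stand-ins for the paper's validity principles and one-shot ICL example.
VALIDITY_PRINCIPLES = (
    "Rules: 1) Answer with a single span 'start-end' in seconds. "
    "2) The span must lie inside the video duration. "
    "3) start must be strictly less than end."
)
ONE_SHOT_EXAMPLE = (
    "Example -- Query: 'the chef flips the pancake'\n"
    "Captions: [12.0-15.0] a pan on a stove; [15.0-18.0] a pancake is flipped\n"
    "Answer: 15.0-18.0"
)

def textualize(query: str, speech: List[Caption], visual: List[Caption],
               duration: float, window: float = 60.0) -> str:
    """Compress the query and multimodal inputs into one textual prompt."""
    def fmt(segs: List[Caption]) -> str:
        return "\n".join(f"[{c.start:.1f}-{c.end:.1f}] {c.text}" for c in segs)

    # Global view: a coarse subsample of captions covering the whole video.
    global_view = fmt(visual[::2])

    # Local views: fixed windows the LLM can denoise one by one, coarse to fine.
    local_views, t = [], 0.0
    while t < duration:
        segs = [c for c in speech + visual if t <= c.start < t + window]
        if segs:
            local_views.append(f"Window {t:.0f}-{t + window:.0f}s:\n{fmt(segs)}")
        t += window

    return "\n\n".join([
        f"Task: locate the moment matching the query in a {duration:.0f}s video.",
        f"Query: {query}",
        VALIDITY_PRINCIPLES,
        ONE_SHOT_EXAMPLE,
        "Global context:\n" + global_view,
        "\n\n".join(local_views),
        "First discard irrelevant windows, then refine the boundary. Answer:",
    ])
```

The returned string would then be sent to an LLM; the global-then-local layout mirrors the Multiscale Denoising CoT described in the abstract, while the rule and example strings stand in for the validity principles and the one-shot ICL demonstration.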
Related papers
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [0.0]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- SpikeMba: Multi-Modal Spiking Saliency Mamba for Temporal Video Grounding [50.337896542603524]
We introduce SpikeMba, a multi-modal spiking saliency Mamba for temporal video grounding.
Our approach integrates Spiking Neural Networks (SNNs) with state space models (SSMs) to leverage their unique advantages.
Our experiments demonstrate the effectiveness of SpikeMba, which consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-04-01T15:26:44Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- LSTP: Language-guided Spatial-Temporal Prompt Learning for Long-form Video-Text Understanding [48.83009641950664]
We introduce a novel approach called Language-guided Spatial-Temporal Prompt Learning (LSTP)
This approach features two key components: a Temporal Prompt Sampler (TPS) with an optical-flow prior that leverages temporal information to efficiently extract relevant video content, and a Spatial Prompt Solver (SPS) that adeptly captures the intricate spatial relationships between visual and textual elements.
By harmonizing TPS and SPS with a cohesive training strategy, our framework significantly enhances computational efficiency, temporal understanding, and spatial-temporal alignment.
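As a rough illustration of the optical-flow prior idea (not the LSTP implementation; the frame-scoring and top-k policy here are assumptions of this sketch), one could rank frames by mean flow magnitude with OpenCV:

```python
import cv2
import numpy as np

def sample_frames_by_flow(video_path: str, k: int = 8) -> list:
    """Return indices of the k frames with the largest mean flow magnitude."""
    cap = cv2.VideoCapture(video_path)
    scores, prev, idx = [], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if prev is not None:
            # Dense Farneback flow between consecutive frames.
            flow = cv2.calcOpticalFlowFarneback(
                prev, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            # Mean motion magnitude as a temporal-saliency score for this frame.
            scores.append((float(np.linalg.norm(flow, axis=2).mean()), idx))
        prev, idx = gray, idx + 1
    cap.release()
    scores.sort(reverse=True)
    return sorted(i for _, i in scores[:k])
```

The selected indices would then serve as the temporally salient frames handed to the downstream video-text model.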
arXiv Detail & Related papers (2024-02-25T10:27:46Z)
- LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs [24.79384819644494]
Long video understanding is a significant and ongoing challenge in the intersection of multimedia and artificial intelligence.
We present an Interactive Visual Adapter (IVA) within large language models (LLMs) to enhance interaction with fine-grained visual elements.
arXiv Detail & Related papers (2024-02-21T05:56:52Z)
- Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
- VTimeLLM: Empower LLM to Grasp Video Moments [43.51980030572101]
Large language models (LLMs) have shown remarkable text understanding capabilities.
Existing video LLMs can only provide a coarse description of the entire video.
We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
arXiv Detail & Related papers (2023-11-30T10:49:56Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events through an efficient Hungarian matching between decompositions of linguistic instructions and video events (see the sketch below).
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
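To illustrate the matching step, here is a minimal sketch assuming instruction phrases and video events have already been scored for similarity. This is not VidCoM's code; the similarity matrix below is invented purely for demonstration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

instructions = ["person opens the fridge", "person pours milk"]
events = ["opening a door", "pouring liquid", "closing a drawer"]

# sim[i, j]: similarity between instruction i and event j, e.g. cosine
# similarity of text embeddings. These values are made up for the demo.
sim = np.array([[0.9, 0.1, 0.4],
                [0.2, 0.8, 0.1]])

# linear_sum_assignment minimizes total cost, so negate to maximize similarity.
rows, cols = linear_sum_assignment(-sim)
for i, j in zip(rows, cols):
    print(f"{instructions[i]!r} -> {events[j]!r} (sim={sim[i, j]:.2f})")
```

Hungarian matching guarantees a one-to-one assignment that maximizes total similarity, which is what lets each instruction decomposition be grounded to a distinct video event.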