Grounding-Prompter: Prompting LLM with Multimodal Information for
Temporal Sentence Grounding in Long Videos
- URL: http://arxiv.org/abs/2312.17117v1
- Date: Thu, 28 Dec 2023 16:54:21 GMT
- Title: Grounding-Prompter: Prompting LLM with Multimodal Information for
Temporal Sentence Grounding in Long Videos
- Authors: Houlun Chen, Xin Wang, Hong Chen, Zihan Song, Jia Jia, Wenwu Zhu
- Abstract summary: Temporal Sentence Grounding (TSG) aims to localize moments from videos based on the given natural language queries.
Existing works are mainly designed for short videos, failing to handle TSG in long videos.
We propose a Grounding-Prompter method, which is capable of conducting TSG in long videos through prompting LLM with multimodal information.
- Score: 42.32528440002539
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Temporal Sentence Grounding (TSG), which aims to localize moments from videos
based on the given natural language queries, has attracted widespread
attention. Existing works are mainly designed for short videos, failing to
handle TSG in long videos, which poses two challenges: i) the complicated contexts
of long videos require temporal reasoning over longer moment sequences, and ii)
multiple modalities, including information-rich textual speech, require
special designs for content understanding in long videos. To tackle these
challenges, in this work we propose a Grounding-Prompter method, which is
capable of conducting TSG in long videos through prompting LLM with multimodal
information. In detail, we first transform the TSG task and its multimodal
inputs, including speech and visual content, into a compressed task textualization.
Furthermore, to enhance temporal reasoning under complicated contexts, we propose a
Boundary-Perceptive Prompting strategy that is three-fold:
i) we design a novel Multiscale Denoising Chain-of-Thought (CoT) that combines
global and local semantics with step-by-step noise filtering, ii) we set up
validity principles that constrain the LLM to generate reasonable
predictions in specific formats, and iii) we introduce one-shot
In-Context Learning (ICL) to boost reasoning through imitation, enhancing the LLM's
understanding of the TSG task. Experiments demonstrate the state-of-the-art
performance of our Grounding-Prompter method, revealing the benefits of
prompting LLM with multimodal information for TSG in long videos.
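As a rough illustration of how the approach described in the abstract could be assembled, the Python sketch below builds a prompt that combines compressed task textualization (timestamped captions and speech transcripts), validity principles, a multiscale denoising chain-of-thought instruction, and a one-shot in-context example. All function names, prompt wording, and inputs are illustrative assumptions, not the authors' actual implementation.

```python
# Illustrative sketch (not the authors' code): assembling a Boundary-Perceptive
# prompt from a compressed task textualization, as described in the abstract.
# The timestamped captions/transcripts and all prompt wording are hypothetical.

def textualize(segments):
    """Compress timestamped (start, end, text) tuples into one prompt block."""
    return "\n".join(f"[{s:.0f}s-{e:.0f}s] {t}" for s, e, t in segments)

def build_prompt(query, visual_captions, speech_transcripts, icl_example):
    validity_principles = (
        "Rules: output exactly one span as 'start-end' in seconds; "
        "the span must lie inside the video and start < end."
    )
    multiscale_cot = (
        "Think step by step: first pick the coarse region of the video relevant "
        "to the query, then discard noisy segments, then refine the boundaries "
        "inside the remaining region."
    )
    return "\n\n".join([
        "Task: locate the moment matching the query in the long video.",
        f"Query: {query}",
        "Visual captions:\n" + textualize(visual_captions),
        "Speech transcripts:\n" + textualize(speech_transcripts),
        validity_principles,
        multiscale_cot,
        "Example:\n" + icl_example,   # one-shot in-context demonstration
        "Answer:",
    ])

# Toy usage; a real system would feed ASR and captioning outputs here.
prompt = build_prompt(
    query="the chef flips the pancake",
    visual_captions=[(0, 30, "a kitchen, a person mixing batter"),
                     (30, 60, "a pan on the stove, pancake flipped")],
    speech_transcripts=[(5, 12, "now let's make the batter")],
    icl_example="Query: the dog jumps -> 42-47",
)
```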
Related papers
- TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning [42.928144657587325]
This paper proposes TimeSuite, a collection of new designs that adapt existing short-form video MLLMs for long video understanding.
TimeSuite provides a successful solution for enhancing the long-video understanding capability of short-form MLLMs.
In addition, we introduce the TimePro, a comprehensive grounding-centric instruction dataset composed of 9 tasks and 349k high-quality grounded annotations.
arXiv Detail & Related papers (2024-10-25T17:19:55Z)
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models [53.235170710385006]
We introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner.
We sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge.
In experiments, Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA.
arXiv Detail & Related papers (2024-10-04T10:04:37Z)
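The Grounded-VideoLLM entry above mentions discrete temporal tokens enriched with time knowledge. The minimal sketch below shows one common way such tokens can be realized, by quantizing timestamps into a fixed vocabulary of time tokens; the bin count and token format are assumptions for illustration, not the paper's exact scheme.

```python
# Hedged sketch: quantizing continuous timestamps into discrete temporal tokens,
# one common way to realize the "discrete temporal tokens" idea mentioned above.
# The bin count and token format are illustrative assumptions, not the paper's scheme.

NUM_TIME_TOKENS = 100  # e.g. <T0> ... <T99>, each covering 1% of the video

def time_to_token(timestamp_sec: float, video_duration_sec: float) -> str:
    """Map a timestamp to a discrete time token relative to the video length."""
    frac = min(max(timestamp_sec / video_duration_sec, 0.0), 1.0)
    idx = min(int(frac * NUM_TIME_TOKENS), NUM_TIME_TOKENS - 1)
    return f"<T{idx}>"

def token_to_time(token: str, video_duration_sec: float) -> float:
    """Invert the mapping: recover the bin-center timestamp from a time token."""
    idx = int(token.strip("<T>"))
    return (idx + 0.5) / NUM_TIME_TOKENS * video_duration_sec

# e.g. in a 200 s video, second 57 falls into bin 28, i.e. token "<T28>"
assert time_to_token(57.0, 200.0) == "<T28>"
```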
- ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models [53.9661582975843]
Video Temporal Grounding (VTG) aims to ground specific segments within an untrimmed video corresponding to a given natural language query.
Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases.
We present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot video temporal grounding.
arXiv Detail & Related papers (2024-10-01T08:27:56Z)
- Learning to Localize Actions in Instructional Videos with LLM-Based Multi-Pathway Text-Video Alignment [53.12952107996463]
This work proposes a novel training framework for learning to localize temporal boundaries of procedure steps in training videos.
Motivated by the strong capabilities of Large Language Models (LLMs) in procedure understanding and text summarization, we first apply an LLM to filter out task-irrelevant information and summarize task-related procedure steps from narrations.
To further generate reliable pseudo-matching between the LLM-steps and the video for training, we propose the Multi-Pathway Text-Video Alignment (MPTVA) strategy.
arXiv Detail & Related papers (2024-09-22T18:40:55Z)
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
However, how to effectively encode and understand videos in video-based dialogue systems remains an open problem.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs [22.696090318037925]
Long video understanding is a significant and ongoing challenge in the intersection of multimedia and artificial intelligence.
We present an Interactive Visual Adapter (IVA) within large language models (LLMs) to enhance interaction with fine-grained visual elements.
arXiv Detail & Related papers (2024-02-21T05:56:52Z)
- VTimeLLM: Empower LLM to Grasp Video Moments [43.51980030572101]
Large language models (LLMs) have shown remarkable text understanding capabilities.
However, existing Video LLMs can only provide a coarse description of the entire video.
We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
arXiv Detail & Related papers (2023-11-30T10:49:56Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast, adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
Its InsOVER algorithm locates the corresponding video events via an efficient Hungarian matching between decompositions of linguistic instructions and video events.
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
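The VidCoM entry above mentions Hungarian matching between decompositions of linguistic instructions and video events. The sketch below shows such a bipartite assignment using scipy's linear_sum_assignment over a cosine-similarity matrix; the embeddings and similarity choice are toy assumptions, not the InsOVER algorithm itself.

```python
# Hedged sketch: bipartite matching between instruction parts and video events,
# in the spirit of the Hungarian matching mentioned in the VidCoM summary above.
# The random embeddings and cosine similarity are toy stand-ins, not InsOVER itself.
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instructions_to_events(instr_embs: np.ndarray, event_embs: np.ndarray):
    """Return (instruction_idx, event_idx) pairs maximizing total cosine similarity."""
    instr = instr_embs / np.linalg.norm(instr_embs, axis=1, keepdims=True)
    events = event_embs / np.linalg.norm(event_embs, axis=1, keepdims=True)
    similarity = instr @ events.T                      # cosine similarity matrix
    rows, cols = linear_sum_assignment(-similarity)    # Hungarian algorithm (maximize)
    return list(zip(rows.tolist(), cols.tolist()))

# Toy usage: 3 instruction decompositions vs. 4 detected video events.
rng = np.random.default_rng(0)
pairs = match_instructions_to_events(rng.normal(size=(3, 8)), rng.normal(size=(4, 8)))
print(pairs)  # e.g. [(0, 2), (1, 0), (2, 3)]
```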
This list is automatically generated from the titles and abstracts of the papers in this site.