Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos
- URL: http://arxiv.org/abs/2503.08335v1
- Date: Tue, 11 Mar 2025 11:47:48 GMT
- Title: Prompt2LVideos: Exploring Prompts for Understanding Long-Form Multimodal Videos
- Authors: Soumya Shamarao Jahagirdar, Jayasree Saha, C V Jawahar
- Abstract summary: This paper introduces a dataset comprising long-form lectures and news videos. Recent advancements in Large Language Models (LLMs) promise to capture concise and informative content. We advocate for exploring prompt engineering techniques to comprehend long-form multimodal video datasets comprehensively.
- Score: 27.678107532472275
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Learning multimodal video understanding typically relies on datasets comprising video clips paired with manually annotated captions. However, this becomes even more challenging with long-form videos, lasting from minutes to hours, in the educational and news domains, because annotation requires annotators with subject expertise. Hence, automated solutions are needed. Recent advancements in Large Language Models (LLMs) promise to capture concise and informative content that allows entire videos to be comprehended by leveraging Automatic Speech Recognition (ASR) and Optical Character Recognition (OCR) technologies: ASR provides textual content from the audio, while OCR extracts textual content from specific frames. This paper introduces a dataset comprising long-form lectures and news videos. We present baseline approaches, examine their limitations on this dataset, and advocate for exploring prompt engineering techniques to comprehend long-form multimodal video datasets comprehensively.
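As a concrete illustration of the pipeline the abstract describes, the following is a minimal sketch, not the authors' code: it assumes the openai-whisper, opencv-python, and pytesseract packages, samples one frame every 30 seconds (an arbitrary choice), and leaves the final LLM call as a placeholder.

```python
# A minimal sketch of the ASR + OCR + prompting pipeline described above,
# not the authors' code. Assumes the openai-whisper, opencv-python, and
# pytesseract packages; the 30-second sampling interval and the prompt
# wording are illustrative choices.
import cv2
import pytesseract
import whisper

def transcribe_audio(video_path: str) -> str:
    """ASR: extract spoken content from the video's audio track."""
    model = whisper.load_model("base")  # small model, for illustration
    return model.transcribe(video_path)["text"]

def ocr_frames(video_path: str, every_n_seconds: int = 30) -> list[str]:
    """OCR: read on-screen text (slides, tickers) from sampled frames."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    step = max(1, int(fps * every_n_seconds))
    texts, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            text = pytesseract.image_to_string(rgb).strip()
            if text:
                texts.append(text)
        idx += 1
    cap.release()
    return texts

def build_prompt(asr_text: str, ocr_texts: list[str]) -> str:
    """Combine both textual modalities into one prompt for an LLM."""
    return (
        "Summarize the following lecture/news video.\n\n"
        "Speech transcript (ASR):\n" + asr_text + "\n\n"
        "On-screen text (OCR):\n" + "\n".join(ocr_texts)
    )

# prompt = build_prompt(transcribe_audio("lecture.mp4"),
#                       ocr_frames("lecture.mp4"))
# summary = some_llm(prompt)  # any LLM client; left abstract here
```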
Related papers
- DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering [13.466266412068475]
We introduce the DocVideoQA task and dataset for the first time, comprising 1454 videos across 23 categories with a total duration of about 828 hours.
The dataset is annotated with 154k question-answer pairs generated manually and via GPT, assessing models' comprehension, temporal awareness, and modality integration capabilities.
Our method enhances unimodal feature extraction with diverse instruction-tuning data and employs contrastive learning to strengthen modality integration.
arXiv Detail & Related papers (2025-03-20T06:21:25Z)
- VideoRAG: Retrieval-Augmented Generation over Video Corpus [57.68536380621672]
VideoRAG is a framework that dynamically retrieves videos based on their relevance to queries. It is powered by recent Large Video Language Models (LVLMs). We experimentally validate the effectiveness of VideoRAG, showing that it is superior to relevant baselines. (A minimal retrieval sketch follows this entry.)
arXiv Detail & Related papers (2025-01-10T11:17:15Z)
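The following is a minimal sketch of the retrieval step in a video-RAG pipeline, not the VideoRAG authors' implementation (whose retrieval is powered by LVLMs). It assumes each video has a text surrogate (caption or ASR transcript) and uses sentence-transformers as an illustrative encoder.

```python
# A minimal sketch of retrieval in a video-RAG pipeline, not the VideoRAG
# authors' implementation. Assumes each video has a text surrogate
# (caption or ASR transcript); sentence-transformers is an illustrative
# choice of encoder.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

# Hypothetical corpus: one text surrogate per video.
corpus = {
    "lecture_001.mp4": "Introduction to dynamic programming and memoization.",
    "news_014.mp4": "Coverage of the 2024 climate summit negotiations.",
}
video_ids = list(corpus)
doc_vecs = encoder.encode(list(corpus.values()), normalize_embeddings=True)

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k videos whose surrogates best match the query."""
    q = encoder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q  # cosine similarity (vectors are normalized)
    return [video_ids[i] for i in np.argsort(-scores)[:k]]

# The retrieved videos (frames plus transcripts) would then be passed,
# together with the query, to an LVLM to generate a grounded answer.
print(retrieve("How does memoization speed up recursion?"))
```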
- Towards Long Video Understanding via Fine-detailed Video Story Generation [58.31050916006673]
Long video understanding has become a critical task in computer vision, driving advancements across numerous applications from surveillance to content retrieval. Existing methods suffer from two challenges when dealing with long videos: intricate long-context relationship modeling and interference from redundancy. We introduce Fine-Detailed Video Story generation (FDVS), which interprets long videos into detailed textual representations.
arXiv Detail & Related papers (2024-12-09T03:41:28Z)
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Long Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content and has limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment [42.557643515992005]
Video understanding remains a challenge despite the availability of substantial web video-text data.
We introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding.
arXiv Detail & Related papers (2024-05-22T18:35:10Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that can take longer subtitle passages into account, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption. (A minimal prompting sketch follows this entry.)
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
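Below is a minimal sketch of subtitle-to-caption prompting in the spirit of HowToCaption, not the authors' released code; the chunk size and prompt wording are illustrative assumptions, and the LLM client is left abstract.

```python
# A minimal sketch of subtitle-to-caption prompting in the spirit of
# HowToCaption, not the authors' released code. The chunk size and the
# prompt wording are illustrative assumptions.
PROMPT_TEMPLATE = (
    "The following are noisy ASR subtitles from an instructional video.\n"
    "Rewrite them as short, self-contained captions, one per step shown\n"
    "in the video.\n\n"
    "Subtitles:\n{block}\n\nCaptions:"
)

def chunk_subtitles(lines: list[str], chunk_size: int = 40) -> list[str]:
    """Group subtitle lines into longer blocks so the LLM sees context
    beyond a single sentence."""
    return ["\n".join(lines[i:i + chunk_size])
            for i in range(0, len(lines), chunk_size)]

def caption_prompt(block: str) -> str:
    return PROMPT_TEMPLATE.format(block=block)

# for block in chunk_subtitles(asr_subtitle_lines):
#     captions = some_llm(caption_prompt(block))  # any LLM client
```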
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.