Video LLMs for Temporal Reasoning in Long Videos
- URL: http://arxiv.org/abs/2412.02930v1
- Date: Wed, 04 Dec 2024 00:50:33 GMT
- Title: Video LLMs for Temporal Reasoning in Long Videos
- Authors: Fawad Javed Fateh, Umer Ahmed, Hamza Khan, M. Zeeshan Zia, Quoc-Huy Tran
- Abstract summary: TemporalVLM is a video large language model capable of effective temporal reasoning and fine-grained understanding in long videos.
Our approach includes a visual encoder that maps a long input video into time-aware features containing both local and global cues.
- Score: 7.2900856926028155
- Abstract: This paper introduces TemporalVLM, a video large language model capable of effective temporal reasoning and fine-grained understanding in long videos. At its core, our approach includes a visual encoder that maps a long input video into time-aware features containing both local and global cues. In particular, it first divides the input video into short-term clips, which are jointly encoded with their timestamps into time-sensitive local features. Next, the local features are passed through a bidirectional long short-term memory (BiLSTM) module for global feature aggregation. The extracted time-aware and multi-level features are important for accurate temporal reasoning and fine-grained understanding in long videos. Moreover, to facilitate the evaluation of TemporalVLM, we present IndustryASM, a large-scale dataset of long videos of industrial assembly processes, recorded on factory floors and annotated with actions and timestamps by industrial engineers for time-and-motion studies and temporal action segmentation evaluation. Finally, extensive experiments on long-video datasets, including TimeIT and IndustryASM, show that TemporalVLM outperforms previous methods across temporal reasoning and fine-grained understanding tasks, namely dense video captioning, temporal video grounding, video highlight detection, and temporal action segmentation.
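The encoder description above lends itself to a compact sketch. The following is a hypothetical PyTorch rendering of the pipeline as stated in the abstract (clip-level encoding fused with timestamps, then BiLSTM aggregation); all module choices, names, and dimensions are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class TimeAwareVisualEncoder(nn.Module):
    """Minimal sketch of the encoder described in the abstract:
    (1) split a long video into short-term clips,
    (2) encode each clip jointly with its timestamp into local features,
    (3) aggregate local features globally with a bidirectional LSTM.
    Module choices and dimensions are assumptions, not the authors' code."""

    def __init__(self, feat_dim=768, hidden_dim=512):
        super().__init__()
        # Placeholder per-clip backbone (stands in for a real video/image encoder).
        self.clip_backbone = nn.LazyLinear(feat_dim)
        # Encodes the (start, end) timestamp of each clip.
        self.time_mlp = nn.Sequential(nn.Linear(2, feat_dim), nn.ReLU(),
                                      nn.Linear(feat_dim, feat_dim))
        # Bidirectional LSTM for global aggregation over the clip sequence.
        self.bilstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True,
                              bidirectional=True)

    def forward(self, clips, timestamps):
        # clips:      (B, N_clips, clip_feat_in)  pre-pooled per-clip features
        # timestamps: (B, N_clips, 2)             start/end time of each clip
        local = self.clip_backbone(clips) + self.time_mlp(timestamps)
        global_feats, _ = self.bilstm(local)      # (B, N_clips, 2 * hidden_dim)
        # Both local (time-aware) and global features are returned, so a
        # downstream LLM can attend to fine-grained and video-level cues.
        return local, global_feats

# Example: 64 clips of pre-pooled 1024-d features with (start, end) timestamps.
enc = TimeAwareVisualEncoder()
local, global_feats = enc(torch.randn(1, 64, 1024), torch.rand(1, 64, 2))
```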
Related papers
- Temporal Preference Optimization for Long-Form Video Understanding [28.623353303256653]
Temporal Preference Optimization (TPO) is a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs.
TPO significantly enhances temporal understanding while reducing reliance on manually annotated data.
LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark.
arXiv Detail & Related papers (2025-01-23T18:58:03Z)
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding.
However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details.
We introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z)
- TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability [26.376975842846235]
We introduce TimeMarker, a versatile Video-LLM designed for high-quality dialogue based on video content, emphasizing temporal localization.
TimeMarker integrates Temporal Separator Tokens to enhance temporal awareness, accurately marking specific moments within videos.
It employs the AnyLength mechanism for dynamic frame sampling and adaptive token merging, enabling effective handling of both short and long videos.
arXiv Detail & Related papers (2024-11-27T10:45:40Z)
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Long Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
- Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models [53.235170710385006]
We introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner.
We sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge.
In experiments, Grounded-VideoLLM excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA.
arXiv Detail & Related papers (2024-10-04T10:04:37Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- VTimeLLM: Empower LLM to Grasp Video Moments [43.51980030572101]
Large language models (LLMs) have shown remarkable text understanding capabilities.
However, existing Video LLMs can only provide a coarse description of the entire video.
We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
arXiv Detail & Related papers (2023-11-30T10:49:56Z)
- Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
arXiv Detail & Related papers (2023-09-20T18:13:32Z)
- TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization [52.234877003211814]
We introduce TemporalMaxer, which minimizes long-term temporal context modeling while maximizing information from the extracted video clip features.
We demonstrate that TemporalMaxer outperforms other state-of-the-art methods that utilize long-term temporal context modeling.
arXiv Detail & Related papers (2023-03-16T03:11:26Z)
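TemporalMaxer's central claim above is that parameter-free max pooling over extracted clip features can stand in for heavier long-term temporal context modules. The block below is a minimal, hypothetical sketch of that pooling idea; the class name, kernel size, and tensor layout are assumptions and do not reproduce the paper's implementation.

```python
import torch
import torch.nn as nn

class MaxPoolTemporalBlock(nn.Module):
    """Hypothetical sketch of the TemporalMaxer idea: downsample the temporal
    axis with plain max pooling, keeping peak clip responses instead of
    learning a long-term temporal context model. Settings are assumptions."""

    def __init__(self, kernel_size=3):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size, stride=2, padding=kernel_size // 2)

    def forward(self, x):
        # x: (B, T, C) clip features along the temporal axis T.
        x = x.transpose(1, 2)      # (B, C, T) so pooling runs over time
        x = self.pool(x)           # halve the temporal resolution
        return x.transpose(1, 2)   # back to (B, T', C)

feats = torch.randn(2, 128, 256)       # (batch, time, channels)
out = MaxPoolTemporalBlock()(feats)    # -> shape (2, 64, 256)
```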
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information above and is not responsible for any consequences of its use.