Enhancing Temporal Modeling of Video LLMs via Time Gating
- URL: http://arxiv.org/abs/2410.05714v1
- Date: Tue, 8 Oct 2024 06:21:29 GMT
- Title: Enhancing Temporal Modeling of Video LLMs via Time Gating
- Authors: Zi-Yuan Hu, Yiwu Zhong, Shijia Huang, Michael R. Lyu, Liwei Wang
- Abstract summary: Video Large Language Models (Video LLMs) have achieved impressive performance on video-and-language tasks, such as video question answering.
Most existing Video LLMs neglect temporal information in video data, leading to struggles with temporal-aware video understanding.
We propose a Time Gating Video LLM (TG-Vid) designed to enhance temporal modeling through a novel Time Gating module (TG).
- Score: 38.86742466948778
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Large Language Models (Video LLMs) have achieved impressive performance on video-and-language tasks, such as video question answering. However, most existing Video LLMs neglect temporal information in video data, leading to struggles with temporal-aware video understanding. To address this gap, we propose a Time Gating Video LLM (TG-Vid) designed to enhance temporal modeling through a novel Time Gating module (TG). The TG module employs a time gating mechanism on its sub-modules, comprising gating spatial attention, gating temporal attention, and gating MLP. This architecture enables our model to achieve a robust understanding of temporal information within videos. Extensive evaluation on temporal-sensitive video benchmarks (i.e., MVBench, TempCompass, and NExT-QA) demonstrates that our TG-Vid model significantly outperforms the existing Video LLMs. Further, comprehensive ablation studies validate that the performance gains are attributed to the designs of our TG module. Our code is available at https://github.com/LaVi-Lab/TG-Vid.
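The abstract names three gated sub-modules (spatial attention, temporal attention, MLP) but does not give the gating formula. The following is a minimal NumPy sketch of one plausible reading, not the authors' implementation: it assumes each sub-module output is scaled by an element-wise sigmoid gate computed from the sub-module input and then added back as a residual. All function names, parameter names, and shapes here are hypothetical; the official code is at the repository linked above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def self_attention(x, w_q, w_k, w_v):
    # Single-head attention over the second-to-last axis of x: (..., L, D).
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(x.shape[-1])
    return softmax(scores, axis=-1) @ v

def gated(x, sub_out, w_gate):
    # ASSUMPTION: gate = sigmoid(linear(input)), applied element-wise to the
    # sub-module output before the residual connection.
    return x + sigmoid(x @ w_gate) * sub_out

def time_gating_block(x, params):
    # x: (T frames, N tokens per frame, D channels).
    # 1) Gated spatial attention: tokens attend within their own frame.
    spatial = self_attention(x, *params["spatial"])
    x = gated(x, spatial, params["gate_spatial"])
    # 2) Gated temporal attention: each token position attends across frames.
    xt = np.swapaxes(x, 0, 1)  # (N, T, D)
    temporal = np.swapaxes(self_attention(xt, *params["temporal"]), 0, 1)
    x = gated(x, temporal, params["gate_temporal"])
    # 3) Gated MLP with a ReLU hidden layer.
    w1, w2 = params["mlp"]
    mlp_out = np.maximum(x @ w1, 0.0) @ w2
    return gated(x, mlp_out, params["gate_mlp"])

def init_params(dim, hidden, seed=0):
    # Small random weights, for illustration only.
    rng = np.random.default_rng(seed)
    def mat(rows, cols):
        return rng.normal(0.0, 0.02, size=(rows, cols))
    return {
        "spatial": [mat(dim, dim) for _ in range(3)],
        "temporal": [mat(dim, dim) for _ in range(3)],
        "mlp": (mat(dim, hidden), mat(hidden, dim)),
        "gate_spatial": mat(dim, dim),
        "gate_temporal": mat(dim, dim),
        "gate_mlp": mat(dim, dim),
    }
```

For example, calling `time_gating_block(x, init_params(dim=16, hidden=32))` on an array `x` of shape `(4, 9, 16)` returns an array of the same shape. Under the assumed gate, a sub-module whose output is zero leaves its input unchanged, so each gate controls how much spatial, temporal, or MLP information is mixed into the visual tokens.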
Related papers
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLM) via long and rich context (LRC) modeling.
We develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos.
Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLM in mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z)
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding.
However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details.
We introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z)
- Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries [50.47265863322891]
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos.
Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities.
We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-26T17:53:14Z)
- ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos [25.988212332357545]
ReVisionLLM is a vision-language model designed to locate events in hour-long videos.
Inspired by human search strategies, our model initially targets broad segments of interest.
Our model can seamlessly handle videos of vastly different lengths, from minutes to hours.
arXiv Detail & Related papers (2024-11-22T12:46:50Z)
- TimeSuite: Improving MLLMs for Long Video Understanding via Grounded Tuning [42.928144657587325]
This paper proposes TimeSuite, a collection of new designs to adapt the existing short-form video MLLMs for long video understanding.
TimeSuite provides a successful solution to enhance the long video understanding capability of short-form MLLM.
In addition, we introduce TimePro, a comprehensive grounding-centric instruction dataset composed of 9 tasks and 349k high-quality grounded annotations.
arXiv Detail & Related papers (2024-10-25T17:19:55Z)
- VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding [10.548950058205833]
Video Temporal Grounding (VTG) strives to accurately pinpoint event timestamps in a specific video using linguistic queries.
Video Large Language Models (video LLMs) can handle multiple tasks concurrently in a zero-shot manner.
We introduce VTG-LLM, a model designed to enhance video LLMs' timestamp localization abilities.
arXiv Detail & Related papers (2024-05-22T06:31:42Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- TempCompass: Do Video LLMs Really Understand Videos? [36.28973015469766]
Existing benchmarks fail to provide comprehensive feedback on the temporal perception ability of Video LLMs.
We propose the TempCompass benchmark, which introduces a diversity of high-quality temporal aspects and task formats.
arXiv Detail & Related papers (2024-03-01T12:02:19Z)
- VTimeLLM: Empower LLM to Grasp Video Moments [43.51980030572101]
Large language models (LLMs) have shown remarkable text understanding capabilities.
Video LLMs can only provide a coarse description of the entire video.
We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
arXiv Detail & Related papers (2023-11-30T10:49:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.