HawkEye: Training Video-Text LLMs for Grounding Text in Videos
- URL: http://arxiv.org/abs/2403.10228v1
- Date: Fri, 15 Mar 2024 11:58:18 GMT
- Title: HawkEye: Training Video-Text LLMs for Grounding Text in Videos
- Authors: Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, Dongyan Zhao,
- Abstract summary: We propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner.
To collect training data that is applicable for temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans.
We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than other alternatives.
- Score: 44.870165050047355
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video-text Large Language Models (video-text LLMs) have shown remarkable performance in answering questions and holding conversations on simple videos. However, they perform almost the same as random on grounding text queries in long and complicated videos, having little ability to understand and reason about temporal information, which is the most fundamental difference between videos and images. In this paper, we propose HawkEye, one of the first video-text LLMs that can perform temporal video grounding in a fully text-to-text manner. To collect training data that is applicable for temporal video grounding, we construct InternVid-G, a large-scale video-text corpus with segment-level captions and negative spans, with which we introduce two new time-aware training objectives to video-text LLMs. We also propose a coarse-grained method of representing segments in videos, which is more robust and easier for LLMs to learn and follow than other alternatives. Extensive experiments show that HawkEye is better at temporal video grounding and comparable on other video-text tasks with existing video-text LLMs, which verifies its superior video-text multi-modal understanding abilities.
Related papers
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining [86.76706820098867]
We introduce a high-quality textbfmultimodal textbook corpus with richer foundational knowledge for VLM pretraining.
It collects over 2.5 years of instructional videos, totaling 22,000 class hours.
Compared to its counterparts, our video-centric textbook offers more coherent context, richer knowledge, and better image-text alignment.
arXiv Detail & Related papers (2025-01-01T21:29:37Z) - SCBench: A Sports Commentary Benchmark for Video LLMs [19.13963551534595]
We develop a benchmark for sports video commentary generation for Video Large Language Models (Video LLMs)
$textbfSCBench$ is a six-dimensional metric specifically designed for our task, upon which we propose a GPT-based evaluation method.
Our results found InternVL-Chat-2 achieves the best performance with 5.44, surpassing the second-best by 1.04.
arXiv Detail & Related papers (2024-12-23T15:13:56Z) - TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment [42.557643515992005]
Video understanding remains a challenge despite the availability of substantial web video-text data.
We introduce Text-Only Pre-Alignment (TOPA), a novel approach to extend large language models (LLMs) for video understanding.
arXiv Detail & Related papers (2024-05-22T18:35:10Z) - VTimeLLM: Empower LLM to Grasp Video Moments [43.51980030572101]
Large language models (LLMs) have shown remarkable text understanding capabilities.
Video LLMs can only provide a coarse description of the entire video.
We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
arXiv Detail & Related papers (2023-11-30T10:49:56Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
textbfVidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that is able to take into account a longer text of subtitles, allowing us to capture the contextual information beyond one single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z) - InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
and Generation [90.71796406228265]
InternVid is a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations.
The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions of total 4.1B words.
arXiv Detail & Related papers (2023-07-13T17:58:32Z) - VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality and Semantic Translator, which convert inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z) - Learning Transferable Spatiotemporal Representations from Natural Script
Knowledge [65.40899722211726]
We introduce a new pretext task, Turning to Video Transcript for ASR (TVTS), which sorts scripts by attending to learned video representations.
The advantages enable our model to contextualize what is happening like human beings and seamlessly apply to large-scale uncurated video data in the real world.
arXiv Detail & Related papers (2022-09-30T07:39:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.