Number it: Temporal Grounding Videos like Flipping Manga
- URL: http://arxiv.org/abs/2411.10332v1
- Date: Fri, 15 Nov 2024 16:32:34 GMT
- Title: Number it: Temporal Grounding Videos like Flipping Manga
- Authors: Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, Xu Yang
- Abstract summary: Number-Prompt (NumPro) is a method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding.
Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence.
Experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost.
- Score: 45.50403831692172
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Large Language Models (Vid-LLMs) have made remarkable advancements in comprehending video content for QA dialogue. However, they struggle to extend this visual understanding to tasks requiring precise temporal localization, known as Video Temporal Grounding (VTG). To address this gap, we introduce Number-Prompt (NumPro), a novel method that empowers Vid-LLMs to bridge visual comprehension with temporal grounding by adding unique numerical identifiers to each video frame. Treating a video as a sequence of numbered frame images, NumPro transforms VTG into an intuitive process: flipping through manga panels in sequence. This allows Vid-LLMs to "read" event timelines, accurately linking visual content with corresponding temporal information. Our experiments demonstrate that NumPro significantly boosts VTG performance of top-tier Vid-LLMs without additional computational cost. Furthermore, fine-tuning on a NumPro-enhanced dataset defines a new state-of-the-art for VTG, surpassing previous top-performing methods by up to 6.9% in mIoU for moment retrieval and 8.5% in mAP for highlight detection. The code will be available at https://github.com/yongliang-wu/NumPro.
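The core mechanism described in the abstract, stamping a unique number onto each video frame before the Vid-LLM sees it, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical rendering step using PIL; the function name, font, color, and corner placement are assumptions for illustration, not the authors' exact configuration, which is documented at https://github.com/yongliang-wu/NumPro.

```python
# Minimal sketch: overlay a unique frame index on each sampled frame so a
# Vid-LLM can "read" the timeline, as described in the NumPro abstract.
# Rendering choices (font, color, position) are illustrative assumptions.
from PIL import Image, ImageDraw, ImageFont


def number_frames(frames, font_path=None, font_size=40):
    """Return copies of the input PIL frames with their index drawn on them."""
    font = (ImageFont.truetype(font_path, font_size)
            if font_path else ImageFont.load_default())
    numbered = []
    for idx, frame in enumerate(frames):
        frame = frame.copy()
        draw = ImageDraw.Draw(frame)
        w, h = frame.size
        # Bottom-right placement is an arbitrary choice for this sketch.
        draw.text((w - 80, h - 60), str(idx), fill=(255, 0, 0), font=font)
        numbered.append(frame)
    return numbered
```

The numbered frames, paired with a grounding query such as "Between which frame numbers does the event occur?", would then be fed to the Vid-LLM, and an answer given in frame numbers can be mapped back to timestamps via the frame sampling rate.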
Related papers
- TimeExpert: An Expert-Guided Video LLM for Video Temporal Grounding [83.96715649130435]
We introduce TimeExpert, a Mixture-of-Experts (MoE)-based Video-LLM that effectively decomposes VTG tasks.
Our design choices enable precise handling of each subtask, leading to improved event modeling across diverse VTG applications.
arXiv Detail & Related papers (2025-08-03T10:03:58Z) - Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z) - PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance [44.08446730529495]
We propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation.
Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short.
arXiv Detail & Related papers (2024-11-04T17:50:36Z) - Open-Vocabulary Action Localization with Iterative Visual Prompting [8.07285448283823]
Video action localization aims to find the timings of specific actions from a long video.
This paper proposes a training-free, open-vocabulary approach based on emerging vision-language models (VLMs).
We extend an iterative visual prompting technique to identify the frames that most likely correspond to the start and end of the action.
arXiv Detail & Related papers (2024-08-30T17:12:14Z) - VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding [7.907951246007355]
Video Temporal Grounding (VTG) focuses on accurately identifying event timestamps within a particular video based on a linguistic query.
Video Large Language Models (video LLMs) have made significant progress in understanding video content, but they often face challenges in accurately pinpointing timestamps within videos.
We propose a specially designed video LLM model for VTG tasks, VTG-LLM, which effectively integrates timestamp knowledge into visual tokens.
arXiv Detail & Related papers (2024-05-22T06:31:42Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Spatio-temporal Prompting Network for Robust Video Feature Extraction [74.54597668310707]
Frame quality deterioration is one of the main challenges in the field of video understanding.
Recent approaches exploit transformer-based integration modules to obtain spatio-temporal information.
We present a neat and unified framework called Spatio-Temporal Prompting Network (STPN).
It can efficiently extract video features by adjusting the input features in the network backbone.
arXiv Detail & Related papers (2024-02-04T17:52:04Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - ControlVideo: Conditional Control for One-shot Text-driven Video Editing and Beyond [45.188722895165505]
ControlVideo generates a video that aligns with a given text while preserving the structure of the source video.
Building on a pre-trained text-to-image diffusion model, ControlVideo enhances the fidelity and temporal consistency.
arXiv Detail & Related papers (2023-05-26T17:13:55Z) - VicTR: Video-conditioned Text Representations for Activity Recognition [73.09929391614266]
We argue that better video-VLMs can be designed by focusing more on augmenting text, rather than visual information.
We introduce Video-conditioned Text Representations (VicTR), a form of text embeddings optimized w.r.t. visual embeddings.
Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text.
arXiv Detail & Related papers (2023-04-05T16:30:36Z) - VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.