TRACE: Temporal Grounding Video LLM via Causal Event Modeling
- URL: http://arxiv.org/abs/2410.05643v2
- Date: Mon, 4 Nov 2024 08:58:14 GMT
- Title: TRACE: Temporal Grounding Video LLM via Causal Event Modeling
- Authors: Yongxin Guo, Jingyu Liu, Mingda Li, Xiaoying Tang, Qingbin Liu, Xi Chen
- Abstract summary: Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing.
Current video LLMs rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos.
This paper introduces a causal event modeling framework, which represents videos as sequences of events and predicts the current event using previous events, video inputs, and textual instructions.
We propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice.
- Score: 6.596327795743185
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Video Temporal Grounding (VTG) is a crucial capability for video understanding models and plays a vital role in downstream tasks such as video browsing and editing. To effectively handle various tasks simultaneously and enable zero-shot prediction, there is a growing trend in employing video LLMs for VTG tasks. However, current video LLM-based methods rely exclusively on natural language generation, lacking the ability to model the clear structure inherent in videos, which restricts their effectiveness in tackling VTG tasks. To address this issue, this paper first formally introduces the causal event modeling framework, which represents videos as sequences of events and predicts the current event using previous events, video inputs, and textual instructions. Each event consists of three components: timestamps, salient scores, and textual captions. We then propose a novel task-interleaved video LLM called TRACE to effectively implement the causal event modeling framework in practice. TRACE processes visual frames, timestamps, salient scores, and text as distinct tasks, employing various encoders and decoding heads for each. Task tokens are arranged in an interleaved sequence according to the causal event modeling framework's formulation. Extensive experiments on various VTG tasks and datasets demonstrate the superior performance of TRACE compared to state-of-the-art video LLMs. Our model and code are available at https://github.com/gyxxyg/TRACE.
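A compact way to write the causal event factorization described above (the notation is ours, introduced only for illustration: F denotes the visual frames, I the textual instruction, and e_k the k-th event):

```latex
% Sketch of the causal event factorization from the abstract; notation is ours.
e_k = (t_k, s_k, c_k) \qquad \text{(timestamps, salient score, textual caption)}
P(e_{1:K} \mid F, I) = \prod_{k=1}^{K} P\left(e_k \mid e_{1:k-1}, F, I\right)
```

A minimal Python sketch of how per-event task tokens could be interleaved under this formulation follows; the class, function names, and ordering are assumptions for illustration, not the released TRACE implementation:

```python
# Hypothetical sketch of an interleaved task-token layout for causal event modeling.
# Each event contributes a time segment, a salient score, and a caption, handled by
# separate heads; names and ordering are assumptions, not the official TRACE code.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    timestamps: List[float]  # e.g. [start_sec, end_sec]
    salient_score: float
    caption: str


def interleave_events(events: List[Event]) -> List[Tuple[str, object]]:
    """Arrange per-event task tokens as (task, payload) pairs in causal order."""
    sequence: List[Tuple[str, object]] = []
    for ev in events:
        sequence.append(("time", ev.timestamps))      # decoded by a time head
        sequence.append(("score", ev.salient_score))  # decoded by a score head
        sequence.append(("text", ev.caption))         # decoded by the language head
    return sequence


if __name__ == "__main__":
    events = [
        Event([0.0, 4.2], 0.8, "a person opens the door"),
        Event([4.2, 9.5], 0.3, "the camera pans across the room"),
    ]
    print(interleave_events(events))
```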
Related papers
- Training-free Video Temporal Grounding using Large-scale Pre-trained Models [41.71055776623368]
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query.
Existing video temporal localization models rely on specific datasets for training and have high data collection costs.
We propose a Training-Free Video Temporal Grounding approach that leverages the ability of pre-trained large models.
arXiv Detail & Related papers (2024-08-29T02:25:12Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
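As a rough illustration of the Hungarian-matching step mentioned above, the sketch below assigns decomposed instruction clauses to candidate video events; the cost matrix is random placeholder data, not the similarity that VidCoM/InsOVER actually computes:

```python
# Minimal Hungarian-matching sketch: pair instruction clauses with video events.
# The cost matrix is random stand-in data for illustration only.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
num_clauses, num_events = 3, 5
cost = rng.random((num_clauses, num_events))  # lower cost = better clause/event match

row_ind, col_ind = linear_sum_assignment(cost)
for clause_idx, event_idx in zip(row_ind, col_ind):
    print(f"instruction clause {clause_idx} -> video event {event_idx}")
```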
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - VideoDirectorGPT: Consistent Multi-scene Video Generation via LLM-Guided Planning [62.51232333352754]
VideoDirectorGPT is a novel framework for consistent multi-scene video generation.
Our proposed framework substantially improves layout and movement control in both single- and multi-scene video generation.
arXiv Detail & Related papers (2023-09-26T17:36:26Z) - Multi-event Video-Text Retrieval [33.470499262092105]
Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet.
We introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events.
We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task.
arXiv Detail & Related papers (2023-08-22T16:32:46Z) - VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality and Semantic Translator, which converts inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z) - LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling [102.42424022921243]
Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks.
Experiments show that this unified framework achieves competitive performance on 14 VidL benchmarks.
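For a concrete, if generic, picture of masked language modeling as a single training interface, here is a minimal token-masking sketch; it is a textbook MLM step, not LAVENDER's exact pre-processing:

```python
# Generic MLM masking step: hide a fraction of tokens and keep them as labels.
# Simplified illustration only; not LAVENDER's actual recipe.
import random

MASK_TOKEN, MASK_PROB = "[MASK]", 0.15


def mask_tokens(tokens, seed=0):
    """Return (masked_tokens, labels); labels hold the original token where masked."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            masked.append(MASK_TOKEN)
            labels.append(tok)  # the model learns to recover this token
        else:
            masked.append(tok)
            labels.append(None)
    return masked, labels


print(mask_tokens("a man is playing guitar on stage".split()))
```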
arXiv Detail & Related papers (2022-06-14T20:43:25Z) - VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling [88.30109041658618]
A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
arXiv Detail & Related papers (2021-11-24T18:31:20Z) - UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components, including two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone.
We develop two pre-training strategies, stage by stage pre-training (StagedP) and enhanced video representation (EnhancedV) to make the training process of the UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)