TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
- URL: http://arxiv.org/abs/2411.18211v1
- Date: Wed, 27 Nov 2024 10:45:40 GMT
- Title: TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
- Authors: Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma
- Abstract summary: We introduce TimeMarker, a versatile Video-LLM designed for high-quality dialogue based on video content, emphasizing temporal localization.
TimeMarker integrates Temporal Separator Tokens to enhance temporal awareness, accurately marking specific moments within videos.
It employs the AnyLength mechanism for dynamic frame sampling and adaptive token merging, enabling effective handling of both short and long videos.
- Score: 26.376975842846235
- Abstract: The rapid development of large language models (LLMs) has significantly advanced multimodal large language models (LMMs), particularly in vision-language tasks. However, existing video-language models often overlook precise temporal localization and struggle with videos of varying lengths. We introduce TimeMarker, a versatile Video-LLM designed for high-quality dialogue based on video content, emphasizing temporal localization. TimeMarker integrates Temporal Separator Tokens to enhance temporal awareness, accurately marking specific moments within videos. It employs the AnyLength mechanism for dynamic frame sampling and adaptive token merging, enabling effective handling of both short and long videos. Additionally, TimeMarker utilizes diverse datasets, including further transformed temporal-related video QA datasets, to bolster its temporal understanding capabilities. Image and interleaved data are also employed to further enhance the model's semantic perception ability. Evaluations demonstrate that TimeMarker achieves state-of-the-art performance across multiple benchmarks, excelling in both short and long video categories. Our project page is at https://github.com/TimeMarker-LLM/TimeMarker/.
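To make the two mechanisms above more concrete, below is a minimal sketch of how textual temporal separator tokens might be interleaved with per-frame visual tokens, and how an AnyLength-style policy could trade frame count against per-frame tokens based on video duration. The function names, token format, and budgets are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of TimeMarker-style temporal separator tokens and an
# AnyLength-style sampling/merging policy. Token formats, budgets, and names
# are assumptions for illustration, not the paper's actual implementation.

def anylength_plan(duration_s: float,
                   max_total_tokens: int = 4096,
                   max_fps: float = 2.0,
                   min_frames: int = 8,
                   max_frames: int = 128) -> tuple[int, int]:
    """Choose (num_frames, tokens_per_frame) so that short videos get dense
    sampling with many tokens per frame, while long videos are sampled more
    sparsely and each frame is merged into fewer tokens."""
    num_frames = int(min(max_frames, max(min_frames, duration_s * max_fps)))
    tokens_per_frame = max(16, max_total_tokens // num_frames)
    return num_frames, tokens_per_frame


def interleave_with_separators(frame_tokens: list[list[str]],
                               duration_s: float) -> list[str]:
    """Insert a textual separator token before each frame's visual tokens,
    marking the (approximate) timestamp that frame was sampled from."""
    num_frames = len(frame_tokens)
    sequence: list[str] = []
    for i, tokens in enumerate(frame_tokens):
        t = duration_s * i / max(num_frames - 1, 1)   # frame timestamp
        sequence.append(f"<sep_{t:.1f}s>")            # hypothetical separator token
        sequence.extend(tokens)
    return sequence


if __name__ == "__main__":
    duration = 95.0                                   # a ~1.5 minute clip
    n_frames, toks_per_frame = anylength_plan(duration)
    # Placeholder "visual tokens"; a real model would produce ViT patch tokens.
    frames = [[f"<v{i}_{j}>" for j in range(toks_per_frame)] for i in range(n_frames)]
    seq = interleave_with_separators(frames, duration)
    print(n_frames, toks_per_frame, seq[:6])
```

The point of the sketch is that short videos keep dense sampling and richer per-frame tokens, long videos are sampled sparsely with more aggressive token merging, and every frame is explicitly anchored to a timestamp the LLM can refer back to.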
Related papers
- Fine-grained Video-Text Retrieval: A New Benchmark and Method [25.2967056489715]
We present FIBER, a FIne-grained BEnchmark for text to video Retrieval, containing 1,000 videos sourced from the FineAction dataset.
Uniquely, our FIBER benchmark provides detailed human-annotated spatial and temporal annotations for each video.
Experimental results show that our Video Large Language Encoder (VLLE) performs comparably to CLIP-based models on traditional benchmarks.
arXiv Detail & Related papers (2024-12-31T15:53:50Z)
- Video LLMs for Temporal Reasoning in Long Videos [7.2900856926028155]
TemporalVLM is a video large language model capable of effective temporal reasoning and fine-grained understanding in long videos.
Our approach includes a visual encoder for mapping a long-term input video into features which are time-aware and contain both local and global cues.
arXiv Detail & Related papers (2024-12-04T00:50:33Z)
- ReVisionLLM: Recursive Vision-Language Model for Temporal Grounding in Hour-Long Videos [25.988212332357545]
ReVisionLLM is a vision-language model designed to locate events in hour-long videos.
Inspired by human search strategies, our model initially targets broad segments of interest.
Our model can seamlessly handle videos of vastly different lengths, from minutes to hours.
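As an illustration of this coarse-to-fine strategy, the sketch below recursively scores broad segments with a placeholder relevance function and zooms into the most promising one; the split factor, stopping length, and scorer are assumptions, not ReVisionLLM's actual procedure.

```python
# Minimal sketch of a recursive coarse-to-fine temporal search, in the spirit
# of first targeting broad segments of interest and then zooming in.
# `score_segment` stands in for a vision-language relevance model.

from typing import Callable, Tuple

def recursive_locate(start: float, end: float,
                     score_segment: Callable[[float, float], float],
                     num_splits: int = 4,
                     min_len_s: float = 10.0) -> Tuple[float, float]:
    """Return a (start, end) window likely to contain the queried event."""
    if end - start <= min_len_s:
        return start, end
    step = (end - start) / num_splits
    candidates = [(start + i * step, start + (i + 1) * step) for i in range(num_splits)]
    best = max(candidates, key=lambda seg: score_segment(*seg))
    return recursive_locate(best[0], best[1], score_segment, num_splits, min_len_s)

if __name__ == "__main__":
    # Toy scorer that peaks around t = 1800 s (minute 30 of an hour-long video).
    toy_score = lambda s, e: -abs((s + e) / 2 - 1800.0)
    print(recursive_locate(0.0, 3600.0, toy_score))
```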
arXiv Detail & Related papers (2024-11-22T12:46:50Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the query sentence typically matches only a fraction of the prominent foreground video content and has limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- LITA: Language Instructed Temporal-Localization Assistant [71.68815100776278]
We introduce time tokens that encode timestamps relative to the video length to better represent time in videos.
We also introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution.
We show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs.
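A minimal sketch of the relative time-token idea follows: a timestamp is quantized by the video duration into one of a fixed number of discrete time tokens. The vocabulary size and token format here are illustrative assumptions rather than LITA's actual design.

```python
# Sketch of relative time tokens: a timestamp is encoded relative to the video
# length by quantizing it into one of NUM_TIME_TOKENS discrete tokens.
# The token count and format here are illustrative assumptions.

NUM_TIME_TOKENS = 100   # hypothetical time-token vocabulary size

def timestamp_to_token(t_s: float, duration_s: float) -> str:
    """Map an absolute timestamp to a discrete relative time token."""
    bin_idx = min(NUM_TIME_TOKENS - 1, int(NUM_TIME_TOKENS * t_s / duration_s))
    return f"<time_{bin_idx}>"

def token_to_timestamp(token: str, duration_s: float) -> float:
    """Decode a time token back to the (approximate) centre of its bin."""
    bin_idx = int(token.strip("<>").split("_")[1])
    return (bin_idx + 0.5) * duration_s / NUM_TIME_TOKENS

print(timestamp_to_token(42.0, 300.0))                    # -> <time_14>
print(round(token_to_timestamp("<time_14>", 300.0), 1))   # -> 43.5
```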
arXiv Detail & Related papers (2024-03-27T22:50:48Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is especially challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding [20.037781644877388]
TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding.
Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths.
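The sketch below illustrates the general idea of a timestamp-aware frame encoder by pairing each frame's features with an encoding of a timestamp string; the string template, the stand-in text encoder, and fusion by concatenation are assumptions for illustration, not TimeChat's architecture.

```python
# Sketch of binding frame features with their timestamps: each frame's visual
# feature vector is paired with an encoding of a timestamp string such as
# "This frame is sampled at 12.0s". The template and fusion-by-concatenation
# are illustrative assumptions.

import math

def encode_text(text: str, dim: int = 8) -> list[float]:
    """Stand-in text encoder (hash-based); a real model would use its LLM/Q-Former."""
    return [math.sin(hash(text) % 1000 + i) for i in range(dim)]

def timestamp_aware_encode(frame_features: list[list[float]],
                           timestamps: list[float]) -> list[list[float]]:
    """Concatenate each frame feature with an embedding of its timestamp text."""
    fused = []
    for feat, t in zip(frame_features, timestamps):
        time_emb = encode_text(f"This frame is sampled at {t:.1f}s")
        fused.append(feat + time_emb)    # simple concatenation as fusion
    return fused

frames = [[0.1] * 16, [0.2] * 16]        # toy visual features
print(len(timestamp_aware_encode(frames, [0.0, 2.0])[0]))   # 16 + 8 = 24
```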
arXiv Detail & Related papers (2023-12-04T17:09:52Z)
- Self-Chained Image-Language Model for Video Localization and Question Answering [66.86740990630433]
We propose the Self-Chained Video Localization-Answering (SeViLA) framework to tackle both temporal localization and QA on videos.
SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2.
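A high-level sketch of such a localize-then-answer pipeline is shown below; the Localizer and Answerer are represented by placeholder callables rather than fine-tuned BLIP-2 modules, and the top-k keyframe selection is an illustrative assumption.

```python
# High-level sketch of a localize-then-answer pipeline: a Localizer picks
# question-relevant keyframes, and an Answerer reasons only over those frames.
# Both callables below are placeholders, not actual BLIP-2 modules.

from typing import Any, Callable, List

def answer_video_question(frames: List[Any],
                          question: str,
                          localizer: Callable[[Any, str], float],
                          answerer: Callable[[List[Any], str], str],
                          top_k: int = 4) -> str:
    # 1) Localizer: score every frame for relevance to the question.
    scored = sorted(frames, key=lambda f: localizer(f, question), reverse=True)
    keyframes = scored[:top_k]
    # 2) Answerer: answer using only the selected keyframes.
    return answerer(keyframes, question)

# Toy usage with dummy modules.
frames = list(range(32))                          # frame indices as stand-ins
loc = lambda f, q: -abs(f - 20)                   # pretend frame 20 is most relevant
ans = lambda ks, q: f"answered '{q}' from frames {sorted(ks)}"
print(answer_video_question(frames, "What happens after the goal?", loc, ans))
```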
arXiv Detail & Related papers (2023-05-11T17:23:00Z)
- TemporalMaxer: Maximize Temporal Context with only Max Pooling for Temporal Action Localization [52.234877003211814]
We introduce TemporalMaxer, which minimizes long-term temporal context modeling while maximizing information from the extracted video clip features.
We demonstrate that TemporalMaxer outperforms other state-of-the-art methods that utilize long-term temporal context modeling.
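The sketch below illustrates the core idea of replacing long-term temporal context modeling with simple local max pooling over extracted clip features; the window size and feature shapes are illustrative.

```python
# Minimal sketch of the "only max pooling" idea: aggregate temporal context
# from extracted clip features with local max pooling instead of a long-term
# temporal model. Window size and feature shapes are illustrative.

def temporal_max_pool(clip_features: list[list[float]], window: int = 3) -> list[list[float]]:
    """Max-pool each feature dimension over a sliding temporal window."""
    T = len(clip_features)
    half = window // 2
    pooled = []
    for t in range(T):
        lo, hi = max(0, t - half), min(T, t + half + 1)
        pooled.append([max(clip_features[u][d] for u in range(lo, hi))
                       for d in range(len(clip_features[t]))])
    return pooled

# Toy usage: 5 clips with 2-dimensional features.
feats = [[0.1, 0.9], [0.4, 0.2], [0.8, 0.1], [0.3, 0.5], [0.2, 0.7]]
print(temporal_max_pool(feats))
```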
arXiv Detail & Related papers (2023-03-16T03:11:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.