TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
- URL: http://arxiv.org/abs/2411.18211v1
- Date: Wed, 27 Nov 2024 10:45:40 GMT
- Title: TimeMarker: A Versatile Video-LLM for Long and Short Video Understanding with Superior Temporal Localization Ability
- Authors: Shimin Chen, Xiaohan Lan, Yitian Yuan, Zequn Jie, Lin Ma
- Abstract summary: We introduce TimeMarker, a versatile Video-LLM designed for high-quality dialogue based on video content, emphasizing temporal localization. TimeMarker integrates Temporal Separator Tokens to enhance temporal awareness, accurately marking specific moments within videos. It employs the AnyLength mechanism for dynamic frame sampling and adaptive token merging, enabling effective handling of both short and long videos.
- Score: 26.376975842846235
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rapid development of large language models (LLMs) has significantly advanced multimodal large language models (LMMs), particularly in vision-language tasks. However, existing video-language models often overlook precise temporal localization and struggle with videos of varying lengths. We introduce TimeMarker, a versatile Video-LLM designed for high-quality dialogue based on video content, emphasizing temporal localization. TimeMarker integrates Temporal Separator Tokens to enhance temporal awareness, accurately marking specific moments within videos. It employs the AnyLength mechanism for dynamic frame sampling and adaptive token merging, enabling effective handling of both short and long videos. Additionally, TimeMarker utilizes diverse datasets, including further transformed temporal-related video QA datasets, to bolster its temporal understanding capabilities. Image and interleaved data are also employed to further enhance the model's semantic perception ability. Evaluations demonstrate that TimeMarker achieves state-of-the-art performance across multiple benchmarks, excelling in both short and long video categories. Our project page is at https://github.com/TimeMarker-LLM/TimeMarker/.
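The abstract names two mechanisms, Temporal Separator Tokens and AnyLength, without implementation details. The Python sketch below only illustrates the general idea: interleave timestamp markers with per-frame visual tokens and adapt the sampled frame count to the video length. All function names, the frame budget, and the token formats are assumptions, not TimeMarker's actual design.

```python
# Illustrative sketch only: interleave textual timestamp markers with per-frame
# visual-token placeholders, and adapt the number of sampled frames to the video
# length so long videos stay inside a fixed budget. Names and numbers are
# assumptions, not TimeMarker's actual implementation.

def sample_frame_times(duration_s: float, max_frames: int = 128, base_fps: float = 2.0):
    """Sample at base_fps for short clips, stretching the interval for long videos."""
    n_dense = max(int(duration_s * base_fps), 1)
    n = min(n_dense, max_frames)                     # AnyLength-style adaptive frame count
    step = duration_s / n
    return [round(i * step, 2) for i in range(n)]

def build_interleaved_sequence(frame_times, tokens_per_frame: int = 16):
    """Prefix each frame's visual tokens with a temporal separator marker so the
    LLM can ground its answers to absolute seconds."""
    seq = []
    for t in frame_times:
        seq.append(f"<time:{t:.2f}s>")               # temporal separator token
        seq.extend(f"<frame_tok_{i}>" for i in range(tokens_per_frame))
    return seq

if __name__ == "__main__":
    times = sample_frame_times(duration_s=600.0)     # a 10-minute video
    seq = build_interleaved_sequence(times, tokens_per_frame=4)
    print(len(times), "frames,", len(seq), "tokens; head:", seq[:6])
```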
Related papers
- VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding [48.745013691038295]
VideoExpert is a general-purpose MLLM suitable for several temporal-sensitive video tasks.
The Temporal Expert is responsible for modeling time sequences and performing temporal grounding.
The Spatial Expert focuses on content detail analysis and instruction following.
By offloading temporal grounding from content generation, VideoExpert prevents text pattern biases in timestamp predictions.
arXiv Detail & Related papers (2025-04-10T07:33:39Z)
- Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs.
We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
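The STORM summary only states that a dedicated temporal encoder sits between the image encoder and the Video-LLM. A minimal PyTorch sketch of such a module might look as follows; the use of plain self-attention, the dimensions, and the layer counts are assumptions, not details from the paper.

```python
# Hypothetical sketch of a temporal encoder inserted between an image encoder
# and a Video-LLM. Shapes, layer sizes, and the use of vanilla self-attention
# are illustrative assumptions, not taken from the STORM paper.
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    def __init__(self, dim: int = 1024, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) pooled per-frame image features.
        # Self-attention along the frame axis injects temporal context before
        # the tokens are passed on to the language model.
        return self.encoder(frame_feats)

feats = torch.randn(1, 32, 1024)           # 32 frames from the image encoder
print(TemporalEncoder()(feats).shape)      # torch.Size([1, 32, 1024])
```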
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
- Video LLMs for Temporal Reasoning in Long Videos [7.2900856926028155]
TemporalVLM is a video large language model capable of effective temporal reasoning and fine-grained understanding in long videos.
Our approach includes a visual encoder for mapping a long-term input video into features which are time-aware and contain both local and global cues.
To facilitate the evaluation of TemporalVLM, we present a large-scale long video dataset of industry assembly processes, namely IndustryASM.
arXiv Detail & Related papers (2024-12-04T00:50:33Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., the sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of visual information remaining unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and boosting the temporal localization.
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA [40.21221568678641]
Long-form videos that span wide temporal intervals are highly redundant in information.
All information necessary to generate a correct response can often be contained within a small subset of frames.
Recent literature explores the use of large language models in LVQA benchmarks, achieving exceptional performance.
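The observation that a small subset of frames often carries all the needed evidence suggests query-aware frame selection. The sketch below scores precomputed frame embeddings against a question embedding and keeps the top-k frames; the embedding source and the value of k are assumptions, and the paper's actual strategy may differ.

```python
# Illustrative query-aware frame selection: keep only the k frames whose
# (precomputed, L2-normalized) embeddings best match the question embedding.
# The embedding model and k are assumptions, not the paper's recipe.
import numpy as np

def select_frames(frame_embs: np.ndarray, query_emb: np.ndarray, k: int = 8) -> np.ndarray:
    scores = frame_embs @ query_emb           # cosine similarity per frame
    keep = np.argsort(-scores)[:k]            # indices of the k highest-scoring frames
    return np.sort(keep)                      # restore temporal order for the LLM

rng = np.random.default_rng(0)
frames = rng.normal(size=(512, 256))
frames /= np.linalg.norm(frames, axis=1, keepdims=True)
query = rng.normal(size=256)
query /= np.linalg.norm(query)
print(select_frames(frames, query, k=8))      # 8 frame indices in temporal order
```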
arXiv Detail & Related papers (2024-06-13T17:59:16Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- LITA: Language Instructed Temporal-Localization Assistant [71.68815100776278]
We introduce time tokens that encode timestamps relative to the video length to better represent time in videos.
We also introduce SlowFast tokens in the architecture to capture temporal information at fine temporal resolution.
We show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs.
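LITA's time tokens represent timestamps relative to the video length; a minimal sketch of such relative quantization is given below. The number of bins and the "<tN>" spelling are assumptions, not the paper's actual vocabulary.

```python
# Sketch of relative time tokens: quantize an absolute timestamp into one of a
# fixed number of bins spanning the whole video, so one small vocabulary covers
# clips of any length. Bin count and token format are illustrative assumptions.
NUM_TIME_TOKENS = 100

def to_time_token(timestamp_s: float, duration_s: float) -> str:
    frac = min(max(timestamp_s / duration_s, 0.0), 1.0)
    idx = min(int(frac * NUM_TIME_TOKENS), NUM_TIME_TOKENS - 1)
    return f"<t{idx}>"

def from_time_token(token: str, duration_s: float) -> float:
    idx = int(token.strip("<t>"))                         # recover the bin index
    return (idx + 0.5) / NUM_TIME_TOKENS * duration_s     # bin center, in seconds

print(to_time_token(42.0, 300.0))                 # "<t14>" for a 5-minute video
print(round(from_time_token("<t14>", 300.0), 1))  # 43.5 (quantization error ~1.5 s)
```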
arXiv Detail & Related papers (2024-03-27T22:50:48Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding [20.037781644877388]
TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding.
Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths.
arXiv Detail & Related papers (2023-12-04T17:09:52Z)
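TimeChat's sliding video Q-Former yields a token sequence whose length scales with video length. The sketch below mimics only that length-scaling behavior: a window slides over timestamp-aware frame features and each window is compressed to a few tokens, with mean pooling as a stand-in for the learned Q-Former queries. Window size, stride, and tokens-per-window are assumptions.

```python
# Minimal stand-in for a sliding-window compressor: each window of frame
# features becomes a fixed number of tokens, so the total video token count
# grows with video length. Mean pooling replaces the learned Q-Former queries;
# window, stride, and tokens_per_window are illustrative assumptions.
import torch

def sliding_window_tokens(frame_feats: torch.Tensor, window: int = 32,
                          stride: int = 32, tokens_per_window: int = 4) -> torch.Tensor:
    # frame_feats: (num_frames, dim) timestamp-aware frame features.
    chunks = []
    for start in range(0, frame_feats.shape[0], stride):
        win = frame_feats[start:start + window]      # up to `window` frames
        pooled = win.mean(dim=0, keepdim=True)       # (1, dim) summary of the window
        chunks.append(pooled.repeat(tokens_per_window, 1))
    return torch.cat(chunks, dim=0)                  # (num_windows * tokens_per_window, dim)

short = sliding_window_tokens(torch.randn(64, 768))   # 2 windows  -> 8 tokens
long = sliding_window_tokens(torch.randn(640, 768))   # 20 windows -> 80 tokens
print(short.shape, long.shape)                        # [8, 768] and [80, 768]
```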
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.