TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors
- URL: http://arxiv.org/abs/2601.02908v1
- Date: Tue, 06 Jan 2026 10:45:53 GMT
- Title: TA-Prompting: Enhancing Video Large Language Models for Dense Video Captioning via Temporal Anchors
- Authors: Wei-Yuan Cheng, Kai-Po Chang, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang
- Abstract summary: Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. We propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding.
- Score: 40.48528326378281
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense video captioning aims to interpret and describe all temporally localized events throughout an input video. Recent state-of-the-art methods leverage large language models (LLMs) to provide detailed moment descriptions for video data. However, existing VideoLLMs struggle to identify precise event boundaries in untrimmed videos, so the generated captions are not properly grounded. In this paper, we propose TA-Prompting, which enhances VideoLLMs via Temporal Anchors that learn to precisely localize events and prompt the VideoLLMs to perform temporal-aware video event understanding. During inference, to properly determine the output caption sequence from an arbitrary number of events present within a video, we introduce an event coherent sampling strategy that selects event captions with sufficient coherence across temporal events and cross-modal similarity with the given video. Through extensive experiments on benchmark datasets, we show that TA-Prompting performs favorably against state-of-the-art VideoLLMs, yielding superior performance on dense video captioning and temporal understanding tasks, including moment retrieval and temporal QA.
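The abstract does not spell out the scoring used by event coherent sampling, but a minimal sketch of the selection rule it describes might combine inter-caption coherence with caption-video similarity. The cosine-based scoring and the `alpha` weight below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def sequence_score(caption_embs, video_emb, alpha=0.5):
    """Score one candidate caption sequence.

    caption_embs: (n_events, d) text embeddings in temporal order.
    video_emb: (d,) embedding of the full video.
    """
    caption_embs = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    video_emb = video_emb / np.linalg.norm(video_emb)
    # Coherence: mean cosine similarity between temporally adjacent captions.
    if len(caption_embs) > 1:
        coherence = float(np.mean(np.sum(caption_embs[:-1] * caption_embs[1:], axis=1)))
    else:
        coherence = 0.0
    # Cross-modal similarity: mean cosine similarity of each caption to the video.
    cross_modal = float(np.mean(caption_embs @ video_emb))
    return alpha * coherence + (1.0 - alpha) * cross_modal

def event_coherent_sampling(candidate_sequences, video_emb):
    """Pick the candidate caption sequence with the best combined score."""
    return max(candidate_sequences, key=lambda c: sequence_score(c, video_emb))
```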
Related papers
- VideoPerceiver: Enhancing Fine-Grained Temporal Perception in Video Multimodal Large Language Models [9.896951371033229]
VideoPerceiver is a video multimodal large language model (VMLLM) that enhances fine-grained perception in video understanding. We construct "key-information-missing" videos by extracting event-action keywords from captions, identifying the corresponding key frames, and replacing them with adjacent frames. Experiments show VideoPerceiver substantially outperforms state-of-the-art VMLLMs on fine-grained action understanding and rare event captioning benchmarks.
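A rough sketch of the "key-information-missing" construction is shown below; keyword extraction and key-frame matching are assumed to happen upstream, and the replacement policy is illustrative:

```python
def mask_key_frames(frames, key_indices, offset=2):
    """Replace each key frame with a nearby non-key frame, keeping length fixed."""
    key_set = set(key_indices)
    masked = list(frames)
    for i in key_indices:
        # Prefer frames `offset` steps away, then immediate neighbors.
        for j in (i - offset, i + offset, i - 1, i + 1):
            if 0 <= j < len(frames) and j not in key_set:
                masked[i] = frames[j]  # action evidence removed, context kept
                break
    return masked
```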
arXiv Detail & Related papers (2025-11-24T06:57:26Z)
- Time-Scaling State-Space Models for Dense Video Captioning [29.405515743544687]
State-Space Models with Transfer State is a time-scaling approach to dense video captioning. It is suitable for generating captions on the fly, in an online or streaming manner, without waiting for the full video to be processed. When applied to dense video captioning, the approach scales well with video length and uses 7x fewer FLOPs.
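A minimal sketch of the transfer-state idea follows (the recurrence and shapes are illustrative, not the paper's architecture): the state left after one chunk seeds the next, so captions can be emitted while the video streams in.

```python
import numpy as np

class StreamingSSM:
    """Toy linear-recurrent encoder whose state transfers across chunks."""
    def __init__(self, d_state, d_in, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.normal(0.0, 0.1, (d_state, d_state))  # state transition
        self.B = rng.normal(0.0, 0.1, (d_state, d_in))     # input projection
        self.state = np.zeros(d_state)                     # transferred state

    def process_chunk(self, chunk):
        """chunk: (T, d_in) frame features; returns per-frame states."""
        outs = []
        for x in chunk:
            self.state = np.tanh(self.A @ self.state + self.B @ x)
            outs.append(self.state.copy())
        return np.stack(outs)  # a caption decoder would read from these
```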
arXiv Detail & Related papers (2025-09-03T15:56:20Z)
- Controllable Hybrid Captioner for Improved Long-form Video Understanding [1.2035789357951119]
Video data is extremely dense and high-dimensional. Text-based summaries of video content offer a way to represent it far more compactly than raw video. We introduce Vision Language Models (VLMs) to enrich the memory with static scene descriptions.
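An illustrative sketch (not the paper's code) of such a text-based memory, mixing event captions with VLM scene descriptions:

```python
memory = []  # (start_sec, end_sec, kind, text) entries

def add_event(t0, t1, caption):
    memory.append((t0, t1, "event", caption))

def add_scene(t0, t1, vlm_description):
    memory.append((t0, t1, "scene", vlm_description))

def render_memory():
    """Chronological text summary: far more compact than raw frames."""
    return "\n".join(f"[{t0:.0f}-{t1:.0f}s] ({kind}) {text}"
                     for t0, t1, kind, text in sorted(memory))
```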
arXiv Detail & Related papers (2025-07-22T22:09:00Z)
- Universal Video Temporal Grounding with Generative Multi-modal Large Language Models [59.781211641591405]
This paper presents a computational model for universal video temporal grounding, which accurately localizes temporal moments in videos based on natural language queries. We propose UniTime, a robust and universal video grounding model leveraging the strong vision-language understanding capabilities of generative Multi-modal Large Language Models (MLLMs). Our model effectively handles videos of diverse views, genres, and lengths while comprehending complex language queries.
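One plausible interface for MLLM-based grounding is sketched below; the prompt wording, frame-token format, and output parser are assumptions for illustration, not UniTime's actual design:

```python
import re

def build_grounding_prompt(query, timestamps):
    frames = " ".join(f"<frame t={t:.1f}s>" for t in timestamps)
    return (f"{frames}\nQuery: {query}\n"
            "Answer with the moment boundaries as 'start-end' in seconds.")

def parse_moment(mllm_output):
    """Extract a (start, end) pair in seconds from the model's reply."""
    m = re.search(r"(\d+(?:\.\d+)?)\s*-\s*(\d+(?:\.\d+)?)", mllm_output)
    return (float(m.group(1)), float(m.group(2))) if m else None
```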
arXiv Detail & Related papers (2025-06-23T17:53:18Z)
- Training-free Video Temporal Grounding using Large-scale Pre-trained Models [41.71055776623368]
Video temporal grounding aims to identify video segments within untrimmed videos that are most relevant to a given natural language query.
Existing video temporal localization models rely on specific datasets for training and have high data collection costs.
We propose a Training-Free Video Temporal Grounding approach that leverages the capabilities of large-scale pre-trained models.
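A minimal training-free baseline in this spirit, assuming frame and query embeddings come from a pre-trained vision-language model such as CLIP; the sliding-window scoring below is illustrative:

```python
import numpy as np

def ground_query(frame_embs, query_emb, min_len=2, max_len=16):
    """Return the (start, end) frame window with the highest mean similarity."""
    frame_embs = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    query_emb = query_emb / np.linalg.norm(query_emb)
    sims = frame_embs @ query_emb                  # per-frame relevance
    best, best_score = None, -np.inf
    for s in range(len(sims)):
        for e in range(s + min_len, min(s + max_len, len(sims)) + 1):
            score = sims[s:e].mean()               # mean relevance of the window
            if score > best_score:
                best, best_score = (s, e), score
    return best
```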
arXiv Detail & Related papers (2024-08-29T02:25:12Z)
- EA-VTR: Event-Aware Video-Text Retrieval [97.30850809266725]
The Event-Aware Video-Text Retrieval (EA-VTR) model achieves strong video-text retrieval through superior video event awareness.
EA-VTR can efficiently encode frame-level and video-level visual representations simultaneously, enabling cross-modal alignment of detailed event content and complex event temporal structure.
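The dual-granularity encoding can be sketched as below (the pooling and scoring choices are assumptions, not taken from the EA-VTR paper): frame-level vectors let individual event phrases find their best-matching moments, while the pooled video-level vector matches the full sentence.

```python
import numpy as np

def encode_video(frame_embs):
    frame_level = frame_embs               # (T, d): one vector per frame
    video_level = frame_embs.mean(axis=0)  # (d,): pooled clip-level summary
    return frame_level, video_level

def event_alignment_score(frame_level, event_phrase_embs):
    """Each event phrase attends to its best-matching frame (max over time)."""
    sims = event_phrase_embs @ frame_level.T       # (n_events, T)
    return float(sims.max(axis=1).mean())
```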
arXiv Detail & Related papers (2024-07-10T09:09:58Z)
- MLLM as Video Narrator: Mitigating Modality Imbalance in Video Moment Retrieval [53.417646562344906]
Video Moment Retrieval (VMR) aims to localize a specific temporal segment within an untrimmed long video given a natural language query.
Existing methods often suffer from inadequate training annotations, i.e., each sentence typically matches only a fraction of the prominent foreground video content, with limited wording diversity.
This intrinsic modality imbalance leaves a considerable portion of the visual information unaligned with text.
In this work, we take an MLLM as a video narrator to generate plausible textual descriptions of the video, thereby mitigating the modality imbalance and improving temporal localization.
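A hedged sketch of the narration-as-augmentation pipeline; `mllm_describe` is a placeholder for whatever captioning model is used:

```python
def augment_annotations(segments, annotations, mllm_describe):
    """segments: list of (t0, t1); annotations: dict (t0, t1) -> sentence."""
    augmented = dict(annotations)
    for seg in segments:
        if seg not in augmented:                 # foreground already annotated
            augmented[seg] = mllm_describe(seg)  # narrate the unaligned content
    return augmented
```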
arXiv Detail & Related papers (2024-06-25T18:39:43Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that takes into account a longer span of subtitle text, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
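The windowed prompting can be sketched as follows; the prompt wording and window size are illustrative, not HowToCaption's exact setup:

```python
def build_prompts(subtitles, window=8):
    """subtitles: list of (start_sec, text); yields one prompt per window."""
    for i in range(0, len(subtitles), window):
        block = "\n".join(f"[{t:.0f}s] {txt}" for t, txt in subtitles[i:i + window])
        yield ("Rewrite these noisy speech subtitles into grounded video "
               "captions, one timestamped caption per line:\n" + block)
```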
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations, which deprives models of contextual information between potential events and query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions, and then obtain auxiliary captions via Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and …
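The summary truncates before defining NACS, so the sketch below encodes only one plausible reading: keep generated captions for moments the existing annotations leave uncovered, suppressing the rest. The IoU threshold and filtering rule are assumptions.

```python
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])  # hull equals union when overlapping
    return inter / union if union > 0 else 0.0

def suppress_non_auxiliary(dense_captions, annotated_moments, iou_thr=0.5):
    """dense_captions: list of ((t0, t1), text); returns auxiliary captions."""
    return [(seg, txt) for seg, txt in dense_captions
            if all(temporal_iou(seg, m) < iou_thr for m in annotated_moments)]
```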
arXiv Detail & Related papers (2023-01-15T02:04:02Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves state-of-the-art (SotA) results.
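A minimal sketch of a two-level contrastive objective in this spirit (a standard symmetric InfoNCE at each level; HierVL's exact losses and aggregation scheme are assumptions here):

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.T / tau
    target = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, target) + F.cross_entropy(logits.T, target))

def hierarchical_loss(clip_v, clip_t, video_t, clips_per_video):
    """clip_v, clip_t: (B*K, d) clip embeddings/narrations; video_t: (B, d)."""
    loss_clip = info_nce(clip_v, clip_t)      # short-term: clip <-> narration
    video_v = clip_v.view(-1, clips_per_video, clip_v.size(-1)).mean(dim=1)
    loss_video = info_nce(video_v, video_t)   # long-term: video <-> summary text
    return loss_clip + loss_video
```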
arXiv Detail & Related papers (2023-01-05T21:53:19Z)