TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
- URL: http://arxiv.org/abs/2602.08711v2
- Date: Thu, 12 Feb 2026 06:06:38 GMT
- Title: TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions
- Authors: Linli Yao, Yuancheng Wei, Yaojie Zhang, Lei Li, Xinlong Chen, Feifan Song, Ziyue Wang, Kun Ouyang, Yuanxin Liu, Lingpeng Kong, Qi Liu, Pengfei Wan, Kun Gai, Yuanxing Zhang, Xu Sun
- Abstract summary: This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro.
- Score: 64.27159505605312
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes Omni Dense Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions, enabling readers to vividly imagine the video content scene by scene, akin to a cinematographic screenplay. To facilitate research, we construct OmniDCBench, a high-quality, human-annotated benchmark, and propose SodaM, a unified metric that evaluates time-aware detailed descriptions while mitigating scene boundary ambiguity. Furthermore, we construct a training dataset, TimeChatCap-42K, and present TimeChat-Captioner-7B, a strong baseline trained via SFT and GRPO with task-specific rewards. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro, while its generated dense descriptions significantly boost downstream capabilities in audio-visual reasoning (DailyOmni and WorldSense) and temporal grounding (Charades-STA). All datasets, models, and code will be made publicly available at https://github.com/yaolinli/TimeChat-Captioner.
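The abstract does not enumerate the six dimensions of the structural schema, so the sketch below is purely illustrative: a minimal Python representation of a timestamped, "script-like" caption entry, with hypothetical dimension names standing in for the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical placeholder dimensions -- NOT the schema defined in the paper.
SCHEMA_DIMENSIONS = [
    "scene_setting", "characters", "actions",
    "camera_work", "dialogue_or_speech", "ambient_audio",
]

@dataclass
class SceneCaption:
    start_sec: float                                            # scene start timestamp (seconds)
    end_sec: float                                              # scene end timestamp (seconds)
    dimensions: Dict[str, str] = field(default_factory=dict)   # one description per schema dimension

    def to_script_line(self) -> str:
        """Render the entry as a single screenplay-style line."""
        body = " | ".join(f"{k}: {v}" for k, v in self.dimensions.items())
        return f"[{self.start_sec:06.1f}-{self.end_sec:06.1f}] {body}"

# Example: two consecutive scenes forming a continuous, timestamped narrative.
script: List[SceneCaption] = [
    SceneCaption(0.0, 12.4, {"scene_setting": "sunlit kitchen",
                             "actions": "a chef chops onions",
                             "ambient_audio": "knife tapping on a wooden board"}),
    SceneCaption(12.4, 27.9, {"scene_setting": "same kitchen, closer shot",
                              "actions": "oil sizzles in a pan",
                              "dialogue_or_speech": "the chef explains the next step"}),
]
for scene in script:
    print(scene.to_script_line())
```

Representing each scene as a timestamped record with per-dimension fields makes it easy to render a whole video as a screenplay-style script or to check semantic coverage dimension by dimension, which is the spirit of the "script-like" captions described above.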
Related papers
- HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression [7.305586811678626]
We introduce the E-commerce Hierarchical Video Captioning dataset with dual-granularity, temporally grounded annotations. We adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions. We propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues.
arXiv Detail & Related papers (2026-01-12T09:41:31Z) - TextVidBench: A Benchmark for Long Video Scene Text Understanding [60.94150574231576]
We introduce TextVidBench, the first benchmark specifically designed for long-video text question answering (>3 minutes). TextVidBench makes three key contributions. It spans 9 categories (e.g., news, sports, gaming), with an average video length of 2306 seconds, enabling more realistic evaluation of long-video understanding. We propose an efficient paradigm for improving large models through: (i) introducing the IT-Rope mechanism and temporal prompt engineering to enhance temporal perception, (ii) adopting non-uniform positional encoding to better handle long video sequences, and (iii) applying lightweight fine-tuning on
arXiv Detail & Related papers (2025-06-05T12:54:56Z) - DisTime: Distribution-based Time Representation for Video Large Language Models [23.176698643825123]
DisTime is a lightweight framework designed to enhance temporal comprehension in Video-LLMs. DisTime employs a learnable token to create a continuous temporal embedding space. DisTime achieves state-of-the-art performance across benchmarks in three time-sensitive tasks.
arXiv Detail & Related papers (2025-05-30T08:10:18Z) - AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [89.73538448786405]
We propose AuroraCap, a video captioner based on a large multimodal model. We implement a token merging strategy, reducing the number of input visual tokens. AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z) - TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding [20.037781644877388]
TimeChat is a time-sensitive multimodal large language model specifically designed for long video understanding.
Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame, and (2) a sliding video Q-Former that produces a video token sequence of varying lengths.
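As a rough illustration of the first contribution, the snippet below sketches one common way to bind visual content to per-frame timestamps: project each timestamp into the visual feature space and fuse it with the frame features. This is a minimal sketch of the general idea under that assumption, not TimeChat's actual encoder.

```python
import torch
import torch.nn as nn

class TimestampAwareFrameEncoder(nn.Module):
    """Illustrative sketch: condition each frame's visual features on an embedding
    of its timestamp, so downstream layers can reason about *when* content appears.
    Not TimeChat's exact module, only the general binding idea."""

    def __init__(self, vis_dim: int = 768):
        super().__init__()
        # Map a scalar timestamp (in seconds) into the visual feature dimension.
        self.time_mlp = nn.Sequential(
            nn.Linear(1, vis_dim), nn.GELU(), nn.Linear(vis_dim, vis_dim)
        )

    def forward(self, frame_feats: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_frames, vis_dim); timestamps: (num_frames,) in seconds
        time_emb = self.time_mlp(timestamps.unsqueeze(-1))   # (num_frames, vis_dim)
        return frame_feats + time_emb                        # timestamp-conditioned frame features

# Toy usage: 8 frames sampled at 2-second intervals.
enc = TimestampAwareFrameEncoder(vis_dim=768)
feats = torch.randn(8, 768)
ts = torch.arange(8, dtype=torch.float32) * 2.0
print(enc(feats, ts).shape)  # torch.Size([8, 768])
```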
arXiv Detail & Related papers (2023-12-04T17:09:52Z) - Exploiting Auxiliary Caption for Video Grounding [66.77519356911051]
Video grounding aims to locate a moment of interest matching a given query sentence from an untrimmed video.
Previous works ignore the sparsity dilemma in video annotations, which fails to provide context between potential events and query sentences in the dataset.
We propose an Auxiliary Caption Network (ACNet) for video grounding. Specifically, we first introduce dense video captioning to generate dense captions and then obtain auxiliary captions by Non-Auxiliary Caption Suppression (NACS).
To capture the potential information in auxiliary captions, we propose Caption Guided Attention (CGA) to project the semantic relations between auxiliary captions and
arXiv Detail & Related papers (2023-01-15T02:04:02Z) - HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA.
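A minimal sketch of what such a two-level objective can look like, assuming paired clip-level and video-level embeddings and a standard symmetric InfoNCE loss; this illustrates the general recipe of combining clip-level and video-level alignment rather than HierVL's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings (standard contrastive loss)."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def hierarchical_contrastive_loss(clip_vis, clip_txt, video_vis, video_txt,
                                  w_clip: float = 1.0, w_video: float = 1.0):
    # Clip level: short-term alignment between individual clips and their narrations.
    # Video level: long-term alignment between whole-video representations and summaries.
    return w_clip * info_nce(clip_vis, clip_txt) + w_video * info_nce(video_vis, video_txt)

# Toy usage with random embeddings (16 clips drawn from a batch of 4 videos).
loss = hierarchical_contrastive_loss(
    torch.randn(16, 256), torch.randn(16, 256),
    torch.randn(4, 256), torch.randn(4, 256),
)
print(loss.item())
```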
arXiv Detail & Related papers (2023-01-05T21:53:19Z) - HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z) - Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)