HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression
- URL: http://arxiv.org/abs/2601.07366v1
- Date: Mon, 12 Jan 2026 09:41:31 GMT
- Title: HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression
- Authors: Haoxuan Li, Mengyan Li, Junjun Zheng
- Abstract summary: We introduce the E-commerce Hierarchical Video Captioning dataset with dual-granularity, temporally grounded annotations. We adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions. We propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues.
- Score: 7.305586811678626
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating structured narrations for real-world e-commerce videos requires models to perceive fine-grained visual details and organize them into coherent, high-level stories--capabilities that existing approaches struggle to unify. We introduce the E-commerce Hierarchical Video Captioning (E-HVC) dataset with dual-granularity, temporally grounded annotations: a Temporal Chain-of-Thought that anchors event-level observations and Chapter Summaries that compose them into concise, story-centric summaries. Rather than directly prompting for chapters, we adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions, then refines coarse annotations into precise chapter boundaries and titles conditioned on the Temporal Chain-of-Thought, yielding fact-grounded, time-aligned narratives. We also observe that e-commerce videos are fast-paced and information-dense, with visual tokens dominating the input sequence. To enable efficient training while reducing input tokens, we propose the Scene-Primed ASR-anchored Compressor (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues. Built upon these designs, our HiVid-Narrator framework achieves superior narrative quality with fewer input tokens compared to existing methods.
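The abstract describes the SPA-Compressor only at a high level. As a rough illustration of the general idea (not the paper's actual architecture), ASR-guided hierarchical token compression can be sketched as cross-attention pooling: query vectors derived from ASR semantic cues each attend over the visual tokens and summarize them into a small number of event and scene representations. All names, shapes, and the two-level hierarchy below are hypothetical stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def compress_tokens(tokens, anchor_queries):
    """Cross-attention pooling: each anchor query (e.g. an ASR-derived
    semantic cue) attends over the input tokens and emits one compressed
    representation, so the output has one row per anchor."""
    d = tokens.shape[-1]
    attn = softmax(anchor_queries @ tokens.T / np.sqrt(d), axis=-1)
    return attn @ tokens  # shape: (num_anchors, d)

# Hypothetical shapes: 1,000 frame tokens compressed to 8 event tokens,
# then those 8 event tokens compressed to 2 scene tokens.
rng = np.random.default_rng(0)
frame_tokens = rng.standard_normal((1000, 64))
event_queries = rng.standard_normal((8, 64))   # stand-ins for ASR event cues
scene_queries = rng.standard_normal((2, 64))   # stand-ins for scene-level cues

event_tokens = compress_tokens(frame_tokens, event_queries)  # (8, 64)
scene_tokens = compress_tokens(event_tokens, scene_queries)  # (2, 64)
print(event_tokens.shape, scene_tokens.shape)
```

The token saving in this sketch comes purely from the shapes: 1,000 visual tokens enter, and only 8 + 2 compressed tokens would be handed to the language model.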
Related papers
- Token Reduction via Local and Global Contexts Optimization for Efficient Video Large Language Models [61.11154533305096]
Video Large Language Models (VLLMs) demonstrate strong video understanding but suffer from inefficiency due to redundant visual tokens. We propose a new perspective that elaborates token Anchors within intra-frame and inter-frame contexts. Our proposed AOT obtains competitive performance across various short- and long-video benchmarks on leading video LLMs.
arXiv Detail & Related papers (2026-03-02T03:06:40Z) - TimeChat-Captioner: Scripting Multi-Scene Videos with Time-Aware and Structural Audio-Visual Captions [64.27159505605312]
This paper proposes Omni Captioning, a novel task designed to generate continuous, fine-grained, and structured audio-visual narratives with explicit timestamps. To ensure dense semantic coverage, we introduce a six-dimensional structural schema to create "script-like" captions. Extensive experiments demonstrate that TimeChat-Captioner-7B achieves state-of-the-art performance, surpassing Gemini-2.5-Pro.
arXiv Detail & Related papers (2026-02-09T14:21:58Z) - STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative [55.05324155854762]
We introduce a SToryboard-Anchored GEneration (STAGE) workflow to reformulate the multi-shot video generation task. Instead of using sparse frames, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained narratives for story progression, cinematic attributes, and human preferences.
arXiv Detail & Related papers (2025-12-13T15:57:29Z) - ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries [77.41072125938636]
ARC-Chapter is the first large-scale video chaptering model, trained on million-scale long-video chapter data. It unifies ASR transcripts, scene texts, and visual captions into multi-level annotations, from short titles to long summaries. It establishes a new state-of-the-art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score.
arXiv Detail & Related papers (2025-11-18T10:53:14Z) - Identity-Preserving Text-to-Video Generation Guided by Simple yet Effective Spatial-Temporal Decoupled Representations [131.33758144860988]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. Current end-to-end frameworks suffer from a critical spatial-temporal trade-off. We propose a simple yet effective spatial-temporal decoupled framework that decomposes representations into spatial features for layouts and temporal features for motion dynamics.
arXiv Detail & Related papers (2025-07-07T06:54:44Z) - SceneRAG: Scene-level Retrieval-Augmented Generation for Video Understanding [6.980340270823506]
We present SceneRAG, a framework to segment videos into narrative-consistent scenes. For each scene, the framework fuses information from both visual and textual modalities to extract entity relations. Experiments on the LongerVideos benchmark, featuring over 134 hours of diverse content, confirm that SceneRAG substantially outperforms prior baselines.
arXiv Detail & Related papers (2025-06-09T10:00:54Z) - "Principal Components" Enable A New Language of Images [79.45806370905775]
We introduce a novel visual tokenization framework that embeds a provable PCA-like structure into the latent token space. Our approach achieves state-of-the-art reconstruction performance and enables better interpretability to align with the human vision system.
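The blurb refers to a latent token space with a PCA-like ordering. As a generic illustration of the underlying idea only (ordinary PCA on flattened patches, not this paper's tokenizer), components can be ordered by explained variance so that reconstructions refine coarse-to-fine as more "tokens" are kept. The patch data below is a random stand-in.

```python
import numpy as np

# Hypothetical data: 200 flattened 8x8 "patches" as stand-ins for image tokens.
rng = np.random.default_rng(1)
patches = rng.standard_normal((200, 64))

# Classical PCA via SVD: rows of `components` are principal directions,
# ordered by decreasing singular value (explained variance).
centered = patches - patches.mean(axis=0)
_, singular_values, components = np.linalg.svd(centered, full_matrices=False)

def reconstruct(k):
    """Project onto the top-k components and map back to patch space."""
    top = components[:k]
    return centered @ top.T @ top

# Keeping more leading components can only reduce reconstruction error.
errors = [np.linalg.norm(centered - reconstruct(k)) for k in (4, 16, 64)]
print(errors)
```

The PCA-like property the blurb alludes to is visible here: truncating the token sequence at any prefix length still yields the best possible reconstruction for that budget.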
arXiv Detail & Related papers (2025-03-11T17:59:41Z) - Bridging Vision and Language: Modeling Causality and Temporality in Video Narratives [0.0]
We propose an enhanced framework that integrates a Causal-Temporal Reasoning Module (CTRM) into state-of-the-art LVLMs. CTRM comprises two key components: the Causal Dynamics Encoder (CDE) and the Temporal Relational Learner (TRL). We design a multi-stage learning strategy to optimize the model, combining pre-training on large-scale video-text datasets.
arXiv Detail & Related papers (2024-12-14T07:28:38Z) - NarrativeBridge: Enhancing Video Captioning with Causal-Temporal Narrative [19.79736018383692]
Existing video captioning benchmarks and models lack causal-temporal narrative. This lack of narrative restricts models' ability to generate text descriptions that capture the causal and temporal dynamics inherent in video content. We propose NarrativeBridge, an approach comprising: (1) a novel Causal-Temporal Narrative (CTN) captions benchmark generated using a large language model and few-shot prompting; and (2) a Cause-Effect Network (CEN) with separate encoders for capturing cause and effect dynamics.
arXiv Detail & Related papers (2024-06-10T17:34:24Z) - Leveraging Temporal Contextualization for Video Action Recognition [47.8361303269338]
We propose a framework for video understanding called Temporally Contextualized CLIP (TC-CLIP).
We introduce Temporal Contextualization (TC), a layer-wise temporal information infusion mechanism for videos.
The Video-Prompting (VP) module processes context tokens to generate informative prompts in the text modality.
arXiv Detail & Related papers (2024-04-15T06:24:56Z) - Screenplay Summarization Using Latent Narrative Structure [78.45316339164133]
We propose to explicitly incorporate the underlying structure of narratives into general unsupervised and supervised extractive summarization models.
We formalize narrative structure in terms of key narrative events (turning points) and treat it as latent in order to summarize screenplays.
Experimental results on the CSI corpus of TV screenplays, which we augment with scene-level summarization labels, show that latent turning points correlate with important aspects of a CSI episode.
arXiv Detail & Related papers (2020-04-27T11:54:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.