ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
- URL: http://arxiv.org/abs/2511.14349v1
- Date: Tue, 18 Nov 2025 10:53:14 GMT
- Title: ARC-Chapter: Structuring Hour-Long Videos into Navigable Chapters and Hierarchical Summaries
- Authors: Junfu Pu, Teng Wang, Yixiao Ge, Yuying Ge, Chen Li, Ying Shan,
- Abstract summary: ARC-Chapter is the first large-scale video chaptering model, trained on millions of chapter annotations for long videos. It unifies ASR transcripts, scene text, and visual captions into multi-level annotations, from short titles to long summaries. It establishes a new state of the art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score.
- Score: 77.41072125938636
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The proliferation of hour-long videos (e.g., lectures, podcasts, documentaries) has intensified demand for efficient content structuring. However, existing approaches are constrained by small-scale training with annotations that are typically short and coarse, restricting generalization to nuanced transitions in long videos. We introduce ARC-Chapter, the first large-scale video chaptering model, trained on millions of chapter annotations for long videos, featuring bilingual, temporally grounded, and hierarchical chapter annotations. To achieve this goal, we curated a bilingual English-Chinese chapter dataset via a structured pipeline that unifies ASR transcripts, scene text, and visual captions into multi-level annotations, from short titles to long summaries. We demonstrate clear performance improvements with data scaling, both in data volume and label intensity. Moreover, we design a new evaluation metric termed GRACE, which incorporates many-to-one segment overlaps and semantic similarity, better reflecting real-world chaptering flexibility. Extensive experiments demonstrate that ARC-Chapter establishes a new state of the art by a significant margin, outperforming the previous best by 14.0% in F1 score and 11.3% in SODA score. In addition, ARC-Chapter shows excellent transferability, improving the state of the art on downstream tasks like dense video captioning on YouCook2.
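The abstract only sketches GRACE at a high level (many-to-one segment overlaps combined with semantic similarity), so the following is a minimal illustrative stand-in, not the paper's actual metric: each predicted chapter is matched many-to-one to its best-overlapping reference chapter, then scored by temporal IoU weighted by a crude token-level title similarity. The function names, the Jaccard similarity stand-in, and the averaging scheme are all assumptions for illustration.

```python
def interval_overlap(a, b):
    """Length of the intersection of two (start, end) intervals."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def token_similarity(t1, t2):
    """Jaccard similarity over lowercase word sets -- a simple stand-in
    for the learned semantic similarity a GRACE-like metric would use."""
    s1, s2 = set(t1.lower().split()), set(t2.lower().split())
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)

def grace_like_score(pred, ref):
    """pred/ref: lists of (start, end, title) chapters.
    Each predicted chapter is assigned to the reference chapter with
    maximal temporal overlap (many-to-one: several predictions may map
    to one reference), scored by IoU * title similarity, then averaged."""
    if not pred or not ref:
        return 0.0
    total = 0.0
    for p in pred:
        best = max(ref, key=lambda r: interval_overlap(p, r))
        inter = interval_overlap(p, best)
        union = (p[1] - p[0]) + (best[1] - best[0]) - inter
        iou = inter / union if union > 0 else 0.0
        total += iou * token_similarity(p[2], best[2])
    return total / len(pred)

pred = [(0, 60, "intro and setup"), (60, 200, "main lecture content")]
ref = [(0, 50, "intro"), (50, 210, "lecture content")]
print(round(grace_like_score(pred, ref), 3))
```

Unlike a strict one-to-one F1 on boundaries, this formulation does not penalize a prediction merely for splitting one reference chapter in two, which is the kind of real-world flexibility the abstract attributes to GRACE.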
Related papers
- HiVid-Narrator: Hierarchical Video Narrative Generation with Scene-Primed ASR-anchored Compression [7.305586811678626]
We introduce the E-commerce Hierarchical Video Captioning dataset with dual-granularity, temporally grounded annotations. We adopt a staged construction that first gathers reliable linguistic and visual evidence via curated ASR and frame-level descriptions. We propose the Scene-Primed ASR-anchored Caption (SPA-Compressor), which compresses multimodal tokens into hierarchical scene and event representations guided by ASR semantic cues.
arXiv Detail & Related papers (2026-01-12T09:41:31Z) - Dense Video Captioning using Graph-based Sentence Summarization [80.52481563888459]
We propose a graph-based partition-and-summarization framework for dense video captioning. We focus on the "summarization" stage, and propose a framework that effectively exploits the relationship between semantic words for summarization.
arXiv Detail & Related papers (2025-06-25T16:23:43Z) - Chapter-Llama: Efficient Chaptering in Hour-Long Videos with LLMs [59.854331104466254]
We address the task of video chaptering, i.e., partitioning a long video timeline into semantic units and generating corresponding chapter titles. We propose a lightweight speech-guided frame selection strategy based on speech transcript content, and experimentally demonstrate remarkable advantages. Our results demonstrate substantial improvements over the state of the art on the recent VidChapters-7M benchmark.
arXiv Detail & Related papers (2025-03-31T17:41:29Z) - HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions [59.71751978599567]
This paper presents a novel annotation pipeline that uses pre-extracted features and dimensionality reduction to accelerate the temporal video annotation process.
We demonstrate significant improvements in annotation effort compared to traditional linear methods, achieving more than a 10x reduction in clicks required for annotating over 12 hours of video.
arXiv Detail & Related papers (2024-09-16T18:15:38Z) - VidLA: Video-Language Alignment at Scale [48.665918882615195]
We propose VidLA, an approach for video-language alignment at scale.
Our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks.
arXiv Detail & Related papers (2024-03-21T22:36:24Z) - HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA.
arXiv Detail & Related papers (2023-01-05T21:53:19Z) - Frame-wise Action Representations for Long Videos via Sequence Contrastive Learning [44.412145665354736]
We introduce a novel contrastive action representation learning framework to learn frame-wise action representations.
Inspired by the recent progress of self-supervised learning, we present a novel sequence contrastive loss (SCL) applied on two correlated views.
Our approach also shows outstanding performance on video alignment and fine-grained frame retrieval tasks.
arXiv Detail & Related papers (2022-03-28T17:59:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.