Fine-Grained Captioning of Long Videos through Scene Graph Consolidation
- URL: http://arxiv.org/abs/2502.16427v2
- Date: Mon, 07 Jul 2025 04:25:02 GMT
- Title: Fine-Grained Captioning of Long Videos through Scene Graph Consolidation
- Authors: Sanghyeok Chu, Seonguk Seo, Bohyung Han
- Abstract summary: We introduce a novel framework for long video captioning based on graph consolidation. Our approach first generates segment-level captions, corresponding to individual frames or short video intervals. A lightweight graph-to-text decoder then produces the final video-level caption.
- Score: 44.30028794237688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in vision-language models have led to impressive progress in caption generation for images and short video clips. However, these models remain constrained by their limited temporal receptive fields, making it difficult to produce coherent and comprehensive captions for long videos. While several methods have been proposed to aggregate information across video segments, they often rely on supervised fine-tuning or incur significant computational overhead. To address these challenges, we introduce a novel framework for long video captioning based on graph consolidation. Our approach first generates segment-level captions, corresponding to individual frames or short video intervals, using off-the-shelf visual captioning models. These captions are then parsed into individual scene graphs, which are subsequently consolidated into a unified graph representation that preserves both holistic context and fine-grained details throughout the video. A lightweight graph-to-text decoder then produces the final video-level caption. This framework effectively extends the temporal understanding capabilities of existing models without requiring any additional fine-tuning on long video datasets. Experimental results show that our method significantly outperforms existing LLM-based consolidation approaches, achieving strong zero-shot performance while substantially reducing computational costs.
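The pipeline in the abstract can be condensed into a short sketch. The snippet below is a minimal illustration only: the callables `parse_scene_graph` and `graph_to_text`, and the naive triplet-deduplication used for consolidation, are assumptions for exposition, not the authors' implementation.

```python
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)

def consolidate_graphs(graphs: List[List[Triplet]]) -> List[Triplet]:
    """Merge segment-level scene graphs into one video-level graph.

    This naive version only deduplicates identical triplets; the paper's
    consolidation additionally merges nodes that refer to the same entity.
    """
    merged = {}
    for graph in graphs:
        for subj, rel, obj in graph:
            merged[(subj.lower(), rel.lower(), obj.lower())] = (subj, rel, obj)
    return list(merged.values())

def caption_long_video(
    segment_captions: List[str],
    parse_scene_graph: Callable[[str], List[Triplet]],  # off-the-shelf textual scene graph parser
    graph_to_text: Callable[[List[Triplet]], str],      # lightweight graph-to-text decoder
) -> str:
    """Segment captions -> per-segment scene graphs -> consolidated graph -> video caption."""
    graphs = [parse_scene_graph(caption) for caption in segment_captions]
    video_graph = consolidate_graphs(graphs)
    return graph_to_text(video_graph)
```

Passing the parser and decoder as callables keeps the sketch agnostic to which off-the-shelf captioning model and graph-to-text decoder are plugged in.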
Related papers
- Dense Video Captioning using Graph-based Sentence Summarization [80.52481563888459]
We propose a graph-based partition-and-summarization framework for dense video captioning. We focus on the "summarization" stage and propose a framework that effectively exploits the relationship between semantic words for summarization.
arXiv Detail & Related papers (2025-06-25T16:23:43Z) - VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models [48.00262713744499]
VideoComp is a benchmark and learning framework for advancing video-text compositionality understanding. We create challenging negative samples with subtle temporal disruptions such as reordering, action word replacement, partial captioning, and combined disruptions. These benchmarks comprehensively test models' compositional sensitivity across extended, cohesive video-text sequences.
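As a rough illustration of the negative-sample construction summarized above (not VideoComp's actual code; the function name and modes are assumptions), temporally disrupted negatives could be produced like this:

```python
import random

def make_hard_negative(captions: list[str], mode: str = "reorder",
                       rng: random.Random | None = None) -> list[str]:
    """Corrupt an ordered caption sequence to create a contrastive negative."""
    rng = rng or random.Random(0)
    neg = list(captions)
    if mode == "reorder":        # shuffle the temporal order of segments
        rng.shuffle(neg)
    elif mode == "partial":      # keep only a prefix of the sequence
        neg = neg[: max(1, len(neg) // 2)]
    return neg
```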
arXiv Detail & Related papers (2025-04-04T22:24:30Z) - The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning [89.64905703368255]
We propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning.
Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences.
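One plausible way to organize the three memory banks mentioned above, sketched under my own assumptions (field and method names are not from the paper):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class MemoryBanks:
    """Three granularities of textual memory: noun phrases, triplets, sentences."""
    noun_phrases: List[str] = field(default_factory=list)
    scene_graphs: List[Tuple[str, str, str]] = field(default_factory=list)
    sentences: List[str] = field(default_factory=list)

    def add(self, sentence: str, phrases: List[str],
            triplets: List[Tuple[str, str, str]]) -> None:
        """Store one parsed caption at all three granularities."""
        self.sentences.append(sentence)
        self.noun_phrases.extend(phrases)
        self.scene_graphs.extend(triplets)
```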
arXiv Detail & Related papers (2025-03-31T03:00:19Z) - VideoAuteur: Towards Long Narrative Video Generation [22.915448471769384]
We present a large-scale cooking video dataset designed to advance long-form narrative generation in the cooking domain. We introduce a Long Narrative Video Director to enhance both visual and semantic coherence in generated videos. Our method demonstrates substantial improvements in generating visually detailed and semantically aligned videos.
arXiv Detail & Related papers (2025-01-10T18:52:11Z) - Progress-Aware Video Frame Captioning [55.23366888264651]
We propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. We develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. Results demonstrate that ProgressCaptioner significantly surpasses leading captioning models.
arXiv Detail & Related papers (2024-12-03T01:21:28Z) - Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning [71.94122309290537]
We propose an efficient, online approach to generate dense captions for videos.
Our model uses a novel autoregressive factorized decoding architecture.
Our approach shows excellent performance compared to both offline and online methods, and uses 20% less compute.
arXiv Detail & Related papers (2024-11-22T02:46:44Z) - Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
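A hedged sketch of the protocol summarized above: pseudo-caption sampled frames with an off-the-shelf image captioner and use the results as text supervision for retrieval training (the names here are illustrative, not the paper's API).

```python
from typing import Callable, List, Sequence

def pseudo_label_video(frames: Sequence, image_captioner: Callable[[object], str],
                       sample_every: int = 30) -> List[str]:
    """Caption every Nth frame so each unlabeled video gets free text supervision."""
    return [image_captioner(frame) for frame in frames[::sample_every]]
```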
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - Scaling Up Video Summarization Pretraining with Large Language Models [73.74662411006426]
We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset.
We analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them.
Our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals.
arXiv Detail & Related papers (2024-04-04T11:59:06Z) - From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation.
arXiv Detail & Related papers (2024-04-01T04:21:01Z) - VidLA: Video-Language Alignment at Scale [48.665918882615195]
We propose VidLA, an approach for video-language alignment at scale.
Our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks.
arXiv Detail & Related papers (2024-03-21T22:36:24Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling a video's temporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - GL-RG: Global-Local Representation Granularity for Video Captioning [52.56883051799501]
We propose GL-RG, a Global-Local Representation Granularity framework for video captioning.
Our GL-RG demonstrates three advantages over prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder that produces a rich semantic vocabulary, yielding a descriptive granularity of video contents across frames; and 3) we develop an incremental training strategy that organizes model learning in an incremental fashion to incur optimal captioning behavior.
arXiv Detail & Related papers (2022-05-22T02:00:09Z) - CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of information from both vision and language, forcing the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z) - SG2Caps: Revisiting Scene Graphs for Image Captioning [37.58310822924814]
We propose a framework, SG2Caps, that utilizes only scene graph labels for competitive image captioning performance.
Our framework outperforms existing scene-graph-only captioning models by a large margin (CIDEr score of 110 vs. 71), indicating that scene graphs are a promising representation for image captioning.
arXiv Detail & Related papers (2021-02-09T18:00:53Z)