Fine-Grained Video Captioning through Scene Graph Consolidation
- URL: http://arxiv.org/abs/2502.16427v1
- Date: Sun, 23 Feb 2025 03:59:05 GMT
- Title: Fine-Grained Video Captioning through Scene Graph Consolidation
- Authors: Sanghyeok Chu, Seonguk Seo, Bohyung Han
- Abstract summary: We propose a novel zero-shot video captioning approach that combines frame-level scene graphs from a video to obtain intermediate representations for caption generation. Our method first generates frame-level captions using an image VLM, converts them into scene graphs, and consolidates these graphs to produce comprehensive video-level descriptions.
- Score: 44.30028794237688
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in visual language models (VLMs) have significantly improved image captioning, but extending these gains to video understanding remains challenging due to the scarcity of fine-grained video captioning datasets. To bridge this gap, we propose a novel zero-shot video captioning approach that combines frame-level scene graphs from a video to obtain intermediate representations for caption generation. Our method first generates frame-level captions using an image VLM, converts them into scene graphs, and consolidates these graphs to produce comprehensive video-level descriptions. To achieve this, we leverage a lightweight graph-to-text model trained solely on text corpora, eliminating the need for video captioning annotations. Experiments on the MSR-VTT and ActivityNet Captions datasets show that our approach outperforms zero-shot video captioning baselines, demonstrating that aggregating frame-level scene graphs yields rich video understanding without requiring large-scale paired data or high inference cost.
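To make the pipeline concrete, below is a minimal, self-contained sketch of the consolidation and verbalization stage in Python. It assumes the frame-level triplets have already been produced by an image VLM captioner followed by a scene-graph parser; the frequency-based merge and the template verbalizer are illustrative stand-ins, not the paper's trained components.

```python
# Minimal sketch of scene graph consolidation for zero-shot video captioning.
# Frame-level triplets would come from an image VLM + scene-graph parser;
# here they are given directly. The merge rule and verbalizer are assumptions.
from collections import Counter
from typing import List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object)

def consolidate(frame_graphs: List[List[Triplet]]) -> List[Triplet]:
    """Merge frame-level scene graphs into one video-level graph,
    ordering triplets by how often they occur across frames."""
    counts = Counter(t for graph in frame_graphs for t in graph)
    return [t for t, _ in counts.most_common()]

def verbalize(graph: List[Triplet]) -> str:
    """Trivial template stand-in for the lightweight graph-to-text model,
    which in the paper is trained on text corpora only."""
    return ". ".join(f"A {s} {r} a {o}" for s, r, o in graph) + "."

if __name__ == "__main__":
    frame_graphs = [
        [("man", "holds", "guitar"), ("man", "sits on", "stool")],
        [("man", "plays", "guitar"), ("man", "sits on", "stool")],
        [("man", "plays", "guitar"), ("crowd", "watches", "man")],
    ]
    print(verbalize(consolidate(frame_graphs)))
    # -> "A man sits on a stool. A man plays a guitar. ..."
```

In the actual method, the verbalization step is handled by a graph-to-text model, so the template above only marks where that model would sit in the pipeline.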
Related papers
- The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning [89.64905703368255]
We propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning.
Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences.
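As a rough illustration only, a multi-granularity prompt could be assembled by querying each of the three banks; the retrieval mechanism, bank contents, and prompt format below are hypothetical and not the cited paper's implementation.

```python
# Hypothetical sketch: retrieve from three text memory banks and compose a
# multi-granularity prompt. String similarity is a crude stand-in for
# embedding-based retrieval.
from difflib import SequenceMatcher
from typing import List

memory_banks = {
    "noun_phrases": ["a man", "a guitar", "a wooden stool"],
    "scene_graphs": ["man - plays - guitar", "man - sits on - stool"],
    "sentences": ["A man sits on a stool and plays a guitar."],
}

def retrieve(query: str, bank: List[str], k: int = 2) -> List[str]:
    """Return the k bank entries most similar to the query."""
    score = lambda entry: SequenceMatcher(None, query, entry).ratio()
    return sorted(bank, key=score, reverse=True)[:k]

def build_prompt(query: str) -> str:
    """Concatenate retrieved items from each memory bank in turn."""
    parts = [", ".join(retrieve(query, memory_banks[name]))
             for name in ("noun_phrases", "scene_graphs", "sentences")]
    return "Objects: {}\nRelations: {}\nContext: {}".format(*parts)

print(build_prompt("a man with a guitar"))
```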
arXiv Detail & Related papers (2025-03-31T03:00:19Z)
- Progress-Aware Video Frame Captioning [55.23366888264651]
We propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. We develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. Results demonstrate that ProgressCaptioner significantly surpasses leading captioning models.
arXiv Detail & Related papers (2024-12-03T01:21:28Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
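A rough sketch of that protocol follows; the toy linear encoders, random features, and symmetric InfoNCE loss are assumptions standing in for the paper's actual models, and the pseudo-captions are taken to come from any off-the-shelf image captioner run on sampled frames.

```python
# Sketch of pseudo-labelled text-to-video retrieval training.
# Text features would encode image-captioner outputs on frames of each video;
# here both modalities are random placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_videos, text_dim, video_dim, embed_dim = 8, 128, 256, 64

text_feats = torch.randn(num_videos, text_dim)    # pseudo-caption features
video_feats = torch.randn(num_videos, video_dim)  # video features

text_proj = nn.Linear(text_dim, embed_dim)
video_proj = nn.Linear(video_dim, embed_dim)
optimizer = torch.optim.Adam(
    list(text_proj.parameters()) + list(video_proj.parameters()), lr=1e-3
)

for step in range(100):
    t = F.normalize(text_proj(text_feats), dim=-1)
    v = F.normalize(video_proj(video_feats), dim=-1)
    logits = t @ v.T / 0.07                 # text-video similarity matrix
    targets = torch.arange(num_videos)      # pseudo-pair i matches video i
    loss = (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2  # symmetric InfoNCE
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```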
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models [81.92098140232638]
Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks.
Existing methods struggle to generate scene graphs with novel visual relation concepts.
We introduce a new open-vocabulary SGG framework based on sequence generation.
arXiv Detail & Related papers (2024-04-01T04:21:01Z)
- GL-RG: Global-Local Representation Granularity for Video Captioning [52.56883051799501]
We propose GL-RG, a Global-Local Representation Granularity framework for video captioning.
Our GL-RG demonstrates three advantages over prior efforts: 1) we explicitly exploit extensive visual representations from different video ranges to improve linguistic expression; 2) we devise a novel global-local encoder to produce a rich semantic vocabulary and obtain a descriptive granularity of video contents across frames; and 3) we develop an incremental training strategy that organizes model learning in an incremental fashion to incur an optimal captioning behavior.
arXiv Detail & Related papers (2022-05-22T02:00:09Z)
- CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose CLIP4Caption, a framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of information from both vision and language, enforcing the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)
- SG2Caps: Revisiting Scene Graphs for Image Captioning [37.58310822924814]
We propose a framework, SG2Caps, that utilizes only scene graph labels for competitive image captioning performance.
Our framework outperforms existing scene-graph-only captioning models by a large margin (CIDEr score of 110 vs. 71), indicating that scene graphs are a promising representation for image captioning.
arXiv Detail & Related papers (2021-02-09T18:00:53Z)