SGCap: Decoding Semantic Group for Zero-shot Video Captioning
- URL: http://arxiv.org/abs/2508.01270v1
- Date: Sat, 02 Aug 2025 09:05:45 GMT
- Title: SGCap: Decoding Semantic Group for Zero-shot Video Captioning
- Authors: Zeyu Pan, Ping Li, Wenxiao Wang
- Abstract summary: Zero-shot video captioning aims to generate sentences for describing videos without training the model on video-text pairs. We propose a Semantic Group Captioning (SGCap) method for zero-shot video captioning.
- Score: 14.484825416367338
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Zero-shot video captioning aims to generate sentences for describing videos without training the model on video-text pairs, which remains underexplored. Existing zero-shot image captioning methods typically adopt a text-only training paradigm, where a language decoder reconstructs single-sentence embeddings obtained from CLIP. However, directly extending them to the video domain is suboptimal, as applying average pooling over all frames neglects temporal dynamics. To address this challenge, we propose a Semantic Group Captioning (SGCap) method for zero-shot video captioning. In particular, it develops the Semantic Group Decoding (SGD) strategy to employ multi-frame information while explicitly modeling inter-frame temporal relationships. Furthermore, existing zero-shot captioning methods that rely on cosine similarity for sentence retrieval and reconstruct the description supervised by a single frame-level caption fail to provide sufficient video-level supervision. To alleviate this, we introduce two key components: the Key Sentences Selection (KSS) module and the Probability Sampling Supervision (PSS) module. The two modules construct semantically diverse sentence groups that model temporal dynamics and guide the model to capture inter-sentence causal relationships, thereby enhancing its generalization ability to video captioning. Experimental results on several benchmarks demonstrate that SGCap significantly outperforms previous state-of-the-art zero-shot alternatives and even achieves performance competitive with fully supervised ones. Code is available at https://github.com/mlvccn/SGCap_Video.
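The abstract describes KSS and PSS only at a high level. As a rough, hypothetical sketch of the two ideas it names (selecting a semantically diverse group of CLIP sentence embeddings, then sampling the supervision sentence from that group), the snippet below shows one possible reading. The function names, the relevance-plus-diversity scoring, and the softmax sampling are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the group-construction ideas in the abstract.
# Assumes unit-normalized CLIP sentence embeddings; not the authors' code.
import numpy as np

def select_key_sentences(query_emb, corpus_embs, k=5):
    """KSS-style selection: start from the most similar sentence, then greedily
    add sentences that are relevant to the query but diverse w.r.t. the group."""
    sims = corpus_embs @ query_emb                          # cosine similarity (unit norm)
    group = [int(np.argmax(sims))]
    for _ in range(k - 1):
        diversity = 1.0 - np.max(corpus_embs @ corpus_embs[group].T, axis=1)
        score = sims + diversity                            # trade off relevance vs. diversity
        score[group] = -np.inf                              # never pick a sentence twice
        group.append(int(np.argmax(score)))
    return group

def sample_supervision(query_emb, group_embs, temperature=0.1, rng=None):
    """PSS-style sampling: draw the supervision sentence from a softmax over
    similarities instead of always taking the single best match."""
    rng = rng if rng is not None else np.random.default_rng()
    logits = (group_embs @ query_emb) / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return int(rng.choice(len(group_embs), p=probs))

# Toy usage with random unit vectors standing in for CLIP embeddings.
rng = np.random.default_rng(0)
corpus = rng.normal(size=(100, 512))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = rng.normal(size=512)
query /= np.linalg.norm(query)
group = select_key_sentences(query, corpus, k=5)
target_idx = group[sample_supervision(query, corpus[group])]
```

In this reading, a text-only language decoder would be trained to reconstruct captions from such sentence groups rather than from a single best-matching sentence, which is what the abstract credits with providing video-level supervision.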
Related papers
- The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning [89.64905703368255]
We propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning. Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences.
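As a hedged illustration of the multi-granularity retrieval summarized above, the sketch below pulls the top matches from three memory banks (noun phrases, scene-graph strings, full sentences) and concatenates them into a prompt. The bank layout and prompt template are placeholders, not the paper's code.

```python
# Minimal multi-granularity retrieval sketch; embeddings assumed unit-normalized.
import numpy as np

def retrieve(query_emb, bank_embs, bank_texts, top_k=3):
    sims = bank_embs @ query_emb
    idx = np.argsort(-sims)[:top_k]
    return [bank_texts[i] for i in idx]

def build_prompt(query_emb, banks):
    # banks: dict mapping granularity name -> (embeddings, texts),
    # e.g. "noun_phrases", "scene_graphs", "sentences" (names are illustrative).
    parts = []
    for name, (embs, texts) in banks.items():
        parts.append(f"{name}: " + "; ".join(retrieve(query_emb, embs, texts)))
    return "\n".join(parts)
```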
arXiv Detail & Related papers (2025-03-31T03:00:19Z) - Weakly Supervised Video Scene Graph Generation via Natural Language Supervision [27.97296273461145]
Existing Video Scene Graph Generation (VidSGG) studies are trained in a fully supervised manner. We propose a Natural Language-based Video Scene Graph Generation framework that only utilizes the readily available video captions.
arXiv Detail & Related papers (2025-02-21T10:42:04Z) - Progress-Aware Video Frame Captioning [55.23366888264651]
We propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. We develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. Results demonstrate that ProgressCaptioner significantly surpasses leading captioning models.
arXiv Detail & Related papers (2024-12-03T01:21:28Z) - DRCap: Decoding CLAP Latents with Retrieval-Augmented Generation for Zero-shot Audio Captioning [13.601154787754046]
DRCap is a data-efficient and flexible zero-shot audio captioning system. It requires text-only data for training and can quickly adapt to new domains without additional fine-tuning.
arXiv Detail & Related papers (2024-10-12T10:21:00Z) - Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
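A minimal sketch of this protocol, with `read_frames` and `image_captioner` left as placeholders for whatever frame decoder and frozen image captioner one plugs in:

```python
# Automatically caption sampled frames of unlabeled videos and use the captions
# as pseudo text labels for text-to-video retrieval training.
from typing import Callable, List, Tuple

def pseudo_label_videos(
    video_paths: List[str],
    read_frames: Callable[[str, int], list],    # returns n frames per video (placeholder)
    image_captioner: Callable[[object], str],   # frame -> caption string (placeholder)
    frames_per_video: int = 8,
) -> List[Tuple[str, str]]:
    pairs = []
    for path in video_paths:
        for frame in read_frames(path, frames_per_video):
            pairs.append((image_captioner(frame), path))  # (pseudo caption, video) pair
    return pairs
```

The resulting (caption, video) pairs can then stand in for human-annotated captions in a standard contrastive text-to-video retrieval objective.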
arXiv Detail & Related papers (2024-04-26T15:56:08Z) - Zero-Shot Video Captioning with Evolving Pseudo-Tokens [79.16706829968673]
We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model.
The matching score is used to steer the language model toward generating a sentence that has a high average matching score to a subset of the video frames.
Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge.
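One simplified, hedged reading of this guided decoding scheme is sketched below: at each step the frozen language model proposes top-k continuations, and a CLIP-style matching score averaged over the best-matching subset of frames re-ranks them. The `lm_topk` and `clip_score` callables and the subset heuristic are placeholders; this re-ranking is an illustration, not the paper's evolving pseudo-token optimization.

```python
# Conceptual sketch of CLIP-guided decoding with a frozen language model.
import numpy as np

def guided_decode(lm_topk, clip_score, frame_embs, prompt, steps=20, k=10, alpha=1.0):
    text = prompt
    for _ in range(steps):
        candidates, lm_logprobs = lm_topk(text, k)          # k candidate next tokens
        scores = []
        for tok, lp in zip(candidates, lm_logprobs):
            sims = [clip_score(text + tok, f) for f in frame_embs]
            # average matching score over the best-matching subset of frames
            subset = sorted(sims, reverse=True)[: max(1, len(sims) // 2)]
            scores.append(lp + alpha * float(np.mean(subset)))
        text += candidates[int(np.argmax(scores))]
    return text
```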
arXiv Detail & Related papers (2022-07-22T14:19:31Z) - Controllable Video Captioning with an Exemplar Sentence [89.78812365216983]
We propose a novel Syntax Modulated Caption Generator (SMCG) incorporated in an encoder-decoder-reconstructor architecture.
SMCG takes video semantic representation as an input, and conditionally modulates the gates and cells of long short-term memory network.
We conduct experiments by collecting auxiliary exemplar sentences for two public video captioning datasets.
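As an illustrative (not faithful) sketch of "conditionally modulating the gates and cells" of an LSTM with an exemplar-derived syntax embedding, one could scale the gate pre-activations as below; the per-gate sigmoid scaling is an assumption made for clarity, not the SMCG implementation.

```python
# Standard LSTM cell whose gate pre-activations are scaled by factors predicted
# from a syntax embedding of the exemplar sentence.
import torch
import torch.nn as nn

class ModulatedLSTMCell(nn.Module):
    def __init__(self, input_dim, hidden_dim, syntax_dim):
        super().__init__()
        self.x2h = nn.Linear(input_dim, 4 * hidden_dim)
        self.h2h = nn.Linear(hidden_dim, 4 * hidden_dim)
        self.mod = nn.Linear(syntax_dim, 4 * hidden_dim)   # one scale per gate unit

    def forward(self, x, state, syntax_emb):
        h, c = state
        pre = self.x2h(x) + self.h2h(h)
        pre = pre * torch.sigmoid(self.mod(syntax_emb))    # syntax-conditioned modulation
        i, f, g, o = pre.chunk(4, dim=-1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c
```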
arXiv Detail & Related papers (2021-12-02T09:24:45Z) - Syntax Customized Video Captioning by Imitating Exemplar Sentences [90.98221715705435]
We introduce a new task of Syntax Customized Video Captioning (SCVC).
SCVC aims to generate one caption which not only semantically describes the video contents but also syntactically imitates the given exemplar sentence.
We demonstrate our model capability to generate syntax-varied and semantics-coherent video captions.
arXiv Detail & Related papers (2021-12-02T09:08:09Z) - Semantic Grouping Network for Video Captioning [11.777063873936598]
The SGN learns an algorithm to capture the most discriminating word phrases of the partially decoded caption.
The continuous feedback from decoded words enables the SGN to dynamically update the video representation that adapts to the partially decoded caption.
The SGN achieves state-of-the-art performance, outperforming the runner-up methods by margins of 2.1%p and 2.4%p in CIDEr-D score on the MSVD and MSR-VTT datasets, respectively.
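A hypothetical sketch of the feedback loop described above: phrase embeddings from the partially decoded caption attend over frame features, so the pooled video representation is recomputed as decoding progresses. The single-head dot-product attention and the shapes are assumptions, not the SGN architecture.

```python
# Recompute the video representation conditioned on the caption decoded so far.
import torch
import torch.nn.functional as F

def grouped_video_representation(phrase_embs, frame_feats):
    # phrase_embs: (num_phrases, d) from the partially decoded caption
    # frame_feats: (num_frames, d)
    attn = F.softmax(phrase_embs @ frame_feats.T / phrase_embs.shape[-1] ** 0.5, dim=-1)
    groups = attn @ frame_feats          # one frame group per phrase: (num_phrases, d)
    return groups.mean(dim=0)            # video representation adapted to the caption
```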
arXiv Detail & Related papers (2021-02-01T13:40:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.