Progress-Aware Video Frame Captioning
- URL: http://arxiv.org/abs/2412.02071v2
- Date: Wed, 26 Mar 2025 02:26:56 GMT
- Title: Progress-Aware Video Frame Captioning
- Authors: Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman
- Abstract summary: We propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. We develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. Results demonstrate that ProgressCaptioner significantly surpasses leading captioning models.
- Score: 55.23366888264651
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression of actions throughout a video sequence. Despite the strong capabilities of existing leading vision language models, they often struggle to discern the nuances of frame-wise differences. To address this, we propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. Alongside, we develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and set a new standard for temporal precision in video captioning. Finally, we showcase practical applications of our approach, specifically in aiding keyframe selection and advancing video understanding, highlighting its broad utility.
Related papers
- The Devil is in the Distributions: Explicit Modeling of Scene Content is Key in Zero-Shot Video Captioning [89.64905703368255]
We propose a novel progressive multi-granularity textual prompting strategy for zero-shot video captioning.
Our approach constructs three distinct memory banks, encompassing noun phrases, scene graphs of noun phrases, and entire sentences.
arXiv Detail & Related papers (2025-03-31T03:00:19Z)
- Fine-Grained Video Captioning through Scene Graph Consolidation [44.30028794237688]
We propose a novel zero-shot video captioning approach that combines frame-level scene graphs from a video to obtain intermediate representations for caption generation.
Our method first generates frame-level captions using an image VLM, converts them into scene graphs, and consolidates these graphs to produce comprehensive video-level descriptions.
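One minimal way to realize the consolidation step is to pool each frame's scene graph, represented as (subject, relation, object) triplets, and keep the triplets that recur across frames. This is a generic sketch of the idea, not the paper's actual algorithm; the function name, the triplet representation, and the `min_frames` threshold are assumptions for illustration.

```python
from collections import Counter

def consolidate_scene_graphs(frame_graphs, min_frames=2):
    """Merge per-frame scene graphs (lists of (subject, relation, object)
    triplets) into one video-level graph, keeping triplets observed
    in at least `min_frames` frames."""
    counts = Counter()
    for triplets in frame_graphs:
        counts.update(set(triplets))  # count each triplet once per frame
    return [t for t, c in counts.items() if c >= min_frames]

frames = [
    [("person", "holds", "cup"), ("cup", "on", "table")],
    [("person", "holds", "cup"), ("person", "near", "window")],
    [("person", "drinks_from", "cup")],
]
video_graph = consolidate_scene_graphs(frames)
print(video_graph)  # [('person', 'holds', 'cup')]
```

The recurrence filter suppresses spurious single-frame detections, leaving relations that are stable across the clip for the video-level description.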
arXiv Detail & Related papers (2025-02-23T03:59:05Z)
- Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning.
Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system.
Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
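The classifier-guidance idea can be sketched as a reranking variant: mix the frozen LM's score for each candidate caption with a text classifier's score for the target attribute (e.g. audibility). The paper's guidance may act during decoding rather than as post-hoc reranking; the scorer stand-ins and the mixing weight `alpha` below are assumptions for illustration only.

```python
def classifier_guided_rerank(candidates, lm_score, clf_score, alpha=0.5):
    """Pick the candidate caption maximizing a mix of the frozen LM's
    score and a text classifier's attribute score."""
    scored = [
        (alpha * lm_score(c) + (1 - alpha) * clf_score(c), c)
        for c in candidates
    ]
    return max(scored)[1]

# Toy stand-ins for the two scorers (assumptions for illustration):
lm = lambda c: -len(c) / 100                   # prefer short, fluent captions
clf = lambda c: 1.0 if "sound" in c else 0.0   # prefer audible descriptions

best = classifier_guided_rerank(
    ["a dog in a park", "the sound of a dog barking"], lm, clf
)
print(best)  # the sound of a dog barking
```

Because the LM stays frozen, the same captioner can be steered toward different target semantics simply by swapping the classifier.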
arXiv Detail & Related papers (2025-01-03T18:09:26Z)
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [73.62572976072578]
We propose AuroraCap, a video captioner based on a large multimodal model.
We implement the token merging strategy, reducing the number of input visual tokens.
AuroraCap shows superior performance on various video and image captioning benchmarks.
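The token-merging idea can be sketched as greedily averaging the most cosine-similar pairs of visual tokens until the budget is met. This is a generic illustration in the spirit of token-merging methods, not AuroraCap's actual implementation; the greedy pairing rule and simple averaging are assumptions.

```python
import numpy as np

def merge_most_similar(tokens: np.ndarray, r: int) -> np.ndarray:
    """Reduce the token count by r: repeatedly average the two most
    cosine-similar tokens into one (a greedy token-merging sketch)."""
    tokens = tokens.copy()
    for _ in range(r):
        n = len(tokens)
        normed = tokens / np.linalg.norm(tokens, axis=1, keepdims=True)
        sim = normed @ normed.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        merged = (tokens[i] + tokens[j]) / 2
        keep = [k for k in range(n) if k not in (i, j)]
        tokens = np.vstack([tokens[keep], merged[None, :]])
    return tokens

# 10 visual tokens of dimension 8, merged down to 6
visual_tokens = np.random.default_rng(0).normal(size=(10, 8))
reduced = merge_most_similar(visual_tokens, r=4)
print(reduced.shape)  # (6, 8)
```

Shrinking the visual token sequence this way cuts the multimodal model's input length, which is where the efficiency gain comes from.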
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
- AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning [24.608569008975497]
We propose AVCap, an Audio-Visual Captioning framework.
AVCap utilizes audio-visual features as text tokens.
Our method outperforms existing audio-visual captioning methods across all metrics.
arXiv Detail & Related papers (2024-07-10T16:17:49Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
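The pseudo-labeling step can be sketched as follows: caption a sampled frame of each unlabeled video with an image captioner, then treat each (caption, video) pair as a positive for retrieval training. The frame-sampling rule and helper names below are assumptions for illustration, not the paper's protocol.

```python
def build_pseudo_pairs(videos, sample_frame, image_captioner):
    """Create (caption, video) training pairs for text-to-video retrieval
    by captioning one sampled frame per video with an image captioner."""
    return [(image_captioner(sample_frame(v)), v) for v in videos]

# Toy stand-ins (assumptions): videos are lists of frame ids,
# and the captioner maps a frame id to a string.
videos = [["f0", "f1", "f2"], ["g0", "g1"]]
pairs = build_pseudo_pairs(
    videos,
    sample_frame=lambda v: v[len(v) // 2],      # take the middle frame
    image_captioner=lambda f: f"caption of {f}",
)
print(pairs)
```

The resulting pairs can feed a standard contrastive retrieval objective, so no video-level text labels are ever required.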
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA.
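A two-level contrastive objective of this kind can be sketched as the weighted sum of an InfoNCE term over clip-level pairs and another over video-level pairs. The exact loss form, temperature, and weighting in HierVL may differ; this is a minimal sketch under those assumptions.

```python
import numpy as np

def info_nce(text_emb, vis_emb, tau=0.07):
    """InfoNCE over matched rows of text and visual embeddings."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    logits = t @ v.T / tau
    diag = np.arange(len(t))
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[diag, diag].mean()  # pull matched pairs together

def hierarchical_loss(clip_txt, clip_vis, video_txt, video_vis, w=0.5):
    """Sum of clip-level (short-term) and video-level (long-term) terms."""
    return w * info_nce(clip_txt, clip_vis) + (1 - w) * info_nce(video_txt, video_vis)

rng = np.random.default_rng(0)
loss = hierarchical_loss(rng.normal(size=(8, 16)), rng.normal(size=(8, 16)),
                         rng.normal(size=(2, 16)), rng.normal(size=(2, 16)))
print(loss > 0)  # True
```

Training both terms jointly is what lets one backbone serve short-term clip retrieval and long-term video representation at once.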
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- An Integrated Approach for Video Captioning and Applications [2.064612766965483]
We design hybrid deep learning architectures for captioning long videos.
We argue that linking images, videos, and natural language offers many practical benefits and immediate practical applications.
arXiv Detail & Related papers (2022-01-23T01:06:00Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
- SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)
- Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers [54.705393237822044]
This paper proposes a novel approach to optimize each caption's output timing based on a trade-off between latency and caption quality.
An audio-visual Transformer is trained to generate ground-truth captions using only a small portion of all video frames.
A CNN-based timing detector is also trained to detect a proper output timing, where the captions generated by the two Transformers become sufficiently close to each other.
arXiv Detail & Related papers (2021-08-04T16:20:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.