Progress-Aware Video Frame Captioning
- URL: http://arxiv.org/abs/2412.02071v1
- Date: Tue, 03 Dec 2024 01:21:28 GMT
- Title: Progress-Aware Video Frame Captioning
- Authors: Zihui Xue, Joungbin An, Xitong Yang, Kristen Grauman
- Abstract summary: We propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence.
We develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality.
Results demonstrate that ProgressCaptioner significantly surpasses leading captioning models.
- Score: 55.23366888264651
- Abstract: While image captioning provides isolated descriptions for individual images, and video captioning offers one single narrative for an entire video clip, our work explores an important middle ground: progress-aware video captioning at the frame level. This novel task aims to generate temporally fine-grained captions that not only accurately describe each frame but also capture the subtle progression of actions throughout a video sequence. Despite the strong capabilities of existing leading vision language models, they often struggle to discern the nuances of frame-wise differences. To address this, we propose ProgressCaptioner, a captioning model designed to capture the fine-grained temporal dynamics within an action sequence. Alongside, we develop the FrameCap dataset to support training and the FrameCapEval benchmark to assess caption quality. The results demonstrate that ProgressCaptioner significantly surpasses leading captioning models, producing precise captions that accurately capture action progression and set a new standard for temporal precision in video captioning. Finally, we showcase practical applications of our approach, specifically in aiding keyframe selection and advancing video understanding, highlighting its broad utility.
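To make the task format concrete, here is a minimal Python sketch of frame-level, progress-aware captioning with a generic vision-language captioner; the function and its signature are illustrative assumptions, not ProgressCaptioner's actual interface.

```python
from typing import Callable, List, TypeVar

Frame = TypeVar("Frame")  # whatever image/frame type the captioner accepts

def caption_frames_progressively(
    frames: List[Frame],
    captioner: Callable[[List[Frame]], str],
) -> List[str]:
    """Produce one caption per frame, conditioning on all frames seen so far
    so that each caption can describe how the action has progressed."""
    captions = []
    for t in range(len(frames)):
        # The model sees frames[0..t]; the caption should describe frame t
        # relative to what came before (e.g., "the arm is now fully extended").
        captions.append(captioner(frames[: t + 1]))
    return captions
```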
Related papers
- Classifier-Guided Captioning Across Modalities [69.75111271002137]
We introduce a method to adapt captioning networks to the semantics of alternative settings, such as capturing audibility in audio captioning.
Our framework consists of two main components: (i) a frozen captioning system incorporating a language model (LM), and (ii) a text classifier that guides the captioning system.
Notably, when combined with an existing zero-shot audio captioning system, our framework improves its quality and sets state-of-the-art performance in zero-shot audio captioning.
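As a rough illustration of the guided-decoding idea, the sketch below re-ranks candidate captions from a frozen captioner with a text classifier; the combination rule and names are assumptions, and the paper's guidance mechanism may operate at a finer (e.g., token-level) granularity.

```python
from typing import Callable, List, Tuple

def classifier_guided_caption(
    candidates: List[Tuple[str, float]],        # (caption, log-prob) from a frozen captioner
    classifier_score: Callable[[str], float],   # how well the caption fits the target setting
    guidance_weight: float = 1.0,
) -> str:
    """Pick the candidate that balances captioner likelihood with classifier guidance.
    Re-ranking is one simple way to combine the two components, used here only as
    an illustrative stand-in for the paper's guidance mechanism."""
    def combined(item: Tuple[str, float]) -> float:
        caption, log_prob = item
        return log_prob + guidance_weight * classifier_score(caption)
    return max(candidates, key=combined)[0]
```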
arXiv Detail & Related papers (2025-01-03T18:09:26Z)
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [89.73538448786405]
We propose AuroraCap, a video captioner based on a large multimodal model.
We implement a token merging strategy to reduce the number of input visual tokens.
AuroraCap shows superior performance on various video and image captioning benchmarks.
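The sketch below illustrates one simple, ToMe-style way to merge similar visual tokens by averaging the most similar pairs; it is an assumed simplification, not AuroraCap's exact merging procedure.

```python
import torch

def merge_similar_tokens(tokens: torch.Tensor, num_merges: int) -> torch.Tensor:
    """Greedily merge the most similar pairs of visual tokens by averaging them.
    tokens: (N, D) visual token embeddings; returns roughly (N - num_merges, D)."""
    for _ in range(num_merges):
        if tokens.shape[0] < 2:
            break
        normed = torch.nn.functional.normalize(tokens, dim=-1)
        sim = normed @ normed.T
        sim.fill_diagonal_(-float("inf"))          # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.shape[1])
        merged = (tokens[i] + tokens[j]) / 2       # average the most similar pair
        keep = [k for k in range(tokens.shape[0]) if k not in (i, j)]
        tokens = torch.cat([tokens[keep], merged.unsqueeze(0)], dim=0)
    return tokens
```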
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
- AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning [24.608569008975497]
We propose AVCap, an Audio-Visual Captioning framework.
AVCap utilizes audio-visual features as text tokens.
Our method outperforms existing audio-visual captioning methods across all metrics.
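A minimal sketch of the "audio-visual features as text tokens" idea, assuming linear projections into the language model's embedding space; the layer choices and names are illustrative, not AVCap's actual architecture.

```python
import torch
import torch.nn as nn

class AudioVisualPrefix(nn.Module):
    """Project audio and visual features into the language model's embedding space
    and prepend them to the text token embeddings."""
    def __init__(self, audio_dim: int, visual_dim: int, lm_dim: int):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, lm_dim)
        self.visual_proj = nn.Linear(visual_dim, lm_dim)

    def forward(self, audio_feats, visual_feats, text_embeds):
        # audio_feats: (B, Ta, audio_dim), visual_feats: (B, Tv, visual_dim),
        # text_embeds: (B, Tt, lm_dim) -> output: (B, Ta + Tv + Tt, lm_dim)
        audio_tokens = self.audio_proj(audio_feats)
        visual_tokens = self.visual_proj(visual_feats)
        return torch.cat([audio_tokens, visual_tokens, text_embeds], dim=1)
```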
arXiv Detail & Related papers (2024-07-10T16:17:49Z)
- Learning text-to-video retrieval from image captioning [59.81537951811595]
We describe a protocol to study text-to-video retrieval training with unlabeled videos.
We assume (i) no access to labels for any videos, and (ii) access to labeled images in the form of text.
We show that automatically labeling video frames with image captioning allows text-to-video retrieval training.
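The protocol can be sketched as a pseudo-labeling loop: caption a few frames of each unlabeled video with an off-the-shelf image captioner and use the resulting pairs as retrieval training data. The sampling details below are assumptions, not the paper's exact recipe.

```python
from typing import Callable, List, Tuple, TypeVar

Frame = TypeVar("Frame")  # whatever image/frame type the captioner accepts

def build_pseudo_pairs(
    videos: List[List[Frame]],                 # each video is a list of frames
    image_captioner: Callable[[Frame], str],   # any off-the-shelf image captioning model
    frames_per_video: int = 4,
) -> List[Tuple[str, int]]:
    """Create (pseudo-caption, video index) pairs for text-to-video retrieval training
    by captioning a few evenly spaced frames per unlabeled video."""
    pairs = []
    for vid_idx, frames in enumerate(videos):
        step = max(1, len(frames) // frames_per_video)
        for frame in frames[::step][:frames_per_video]:
            pairs.append((image_captioner(frame), vid_idx))
    return pairs
```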
arXiv Detail & Related papers (2024-04-26T15:56:08Z)
- An Integrated Approach for Video Captioning and Applications [2.064612766965483]
We design hybrid deep learning architectures for captioning long videos.
We argue that linking images, videos, and natural language offers many practical benefits and immediate practical applications.
arXiv Detail & Related papers (2022-01-23T01:06:00Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
- Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers [54.705393237822044]
This paper proposes a novel approach to optimize each caption's output timing based on a trade-off between latency and caption quality.
An audio-visual Transformer is trained to generate ground-truth captions using only a small portion of all video frames.
A CNN-based timing detector is also trained to detect a proper output timing, where the captions generated by the two Transformers become sufficiently close to each other.
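A simplified sketch of the output-timing decision: emit a caption once the caption decoded from the frames seen so far is sufficiently close to the caption decoded from the full input. The paper trains a CNN-based detector for this; the threshold rule and function names below are stand-in assumptions.

```python
from typing import Callable

def find_output_timing(
    partial_caption_at: Callable[[int], str],   # caption decoded from the first t frames
    full_caption: str,                          # caption decoded from all frames
    similarity: Callable[[str, str], float],    # e.g., sentence-embedding cosine similarity
    num_frames: int,
    threshold: float = 0.9,
) -> int:
    """Return the earliest frame count at which the partial-input caption is
    sufficiently close to the full-input caption."""
    for t in range(1, num_frames + 1):
        if similarity(partial_caption_at(t), full_caption) >= threshold:
            return t
    return num_frames
```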
arXiv Detail & Related papers (2021-08-04T16:20:00Z)