Towards Diverse Paragraph Captioning for Untrimmed Videos
- URL: http://arxiv.org/abs/2105.14477v1
- Date: Sun, 30 May 2021 09:28:43 GMT
- Title: Towards Diverse Paragraph Captioning for Untrimmed Videos
- Authors: Yuqing Song, Shizhe Chen, Qin Jin
- Abstract summary: Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
- Score: 40.205433926432434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video paragraph captioning aims to describe multiple events in untrimmed
videos with descriptive paragraphs. Existing approaches mainly solve the
problem in two steps: event detection and then event captioning. Such a
two-step approach makes the quality of the generated paragraphs highly
dependent on the accuracy of event proposal detection, which is itself a
challenging task. In
this paper, we propose a paragraph captioning model which eschews the
problematic event detection stage and directly generates paragraphs for
untrimmed videos. To describe coherent and diverse events, we propose to
enhance the conventional temporal attention with dynamic video memories, which
progressively exposes new video features and suppresses over-accessed video
contents to control the visual focus of the model. In addition, a
diversity-driven training strategy is proposed to improve the diversity of
the generated paragraphs from the language perspective. Considering that untrimmed videos
generally contain massive but redundant frames, we further augment the video
encoder with keyframe awareness to improve efficiency. Experimental results on
the ActivityNet and Charades datasets show that our proposed model
significantly outperforms the state of the art on both accuracy and
diversity metrics without using any event boundary annotations. Code will be
released at https://github.com/syuqings/video-paragraph.
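The abstract describes the dynamic video memory only at a high level: new video features are progressively exposed while over-accessed content is suppressed to steer the model's visual focus. The sketch below is a minimal, hypothetical rendering of one such attention step in PyTorch; the linear exposure schedule, the usage penalty, and all names (memory_attention, expose_ratio_min, penalty) are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch of temporal attention with a dynamic video memory.
# The exposure schedule and usage penalty are illustrative assumptions,
# not the released code of the paper.
import torch
import torch.nn.functional as F


def memory_attention(query, video_feats, usage, step, num_steps,
                     penalty=1.0, expose_ratio_min=0.3):
    """One decoding step of attention over an untrimmed video.

    query:       (B, D)    current decoder hidden state
    video_feats: (B, T, D) frame/segment features
    usage:       (B, T)    accumulated attention mass per segment
    """
    B, T, D = video_feats.shape

    # Progressively expose features: early sentences see only the first part
    # of the video, later sentences see more of it.
    frac = expose_ratio_min + (1 - expose_ratio_min) * step / num_steps
    visible = max(int(T * frac), 1)
    mask = torch.zeros(B, T, dtype=torch.bool, device=video_feats.device)
    mask[:, :visible] = True

    # Scaled dot-product scores, minus a penalty for segments that were
    # already attended to a lot (suppresses over-accessed content).
    scores = torch.einsum("bd,btd->bt", query, video_feats) / D ** 0.5
    scores = scores - penalty * usage
    scores = scores.masked_fill(~mask, float("-inf"))

    attn = F.softmax(scores, dim=-1)                 # (B, T)
    context = torch.einsum("bt,btd->bd", attn, video_feats)
    usage = usage + attn                             # update the memory of visual focus
    return context, usage
```

Calling memory_attention once per decoding step and carrying usage across sentences is one plausible way to keep later sentences focused on events that have not yet been described.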
Related papers
- Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval [9.899703354116962]
Dense video captioning aims to automatically localize and caption all events within an untrimmed video.
We propose a novel framework inspired by the cognitive information processing of humans.
Our model utilizes external memory to incorporate prior knowledge.
arXiv Detail & Related papers (2024-04-11T09:58:23Z) - Shot2Story20K: A New Benchmark for Comprehensive Understanding of Multi-shot Videos [58.13927287437394]
We present a new multi-shot video understanding benchmark Shot2Story20K with detailed shot-level captions and comprehensive video summaries.
Preliminary experiments show that generating long and comprehensive video summaries remains challenging.
arXiv Detail & Related papers (2023-12-16T03:17:30Z) - SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' capacity to distinguish event-level discrepancies.
We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events.
Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
arXiv Detail & Related papers (2023-11-21T18:43:07Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events (a minimal matching sketch is given after this list).
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z) - Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment [10.567291051485194]
We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner.
Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
arXiv Detail & Related papers (2023-07-05T23:01:26Z) - Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z) - End-to-end Dense Video Captioning as Sequence Generation [83.90502354328679]
We show how to model the two subtasks of dense video captioning jointly as one sequence generation task.
Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of training complex tasks integrated into large-scale pre-trained models.
arXiv Detail & Related papers (2022-04-18T01:30:54Z) - Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z) - Open-book Video Captioning with Retrieve-Copy-Generate Network [42.374461018847114]
In this paper, we convert the traditional video captioning task into a new paradigm, i.e., Open-book Video Captioning.
We propose a novel Retrieve-Copy-Generate network, where a pluggable video-to-text retriever is constructed to retrieve sentences as hints from the training corpus effectively.
Our framework coordinates the conventional retrieval-based methods with orthodox encoder-decoder methods, which can not only draw on the diverse expressions in the retrieved sentences but also generate natural and accurate descriptions of the video content.
arXiv Detail & Related papers (2021-03-09T08:17:17Z)
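The VidCoM entry above mentions an efficient Hungarian matching between decomposed linguistic instructions and video events (the InsOVER step). The sketch below shows a generic version of such a matching under stated assumptions: clause and event embeddings are precomputed, cosine distance is used as the assignment cost, and match_clauses_to_events along with every other name is hypothetical rather than taken from the paper.

```python
# Illustrative sketch of matching instruction clauses to video events with the
# Hungarian algorithm, in the spirit of VidCoM's InsOVER step. The embeddings,
# the cosine cost, and all names are assumptions, not the paper's code.
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_clauses_to_events(clause_embs: np.ndarray, event_embs: np.ndarray):
    """clause_embs: (N, D) embeddings of decomposed instruction clauses.
    event_embs:   (M, D) embeddings of detected video events.
    Returns (clause_idx, event_idx) pairs minimizing the total matching cost."""
    # Cosine distance as the assignment cost.
    a = clause_embs / np.linalg.norm(clause_embs, axis=1, keepdims=True)
    b = event_embs / np.linalg.norm(event_embs, axis=1, keepdims=True)
    cost = 1.0 - a @ b.T                       # (N, M)

    rows, cols = linear_sum_assignment(cost)   # optimal one-to-one assignment
    return list(zip(rows.tolist(), cols.tolist()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clauses = rng.normal(size=(3, 16))   # e.g. three instruction clauses
    events = rng.normal(size=(5, 16))    # e.g. five candidate video events
    print(match_clauses_to_events(clauses, events))
```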