End-to-end Dense Video Captioning as Sequence Generation
- URL: http://arxiv.org/abs/2204.08121v1
- Date: Mon, 18 Apr 2022 01:30:54 GMT
- Title: End-to-end Dense Video Captioning as Sequence Generation
- Authors: Wanrong Zhu, Bo Pang, Ashish Thapliyal, William Yang Wang, Radu
Soricut
- Abstract summary: We show how to model the two subtasks of dense video captioning jointly as one sequence generation task.
Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of integrating complex tasks into large-scale pre-trained models.
- Score: 83.90502354328679
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dense video captioning aims to identify the events of interest in an input
video, and generate descriptive captions for each event. Previous approaches
usually follow a two-stage generative process, which first proposes a segment
for each event, then renders a caption for each identified segment. Recent
advances in large-scale sequence generation pretraining have seen great success
in unifying task formulation for a great variety of tasks, but so far, more
complex tasks such as dense video captioning are not able to fully utilize this
powerful paradigm. In this work, we show how to model the two subtasks of dense
video captioning jointly as one sequence generation task, and simultaneously
predict the events and the corresponding descriptions. Experiments on YouCook2
and ViTT show encouraging results and indicate the feasibility of integrating
complex tasks such as end-to-end dense video captioning into
large-scale pre-trained models.
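To make the single-sequence formulation concrete, here is a minimal sketch of one plausible way to serialize event segments and captions into a single target sequence using discretized time tokens. The token names, bin count, and quantization scheme are illustrative assumptions, not necessarily the exact representation used in the paper.

```python
# Hypothetical serialization for dense video captioning as one sequence:
# each event becomes "<time_start> <time_end> caption", and events are
# concatenated in temporal order so a single seq2seq decoder can predict
# segments and captions jointly.

def quantize(t: float, duration: float, num_bins: int = 100) -> int:
    """Map an absolute timestamp (seconds) to a discrete time-token index."""
    return min(num_bins - 1, int(num_bins * t / duration))

def serialize_events(events, duration, num_bins=100):
    """events: list of (start_sec, end_sec, caption) tuples."""
    parts = []
    for start, end, caption in sorted(events, key=lambda e: e[0]):
        parts.append(f"<time_{quantize(start, duration, num_bins)}>")
        parts.append(f"<time_{quantize(end, duration, num_bins)}>")
        parts.append(caption.strip())
    return " ".join(parts)

# Example: a 120-second cooking clip with two annotated events.
events = [(5.0, 30.0, "crack two eggs into a bowl"),
          (32.0, 80.0, "whisk the eggs with salt and pepper")]
print(serialize_events(events, duration=120.0))
# <time_4> <time_25> crack two eggs into a bowl <time_26> <time_66> whisk ...
```

At inference time, the generated string would be parsed back into (start, end, caption) triples by inverting the same scheme.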
Related papers
- Grounding Partially-Defined Events in Multimodal Data [61.0063273919745]
We introduce a multimodal formulation for partially-defined events and cast the extraction of these events as a three-stage span retrieval task.
We propose a benchmark for this task, MultiVENT-G, that consists of 14.5 hours of densely annotated current event videos and 1,168 text documents, containing 22.8K labeled event-centric entities.
Results illustrate the challenges that abstract event understanding poses and demonstrate promise in event-centric video-language systems.
arXiv Detail & Related papers (2024-10-07T17:59:48Z) - Unifying Event Detection and Captioning as Sequence Generation via
Pre-Training [53.613265415703815]
We propose a unified pre-training and fine-tuning framework to enhance the inter-task association between event detection and captioning.
Our model outperforms the state-of-the-art methods, and can be further boosted when pre-trained on extra large-scale video-text data.
arXiv Detail & Related papers (2022-07-18T14:18:13Z) - Semantic-Aware Pretraining for Dense Video Captioning [54.61034574151816]
We present a semantic-aware pretraining method for dense video captioning, which empowers the learned features to recognize high-level semantic concepts.
Our final ensemble model achieves a 10.00 METEOR score on the test set.
arXiv Detail & Related papers (2022-04-13T06:57:23Z) - End-to-End Dense Video Captioning with Parallel Decoding [53.34238344647624]
We propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC).
PDVC precisely segments the video into a number of event pieces under the holistic understanding of the video content.
Experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results.
arXiv Detail & Related papers (2021-08-17T17:39:15Z) - Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z) - Team RUC_AIM3 Technical Report at Activitynet 2020 Task 2: Exploring
Sequential Events Detection for Dense Video Captioning [63.91369308085091]
We propose a novel and simple model for event sequence generation and explore temporal relationships of the event sequence in the video.
The proposed model omits inefficient two-stage proposal generation and directly generates event boundaries conditioned on bi-directional temporal dependency in one pass.
The overall system achieves state-of-the-art performance on the dense-captioning events in video task with 9.894 METEOR score on the challenge testing set.
arXiv Detail & Related papers (2020-06-14T13:21:37Z) - Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]
We present a way to learn a compact multimodal feature representation that encodes all these modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on challenging tasks, including (i) inferring the temporal ordering of a set of videos and (ii) action recognition.
arXiv Detail & Related papers (2020-04-05T14:02:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.