Semantic-Aware Pretraining for Dense Video Captioning
- URL: http://arxiv.org/abs/2204.07449v1
- Date: Wed, 13 Apr 2022 06:57:23 GMT
- Title: Semantic-Aware Pretraining for Dense Video Captioning
- Authors: Teng Wang, Zhu Liu, Feng Zheng, Zhichao Lu, Ran Cheng, Ping Luo
- Abstract summary: We present a semantic-aware pretraining method for dense video captioning, which empowers the learned features to recognize high-level semantic concepts.
Our final ensemble model achieves a 10.00 METEOR score on the test set.
- Score: 54.61034574151816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This report describes the details of our approach for the event
dense-captioning task in ActivityNet Challenge 2021. We present a
semantic-aware pretraining method for dense video captioning, which empowers
the learned features to recognize high-level semantic concepts. Diverse video
features of different modalities are fed into an event captioning module to
generate accurate and meaningful sentences. Our final ensemble model achieves a
10.00 METEOR score on the test set.
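The abstract names two ingredients without giving implementation details: pretraining the video features to recognize high-level semantic concepts, and feeding diverse multimodal features into an event captioning module. The snippet below is a minimal, hypothetical PyTorch sketch of that shape; the module names, dimensions, concept vocabulary, and the multi-label concept loss are assumptions for illustration, not the authors' released code.

```python
# Hypothetical sketch of semantic-aware pretraining + multimodal event captioning.
# Names, dimensions, and losses are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class SemanticAwareEncoder(nn.Module):
    """Fuses per-modality clip features and predicts high-level semantic concepts."""
    def __init__(self, modality_dims=(2048, 1024), hidden=512, num_concepts=300):
        super().__init__()
        # One linear projection per modality (e.g. appearance and motion features).
        self.proj = nn.ModuleList(nn.Linear(d, hidden) for d in modality_dims)
        self.fuse = nn.Linear(hidden * len(modality_dims), hidden)
        # Auxiliary multi-label head: the "semantic-aware" pretraining signal.
        self.concept_head = nn.Linear(hidden, num_concepts)

    def forward(self, feats):                      # feats: list of (B, T, D_m) tensors
        h = torch.cat([p(f) for p, f in zip(self.proj, feats)], dim=-1)
        h = torch.relu(self.fuse(h))               # (B, T, hidden) fused video features
        concept_logits = self.concept_head(h.mean(dim=1))  # clip-level concept scores
        return h, concept_logits

# Pretraining step: multi-label concept classification on clip-level labels.
encoder = SemanticAwareEncoder()
appearance = torch.randn(2, 16, 2048)              # dummy (batch, time, dim) features
motion = torch.randn(2, 16, 1024)
concepts = torch.randint(0, 2, (2, 300)).float()   # dummy multi-hot concept labels
_, logits = encoder([appearance, motion])
pretrain_loss = nn.BCEWithLogitsLoss()(logits, concepts)
pretrain_loss.backward()
# After pretraining, the fused features `h` would be passed to an event captioning
# decoder (e.g. a Transformer) to generate one sentence per detected event.
```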
Related papers
- Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer [79.20605034378187]
Video-language pre-trained models have shown remarkable success in guiding video question-answering tasks.
Due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones.
This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains.
arXiv Detail & Related papers (2023-08-16T15:00:50Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
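The entry above contrasts coarse video-level alignment with fine-grained segment-level alignment but does not spell out its two pretext tasks. As a generic illustration only (not the paper's actual objectives), the sketch below scores segment-sentence alignment with a symmetric InfoNCE-style contrastive loss; function names and shapes are assumptions.

```python
# Generic segment-sentence contrastive alignment (illustrative only; the paper's
# dual pretext tasks are not reproduced here).
import torch
import torch.nn.functional as F

def segment_alignment_loss(seg_feats, sent_feats, temperature=0.07):
    """InfoNCE over matched (segment_i, sentence_i) pairs within a batch.

    seg_feats:  (N, D) pooled features of N video segments
    sent_feats: (N, D) embeddings of the N sentences describing those segments
    """
    seg = F.normalize(seg_feats, dim=-1)
    sent = F.normalize(sent_feats, dim=-1)
    logits = seg @ sent.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(seg.size(0))            # i-th segment matches i-th sentence
    # Symmetric loss: segment-to-sentence and sentence-to-segment retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = segment_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```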
- MAViC: Multimodal Active Learning for Video Captioning [8.454261564411436]
In this paper, we introduce MAViC to address the challenges of active learning approaches for video captioning.
Our approach integrates semantic similarity and uncertainty of both visual and language dimensions in the acquisition function.
arXiv Detail & Related papers (2022-12-11T18:51:57Z)
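The MAViC summary says its acquisition function combines semantic similarity and uncertainty on both the visual and language sides, but gives no formula. The sketch below is a hypothetical scoring rule in that spirit: reward high caption entropy (language uncertainty) and penalize redundancy with already-labeled videos (visual similarity). The weights and function names are assumptions, not the paper's exact rule.

```python
# Hypothetical active-learning acquisition score in the spirit of the MAViC summary.
import torch
import torch.nn.functional as F

def acquisition_score(token_logits, video_emb, labeled_video_embs, alpha=1.0, beta=1.0):
    """token_logits: (T, V) decoder logits for the candidate video's generated caption
    video_emb: (D,) visual embedding of the candidate video
    labeled_video_embs: (M, D) embeddings of videos already in the labeled pool
    """
    # Language-side uncertainty: mean per-token entropy of the caption distribution.
    probs = token_logits.softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1).mean()
    # Visual-side redundancy: max cosine similarity to the labeled pool.
    sims = F.cosine_similarity(video_emb.unsqueeze(0), labeled_video_embs, dim=-1)
    redundancy = sims.max()
    return alpha * entropy - beta * redundancy     # higher = more worth annotating

score = acquisition_score(torch.randn(12, 1000), torch.randn(256), torch.randn(50, 256))
```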
- Leveraging the Video-level Semantic Consistency of Event for Audio-visual Event Localization [8.530561069113716]
We propose a novel video-level semantic consistency guidance network for the AVE localization task.
It consists of two components: a cross-modal event representation extractor and an intra-modal semantic consistency enhancer.
We perform extensive experiments on the public AVE dataset and outperform the state-of-the-art methods in both fully- and weakly-supervised settings.
arXiv Detail & Related papers (2022-10-11T08:15:57Z)
- End-to-end Dense Video Captioning as Sequence Generation [83.90502354328679]
We show how to model the two subtasks of dense video captioning jointly as one sequence generation task.
Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of training complex tasks integrated into large-scale pre-trained models.
arXiv Detail & Related papers (2022-04-18T01:30:54Z)
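The entry above frames event localization and captioning as a single sequence generation problem. A common way to realize this (a generic sketch under that assumption, not necessarily the paper's exact target format) is to discretize event start/end times into special time tokens and interleave them with caption text, so one seq2seq model emits both.

```python
# Illustrative serialization of dense-captioning targets into one token sequence.
# The exact target format used by the paper may differ; this is a generic sketch.
def serialize_events(events, duration, num_time_bins=100):
    """events: list of (start_sec, end_sec, caption) for one untrimmed video."""
    def time_token(t):
        bin_id = min(int(t / duration * num_time_bins), num_time_bins - 1)
        return f"<time_{bin_id}>"
    parts = []
    for start, end, caption in sorted(events):
        parts.append(f"{time_token(start)} {time_token(end)} {caption}")
    return " <event_sep> ".join(parts)

target = serialize_events(
    [(0.0, 12.5, "a man whisks eggs in a bowl"),
     (14.0, 30.0, "he pours the mixture into a pan")],
    duration=45.0,
)
# -> "<time_0> <time_27> a man whisks eggs in a bowl <event_sep> <time_31> <time_66> ..."
# A single encoder-decoder model is then trained to generate this string, and events
# are recovered at inference time by parsing the time tokens back into timestamps.
```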
- Hierarchical Modular Network for Video Captioning [162.70349114104107]
We propose a hierarchical modular network to bridge video representations and linguistic semantics from three levels before generating captions.
The proposed method performs favorably against state-of-the-art models on two widely used benchmarks, scoring 104.0% CIDEr on MSVD and 51.5% on MSR-VTT.
arXiv Detail & Related papers (2021-11-24T13:07:05Z)
- Towards Diverse Paragraph Captioning for Untrimmed Videos [40.205433926432434]
Existing approaches mainly solve the problem in two steps: event detection and then event captioning.
We propose a paragraph captioning model which eschews the problematic event detection stage and directly generates paragraphs for untrimmed videos.
arXiv Detail & Related papers (2021-05-30T09:28:43Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.