End-to-End Dense Video Captioning with Parallel Decoding
- URL: http://arxiv.org/abs/2108.07781v1
- Date: Tue, 17 Aug 2021 17:39:15 GMT
- Title: End-to-End Dense Video Captioning with Parallel Decoding
- Authors: Teng Wang, Ruimao Zhang, Zhichao Lu, Feng Zheng, Ran Cheng, Ping Luo
- Abstract summary: We propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC).
PDVC precisely segments the video into a number of event pieces under a holistic understanding of the video content.
Experiments on ActivityNet Captions and YouCook2 show that PDVC produces high-quality captioning results.
- Score: 53.34238344647624
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dense video captioning aims to generate multiple associated captions with
their temporal locations from the video. Previous methods follow a
sophisticated "localize-then-describe" scheme, which heavily relies on numerous
hand-crafted components. In this paper, we propose a simple yet effective
framework for end-to-end dense video captioning with parallel decoding (PDVC),
by formulating the dense caption generation as a set prediction task. In
practice, by stacking a newly proposed event counter on top of a
transformer decoder, PDVC precisely segments the video into a number of
event pieces under a holistic understanding of the video content, which
effectively increases the coherence and readability of predicted captions.
Compared with prior art, PDVC has several appealing advantages: (1)
Without relying on heuristic non-maximum suppression or a recurrent event
sequence selection network to remove redundancy, PDVC directly produces an
event set of an appropriate size; (2) In contrast to the two-stage
scheme, we feed the enhanced representations of event queries into the
localization head and caption head in parallel, making these two sub-tasks
deeply interrelated and mutually promoted during optimization; (3) Without
bells and whistles, extensive experiments on ActivityNet Captions and YouCook2
show that PDVC is capable of producing high-quality captioning results,
surpassing the state-of-the-art two-stage methods when its localization
accuracy is on par with them. Code is available at
https://github.com/ttengwang/PDVC.
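To make the parallel-decoding idea above concrete, here is a minimal PyTorch sketch that refines a set of learnable event queries with a transformer decoder and feeds the resulting representations, in parallel, to a localization head, a caption head, and an event counter. The module layout, dimensions, and the single-step caption head are illustrative assumptions for this sketch, not the authors' implementation; see https://github.com/ttengwang/PDVC for the real model.

```python
# Minimal sketch of DETR-style parallel decoding for dense video captioning.
# Module names, dimensions, and the one-step caption head are illustrative
# assumptions, not the PDVC authors' implementation.
import torch
import torch.nn as nn


class ParallelDecodingSketch(nn.Module):
    def __init__(self, d_model=512, num_queries=100, vocab_size=10000, max_events=10):
        super().__init__()
        self.event_queries = nn.Embedding(num_queries, d_model)     # learnable event queries
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.loc_head = nn.Linear(d_model, 2)                       # (center, length) per query
        self.cap_head = nn.Linear(d_model, vocab_size)              # toy one-step caption head
        self.event_counter = nn.Linear(d_model, max_events + 1)     # predicts how many events to keep

    def forward(self, video_feats):
        # video_feats: (batch, num_frames, d_model) from a pre-extracted video backbone
        b = video_feats.size(0)
        queries = self.event_queries.weight.unsqueeze(0).expand(b, -1, -1)
        hs = self.decoder(queries, video_feats)                     # enhanced event queries
        segments = self.loc_head(hs).sigmoid()                      # (b, num_queries, 2), normalized
        word_logits = self.cap_head(hs)                             # (b, num_queries, vocab_size)
        num_events = self.event_counter(hs.max(dim=1).values)       # (b, max_events + 1)
        return segments, word_logits, num_events


feats = torch.randn(2, 64, 512)                                     # dummy clip features
segs, words, counts = ParallelDecodingSketch()(feats)
print(segs.shape, words.shape, counts.shape)
```

Because all event queries are decoded in parallel and trained as a set prediction problem, no non-maximum suppression is needed at inference; the event counter's output can instead be used to decide how many of the decoded queries to keep.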
Related papers
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [112.44822009714461]
Cross-Modality Video Coding (CMVC) is a pioneering approach that explores multimodal representation and video generative models in video coding.
During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.
Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- Enhancing Traffic Safety with Parallel Dense Video Captioning for End-to-End Event Analysis [5.4598424549754965]
This paper introduces our solution for Track 2 in AI City Challenge 2024.
The task aims to solve traffic safety description and analysis with the dataset of Woven Traffic Safety.
On the test set, our solution achieved 6th place in the competition.
arXiv Detail & Related papers (2024-04-12T04:08:21Z)
- Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment [10.567291051485194]
We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner.
Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
arXiv Detail & Related papers (2023-07-05T23:01:26Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning [46.69503728433432]
We present a semantic-assisted dense video captioning model based on the encoding-decoding framework.
Our method achieves significant improvements on the YouMakeup dataset in the challenge evaluation.
arXiv Detail & Related papers (2022-07-06T10:56:53Z)
- End-to-end Dense Video Captioning as Sequence Generation [83.90502354328679]
We show how to model the two subtasks of dense video captioning jointly as one sequence generation task.
Experiments on YouCook2 and ViTT show encouraging results and indicate the feasibility of training complex tasks integrated into large-scale pre-trained models.
arXiv Detail & Related papers (2022-04-18T01:30:54Z)
- Dense Video Captioning Using Unsupervised Semantic Information [2.022555840231001]
We introduce a method to learn unsupervised semantic visual information based on the premise that complex events can be decomposed into simpler events.
We split a long video into short frame sequences to extract their latent representation with three-dimensional convolutional neural networks.
We demonstrate how this representation can leverage the performance of the dense video captioning task in a scenario with only visual features.
arXiv Detail & Related papers (2021-12-15T20:03:42Z)
- Video Corpus Moment Retrieval with Contrastive Learning [56.249924768243375]
Video corpus moment retrieval (VCMR) is to retrieve a temporal moment that semantically corresponds to a given text query.
We propose a Retrieval and Localization Network with Contrastive Learning (ReLoCLNet) for VCMR.
Experimental results show that, although ReLoCLNet encodes text and video separately for efficiency, its retrieval accuracy is comparable with baselines adopting cross-modal interaction learning.
arXiv Detail & Related papers (2021-05-13T12:54:39Z)