PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning
- URL: http://arxiv.org/abs/2207.02583v1
- Date: Wed, 6 Jul 2022 10:56:53 GMT
- Title: PIC 4th Challenge: Semantic-Assisted Multi-Feature Encoding and Multi-Head Decoding for Dense Video Captioning
- Authors: Yifan Lu, Ziqi Zhang, Yuxin Chen, Chunfeng Yuan, Bing Li, Weiming Hu
- Abstract summary: We present a semantic-assisted dense video captioning model based on the encoding-decoding framework.
Our method achieves significant improvements on the YouMakeup dataset under DVC evaluation metrics.
- Score: 46.69503728433432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of Dense Video Captioning (DVC) aims to generate captions with
timestamps for multiple events in one video. Semantic information plays an
important role for both localization and description of DVC. We present a
semantic-assisted dense video captioning model based on the encoding-decoding
framework. In the encoding stage, we design a concept detector to extract
semantic information, which is then fused with multi-modal visual features to
sufficiently represent the input video. In the decoding stage, we design a
classification head, paralleled with the localization and captioning heads, to
provide semantic supervision. Our method achieves significant improvements on
the YouMakeup dataset under DVC evaluation metrics and achieves high
performance in the Makeup Dense Video Captioning (MDVC) task of PIC 4th
Challenge.
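The abstract above outlines a concrete architecture: a concept detector whose predictions are fused with multi-modal visual features in the encoder, and three parallel decoding heads (localization, captioning, classification). A minimal PyTorch-style sketch of that structure follows; the module choices, fusion by addition, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SemanticAssistedDVC(nn.Module):
    """Illustrative sketch: concept-aware encoding with parallel decoding heads."""

    def __init__(self, vis_dim=1024, num_concepts=500, hidden=512, vocab=10000):
        super().__init__()
        # Encoding stage: a concept detector predicts semantic concepts per clip,
        # whose embeddings are fused with the multi-modal visual features.
        self.concept_detector = nn.Linear(vis_dim, num_concepts)
        self.concept_embed = nn.Linear(num_concepts, hidden)
        self.visual_proj = nn.Linear(vis_dim, hidden)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Decoding stage: three parallel heads share the encoded representation.
        self.loc_head = nn.Linear(hidden, 2)             # event start / end offsets
        self.cls_head = nn.Linear(hidden, num_concepts)  # semantic supervision
        self.cap_head = nn.Linear(hidden, vocab)         # per-step word logits

    def forward(self, visual_feats):                     # (B, T, vis_dim)
        concept_logits = self.concept_detector(visual_feats)
        fused = self.visual_proj(visual_feats) + self.concept_embed(
            torch.sigmoid(concept_logits)
        )
        enc = self.encoder(fused)                        # (B, T, hidden)
        return self.loc_head(enc), self.cls_head(enc), self.cap_head(enc)
```

In the setup the abstract describes, the classification head would be trained with a concept-level loss alongside the localization and captioning losses to provide semantic supervision.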
Related papers
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [73.62572976072578]
We propose AuroraCap, a video captioner based on a large multimodal model.
We implement a token merging strategy to reduce the number of input visual tokens.
AuroraCap shows superior performance on various video and image captioning benchmarks.
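A rough sketch of a token-merging step of the kind the AuroraCap summary mentions: repeatedly average the most similar pair of visual tokens until a target count is reached. This greedy variant, the keep ratio, and the tensor shapes are assumptions for illustration, not AuroraCap's actual strategy.

```python
import torch

def merge_similar_tokens(tokens, keep_ratio=0.5):
    """Greedy sketch: average the most similar pair of visual tokens
    until only keep_ratio of the original tokens remain."""
    n_target = max(1, int(tokens.size(0) * keep_ratio))
    tokens = tokens.clone()
    while tokens.size(0) > n_target:
        normed = torch.nn.functional.normalize(tokens, dim=-1)
        sim = normed @ normed.t()
        sim.fill_diagonal_(-1.0)                      # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.size(1))
        merged = (tokens[i] + tokens[j]) / 2          # average the closest pair
        keep = [k for k in range(tokens.size(0)) if k not in (i, j)]
        tokens = torch.cat([tokens[keep], merged.unsqueeze(0)], dim=0)
    return tokens

# e.g. reduce 256 patch tokens to 128 before feeding the language model
reduced = merge_similar_tokens(torch.randn(256, 768), keep_ratio=0.5)
```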
arXiv Detail & Related papers (2024-10-04T00:13:54Z)
- When Video Coding Meets Multimodal Large Language Models: A Unified Paradigm for Video Coding [112.44822009714461]
Cross-Modality Video Coding (CMVC) is a pioneering approach to explore multimodality representation and video generative models in video coding.
During decoding, previously encoded components and video generation models are leveraged to create multiple encoding-decoding modes.
Experiments indicate that TT2V achieves effective semantic reconstruction, while IT2V exhibits competitive perceptual consistency.
arXiv Detail & Related papers (2024-08-15T11:36:18Z)
- MAViC: Multimodal Active Learning for Video Captioning [8.454261564411436]
In this paper, we introduce MAViC to address the challenges of active learning approaches for video captioning.
Our approach integrates semantic similarity and uncertainty of both visual and language dimensions in the acquisition function.
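As an illustration of an acquisition function that combines caption uncertainty with semantic similarity, as the MAViC summary describes, here is a toy scoring function; the specific terms and weighting are assumptions, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def acquisition_score(candidate_emb, labeled_embs, token_logprobs, alpha=0.5):
    """Score an unlabeled video for annotation: high caption uncertainty
    plus low semantic similarity to what is already labeled."""
    # Uncertainty: negative mean log-probability of the generated caption tokens.
    uncertainty = -token_logprobs.mean()
    # Semantic redundancy: max cosine similarity to any already-labeled sample.
    sims = F.cosine_similarity(candidate_emb.unsqueeze(0), labeled_embs, dim=-1)
    redundancy = sims.max()
    return alpha * uncertainty + (1 - alpha) * (1 - redundancy)

score = acquisition_score(
    candidate_emb=torch.randn(512),
    labeled_embs=torch.randn(100, 512),
    token_logprobs=torch.log(torch.rand(20)),
)
```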
arXiv Detail & Related papers (2022-12-11T18:51:57Z)
- Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation [87.49579477873196]
We first design a two-stream encoder to extract CNN-based visual features and transformer-based linguistic features hierarchically.
A vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote the hierarchical and progressive fusion of multi-modal features.
In order to promote the temporal alignment between frames, we propose a language-guided multi-scale dynamic filtering (LMDF) module.
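A simplified sketch of mutual guidance between visual and linguistic features via cross-attention, one plausible reading of the VLMG module described above; the layer names, dimensions, and residual updates are assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class MutualGuidance(nn.Module):
    """Sketch: each modality attends to the other and is updated residually."""

    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.vis_from_lang = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.lang_from_vis = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, vis, lang):            # vis: (B, N, dim), lang: (B, L, dim)
        vis_upd, _ = self.vis_from_lang(vis, lang, lang)   # language guides vision
        lang_upd, _ = self.lang_from_vis(lang, vis, vis)   # vision guides language
        return vis + vis_upd, lang + lang_upd
```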
arXiv Detail & Related papers (2022-03-30T01:06:13Z)
- Variational Stacked Local Attention Networks for Diverse Video Captioning [2.492343817244558]
Variational Stacked Local Attention Network exploits low-rank bilinear pooling for self-attentive feature interaction.
We evaluate VSLAN on MSVD and MSR-VTT datasets in terms of syntax and diversity.
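Low-rank (factorized) bilinear pooling, the interaction mechanism the VSLAN summary refers to, can be sketched as two low-rank projections combined element-wise; the projection sizes and activations here are illustrative, not VSLAN's exact design.

```python
import torch
import torch.nn as nn

class LowRankBilinearPooling(nn.Module):
    """Factorized bilinear interaction: project both inputs to a low-rank
    space, multiply element-wise, then project to the output dimension."""

    def __init__(self, dim_x, dim_y, rank=64, dim_out=256):
        super().__init__()
        self.proj_x = nn.Linear(dim_x, rank)
        self.proj_y = nn.Linear(dim_y, rank)
        self.proj_out = nn.Linear(rank, dim_out)

    def forward(self, x, y):
        joint = torch.tanh(self.proj_x(x)) * torch.tanh(self.proj_y(y))
        return self.proj_out(joint)
```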
arXiv Detail & Related papers (2022-01-04T05:14:34Z)
- DVCFlow: Modeling Information Flow Towards Human-like Video Captioning [163.71539565491113]
Existing methods mainly generate captions from individual video segments, lacking adaptation to the global visual context.
We introduce the concept of information flow to model the progressive information changing across video sequence and captions.
Our method significantly outperforms competitive baselines and generates more human-like text according to subjective and objective tests.
arXiv Detail & Related papers (2021-11-19T10:46:45Z)
- Visual-aware Attention Dual-stream Decoder for Video Captioning [12.139806877591212]
The attention mechanism in current video captioning methods learns to assign a weight to each frame, dynamically guiding the decoder.
However, this may not explicitly model the correlation and temporal coherence of the visual features extracted from consecutive frames.
We propose a new Visual-aware Attention (VA) model, which unifies the changes across temporal sequence frames with the word generated at the previous time step.
The effectiveness of the proposed Visual-aware Attention Dual-stream Decoder (VADD) is demonstrated.
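One possible reading of the VA mechanism described above is attention weights conditioned jointly on frame-to-frame change and the previously generated word; a sketch of that interpretation follows, with all names and shapes assumed rather than taken from the paper.

```python
import torch
import torch.nn as nn

class VisualAwareAttention(nn.Module):
    """Sketch: attention over frames driven by temporal change and the previous word."""

    def __init__(self, vis_dim=1024, word_dim=300, hidden=512):
        super().__init__()
        self.change_proj = nn.Linear(vis_dim, hidden)
        self.word_proj = nn.Linear(word_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, frames, prev_word):    # frames: (B, T, vis_dim), prev_word: (B, word_dim)
        change = frames - torch.roll(frames, shifts=1, dims=1)   # frame-to-frame change
        change[:, 0] = 0                                          # no predecessor for frame 0
        e = torch.tanh(self.change_proj(change) + self.word_proj(prev_word).unsqueeze(1))
        attn = torch.softmax(self.score(e).squeeze(-1), dim=1)    # (B, T)
        context = (attn.unsqueeze(-1) * frames).sum(dim=1)        # attended visual context
        return context, attn
```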
arXiv Detail & Related papers (2021-10-16T14:08:20Z)
- End-to-End Dense Video Captioning with Parallel Decoding [53.34238344647624]
We propose a simple yet effective framework for end-to-end dense video captioning with parallel decoding (PDVC).
PDVC precisely segments the video into a number of event pieces under the holistic understanding of the video content.
Experiments on ActivityNet Captions and YouCook2 show that PDVC is capable of producing high-quality captioning results.
arXiv Detail & Related papers (2021-08-17T17:39:15Z)
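A condensed sketch of the parallel-decoding idea in PDVC: a fixed set of learnable event queries is decoded in parallel over frame features, and per-query localization and captioning heads run on each query. The head designs, the toy GRU captioner, and all hyper-parameters are placeholders rather than PDVC's actual configuration.

```python
import torch
import torch.nn as nn

class ParallelEventDecoder(nn.Module):
    """Sketch: N learnable event queries decoded in parallel over frame features."""

    def __init__(self, dim=512, num_queries=10, vocab=5000):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.loc_head = nn.Linear(dim, 2)      # normalized event center / length
        self.cls_head = nn.Linear(dim, 1)      # event confidence
        self.cap_rnn = nn.GRUCell(dim, dim)    # toy stand-in for a per-query captioner
        self.word_head = nn.Linear(dim, vocab)

    def forward(self, frame_feats, max_words=20):        # frame_feats: (B, T, dim)
        b = frame_feats.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        dec = self.decoder(q, frame_feats)               # (B, num_queries, dim)
        spans = self.loc_head(dec).sigmoid()             # one span per query
        conf = self.cls_head(dec).sigmoid()              # one confidence per query
        inp = dec.reshape(-1, dec.size(-1))              # (B * num_queries, dim)
        h, words = torch.zeros_like(inp), []
        for _ in range(max_words):                       # greedy word-by-word decoding
            h = self.cap_rnn(inp, h)
            words.append(self.word_head(h).argmax(-1))
        words = torch.stack(words, dim=-1).view(b, -1, max_words)
        return spans, conf, words
```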