Delving Deeper into the Decoder for Video Captioning
- URL: http://arxiv.org/abs/2001.05614v3
- Date: Sat, 15 Feb 2020 01:31:21 GMT
- Title: Delving Deeper into the Decoder for Video Captioning
- Authors: Haoran Chen, Jianmin Li and Xiaolin Hu
- Abstract summary: Video captioning is an advanced multi-modal task which aims to describe a video clip using a natural language sentence.
We make a thorough investigation into the decoder and adopt three techniques to improve the performance of the model.
Experiments on the Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets demonstrate the effectiveness of these techniques.
- Score: 23.202746094988715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning is an advanced multi-modal task which aims to describe a
video clip using a natural language sentence. The encoder-decoder framework is
the most popular paradigm for this task in recent years. However, there exist
some problems in the decoder of a video captioning model. We make a thorough
investigation into the decoder and adopt three techniques to improve the
performance of the model. First of all, a combination of variational dropout
and layer normalization is embedded into a recurrent unit to alleviate the
problem of overfitting. Secondly, a new online method is proposed to evaluate
the performance of a model on a validation set so as to select the best
checkpoint for testing. Finally, a new training strategy called professional
learning is proposed which uses the strengths of a captioning model and
bypasses its weaknesses. Experiments on the Microsoft Research Video
Description Corpus (MSVD) and MSR-Video to Text (MSR-VTT) datasets demonstrate
that our model achieves the best results on the BLEU, CIDEr, METEOR, and
ROUGE-L metrics, with significant gains of up to 18% on MSVD and 3.5% on
MSR-VTT over the previous state-of-the-art models.
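The abstract describes the first technique only at a high level, so the following is a minimal sketch, in PyTorch, of what a recurrent unit combining variational dropout and layer normalization could look like. The class names (`VariationalDropout`, `LayerNormLSTMCell`), the choice of an LSTM cell, and the dropout rate are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class VariationalDropout(nn.Module):
    """Dropout whose mask is sampled once per sequence and reused at every
    time step (a common formulation of variational dropout)."""

    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p
        self.mask = None

    def reset_mask(self, template: torch.Tensor):
        # Sample one Bernoulli mask per sequence; keep it fixed across time steps.
        keep = 1.0 - self.p
        self.mask = template.new_empty(template.shape).bernoulli_(keep) / keep

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        return x * self.mask


class LayerNormLSTMCell(nn.Module):
    """LSTM cell with layer normalization on the gate pre-activations and
    variational dropout on the input and hidden state (illustrative only)."""

    def __init__(self, input_size: int, hidden_size: int, dropout: float = 0.5):
        super().__init__()
        self.hidden_size = hidden_size
        self.ih = nn.Linear(input_size, 4 * hidden_size)
        self.hh = nn.Linear(hidden_size, 4 * hidden_size)
        self.ln_gates = nn.LayerNorm(4 * hidden_size)
        self.ln_cell = nn.LayerNorm(hidden_size)
        self.drop_x = VariationalDropout(dropout)
        self.drop_h = VariationalDropout(dropout)

    def forward(self, x_seq: torch.Tensor, state=None):
        # x_seq: (seq_len, batch, input_size), e.g. embedded caption tokens.
        seq_len, batch, _ = x_seq.shape
        if state is None:
            h = x_seq.new_zeros(batch, self.hidden_size)
            c = x_seq.new_zeros(batch, self.hidden_size)
        else:
            h, c = state
        # One dropout mask per sequence, shared over all decoding steps.
        self.drop_x.reset_mask(x_seq[0])
        self.drop_h.reset_mask(h)
        outputs = []
        for t in range(seq_len):
            x = self.drop_x(x_seq[t])
            gates = self.ln_gates(self.ih(x) + self.hh(self.drop_h(h)))
            i, f, g, o = gates.chunk(4, dim=-1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(self.ln_cell(c))
            outputs.append(h)
        return torch.stack(outputs), (h, c)
```

The key point of the sketch is that the dropout mask is sampled once per sequence and reused at every decoding step, while layer normalization is applied to the gate pre-activations and the cell state, which is one plausible way to combine the two regularizers described in the abstract.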
Related papers
- LD-DETR: Loop Decoder DEtection TRansformer for Video Moment Retrieval and Highlight Detection [8.24662649122549]
Video Moment Retrieval and Highlight Detection aim to find corresponding content in the video based on a text query.
Existing models usually first use contrastive learning methods to align video and text features, then fuse and extract multimodal information, and finally use a Transformer Decoder to decode multimodal information.
We propose the LD-DETR model for Video Moment Retrieval and Highlight Detection tasks.
arXiv Detail & Related papers (2025-01-18T14:54:56Z)
- Video Decomposition Prior: A Methodology to Decompose Videos into Layers [74.36790196133505]
This paper introduces a novel video decomposition prior (VDP) framework that derives inspiration from professional video editing practices.
The VDP framework decomposes a video sequence into a set of multiple RGB layers and associated opacity levels.
We address tasks such as video object segmentation, dehazing, and relighting.
arXiv Detail & Related papers (2024-12-06T10:35:45Z)
- Video Anomaly Detection and Explanation via Large Language Models [34.52845566893497]
Video Anomaly Detection (VAD) aims to localize abnormal events on the timeline of long-range surveillance videos.
In this paper, we conduct pioneering research on equipping the VAD framework with video-based large language models (VLLMs).
We introduce a novel network module, Long-Term Context (LTC), to mitigate the limitations of VLLMs in long-range context modeling.
arXiv Detail & Related papers (2024-01-11T07:09:44Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- Clover: Towards A Unified Video-Language Alignment and Fusion Model [154.1070559563592]
We introduce Clover, a Correlated Video-Language pre-training method.
It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task.
Clover establishes new state-of-the-art results on multiple downstream tasks.
arXiv Detail & Related papers (2022-07-16T09:38:52Z)
- MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to select those with the most useful prior knowledge.
arXiv Detail & Related papers (2022-03-14T13:15:09Z)
- CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language, forcing the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z)
- TACo: Token-aware Cascade Contrastive Learning for Video-Text Alignment [68.08689660963468]
A new algorithm called Token-Aware Cascade contrastive learning (TACo) improves contrastive learning using two novel techniques.
We set a new state of the art on three public text-video retrieval benchmarks: YouCook2, MSR-VTT, and ActivityNet.
arXiv Detail & Related papers (2021-08-23T07:24:57Z)