Video captioning with stacked attention and semantic hard pull
- URL: http://arxiv.org/abs/2009.07335v3
- Date: Fri, 16 Jul 2021 18:06:58 GMT
- Title: Video captioning with stacked attention and semantic hard pull
- Authors: Md. Mushfiqur Rahman, Thasin Abedin, Khondokar S. S. Prottoy, Ayana Moshruba, Fazlul Hasan Siddiqui
- Abstract summary: The task of generating a semantically accurate description of a video is quite complex.
This paper proposes a novel architecture, namely Semantically Sensible Video Captioning (SSVC).
The paper reports that the proposed novelties improve the performance of state-of-the-art architectures.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video captioning, i.e. the task of generating captions from video sequences,
creates a bridge between the Natural Language Processing and Computer Vision
domains of computer science. The task of generating a semantically accurate
description of a video is quite complex. Considering the complexity of the
problem, the results obtained in recent research works are praiseworthy.
However, there is plenty of scope for further investigation. This paper
addresses this scope and proposes a novel solution. Most video captioning
models comprise two sequential/recurrent layers - one as a video-to-context
encoder and the other as a context-to-caption decoder. This paper proposes a
novel architecture, namely Semantically Sensible Video Captioning (SSVC) which
modifies the context generation mechanism by using two novel approaches -
"stacked attention" and "spatial hard pull". As there are no exclusive metrics
for evaluating video captioning models, we emphasize both quantitative and
qualitative analysis of our model. Hence, we have used the BLEU scoring metric
for quantitative analysis and have proposed a human evaluation metric for
qualitative analysis, namely the Semantic Sensibility (SS) scoring metric. SS
Score overcomes the shortcomings of common automated scoring metrics. This
paper reports that the use of the aforementioned novelties improves the
performance of state-of-the-art architectures.
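To make the two-layer recipe described above concrete, below is a minimal sketch of a video-to-context encoder followed by a context-to-caption decoder, with two attention passes applied in sequence as a stand-in for the stacked-attention idea. Everything in it (GRU layers, additive attention, pre-extracted frame features, module names and dimensions) is an illustrative assumption rather than the SSVC implementation, and the hard-pull mechanism is omitted because the abstract does not specify it.

```python
# Hypothetical sketch of the generic video-captioning pattern the abstract describes:
# a video-to-context encoder, a context-to-caption decoder, and two attention layers
# applied in sequence ("stacked") over the encoder states. All design choices here
# (GRUs, additive attention, dimensions) are illustrative, not taken from SSVC.
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, query, keys):
        # query: (B, dim), keys: (B, T, dim) -> weighted context: (B, dim)
        q = query.unsqueeze(1).expand_as(keys)
        weights = torch.softmax(self.score(torch.cat([q, keys], dim=-1)).squeeze(-1), dim=1)
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)


class CaptioningModel(nn.Module):
    def __init__(self, feat_dim=2048, hidden=512, vocab_size=10000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)  # video-to-context encoder
        self.attn1 = AdditiveAttention(hidden)                     # first attention pass
        self.attn2 = AdditiveAttention(hidden)                     # stacked second pass
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRUCell(2 * hidden, hidden)              # context-to-caption decoder
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # frame_feats: (B, T, feat_dim) pre-extracted CNN features; captions: (B, L) token ids
        enc_states, h = self.encoder(frame_feats)
        h = h.squeeze(0)
        logits = []
        for t in range(captions.size(1)):
            ctx = self.attn1(h, enc_states)    # attend over frame contexts
            ctx = self.attn2(ctx, enc_states)  # re-attend using the refined context
            h = self.decoder(torch.cat([self.embed(captions[:, t]), ctx], dim=-1), h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)      # (B, L, vocab_size)
```

With frame features of shape (batch, num_frames, 2048) and teacher-forced caption tokens, the forward pass returns per-step vocabulary logits that can be trained with cross-entropy; BLEU (e.g. via nltk's sentence_bleu) would then be computed on the decoded captions, in line with the quantitative evaluation described in the abstract.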
Related papers
- AuroraCap: Efficient, Performant Video Detailed Captioning and a New Benchmark [73.62572976072578]
We propose AuroraCap, a video captioner based on a large multimodal model.
We implement the token merging strategy, reducing the number of input visual tokens.
AuroraCap shows superior performance on various video and image captioning benchmarks.
arXiv Detail & Related papers (2024-10-04T00:13:54Z) - BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues [47.213906345208315]
We propose BRIDGE, a new learnable and reference-free image captioning metric.
Our proposal achieves state-of-the-art results compared to existing reference-free evaluation scores.
arXiv Detail & Related papers (2024-07-29T18:00:17Z) - RTQ: Rethinking Video-language Understanding Based on Image-text Model [55.278942477715084]
Video-language understanding presents unique challenges due to the inclusion of highly complex semantic details.
We propose a novel framework called RTQ, which addresses these challenges simultaneously.
Our model demonstrates outstanding performance even in the absence of video-language pre-training.
arXiv Detail & Related papers (2023-12-01T04:51:01Z) - Self-Chained Image-Language Model for Video Localization and Question Answering [66.86740990630433]
We propose the Self-Chained Video Localization-Answering (SeViLA) framework to tackle both temporal localization and QA on videos.
The SeViLA framework consists of two modules, Localizer and Answerer, both parameter-efficiently fine-tuned from BLIP-2.
arXiv Detail & Related papers (2023-05-11T17:23:00Z) - Discriminative Latent Semantic Graph for Video Captioning [24.15455227330031]
Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video.
Our main contribution is to identify three key problems in a joint framework for future video summarization tasks.
arXiv Detail & Related papers (2021-08-08T15:11:20Z) - Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder [52.42057181754076]
Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning.
We develop three progressive model structures to learn the sentence level representations.
Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics.
arXiv Detail & Related papers (2021-06-29T12:27:05Z) - Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework [19.031957183047048]
We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests.
We show that both a multimodal model and a strong language model have a large gap with human performance.
arXiv Detail & Related papers (2021-04-09T04:00:10Z) - Intrinsic Image Captioning Evaluation [53.51379676690971]
We propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation (I2CE).
Experiment results show that our proposed method maintains robust performance and gives more flexible scores to candidate captions when encountering semantically similar expressions or less-aligned semantics.
arXiv Detail & Related papers (2020-12-14T08:36:05Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration of different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.