METEOR Guided Divergence for Video Captioning
- URL: http://arxiv.org/abs/2212.10690v1
- Date: Tue, 20 Dec 2022 23:30:47 GMT
- Title: METEOR Guided Divergence for Video Captioning
- Authors: Daniel Lukas Rothenpieler and Shahin Amiriparian
- Abstract summary: We propose a reward-guided KL Divergence to train a video captioning model which is resilient towards token permutations.
We show the suitability of the HRL agent in the generation of content-complete and grammatically sound sentences by achieving $4.91$, $2.23$, and $10.80$ in BLEU3, BLEU4, and METEOR scores, respectively.
- Score: 4.601294270277376
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automatic video captioning aims for a holistic visual scene understanding. It
requires a mechanism for capturing temporal context in video frames and the
ability to comprehend the actions and associations of objects in a given
timeframe. Such a system should additionally learn to abstract video sequences
into sensible representations as well as to generate natural written language.
While the majority of captioning models focus solely on the visual inputs,
little attention has been paid to the audiovisual modality. To tackle this
issue, we propose a novel two-fold approach. First, we implement a
reward-guided KL Divergence to train a video captioning model which is
resilient towards token permutations. Second, we utilise a Bi-Modal
Hierarchical Reinforcement Learning (BMHRL) Transformer architecture to capture
long-term temporal dependencies of the input data as a foundation for our
hierarchical captioning module. Using our BMHRL, we show the suitability of the
HRL agent in the generation of content-complete and grammatically sound
sentences by achieving $4.91$, $2.23$, and $10.80$ in BLEU3, BLEU4, and METEOR
scores, respectively, on the ActivityNet Captions dataset. Finally, we make our
BMHRL framework and trained models publicly available for users and developers
at https://github.com/d-rothen/bmhrl.
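The abstract does not spell out the loss formulation, but the core idea of a reward-guided divergence (a sentence-level METEOR score modulating a token-level KL term, so that meaning-preserving token permutations are penalised less) can be sketched roughly as follows. This is a minimal sketch under those assumptions; function and argument names such as `reward_guided_kl_loss` and `meteor_reward` are illustrative and not taken from the released BMHRL code.

```python
# Hedged sketch of a reward-guided KL divergence loss: a sentence-level METEOR
# reward scales a token-level KL term between the model's predictions and a
# label-smoothed target distribution. The exact formulation in the paper may
# differ; this is an illustration, not the BMHRL implementation.
import torch
import torch.nn.functional as F
from nltk.translate.meteor_score import meteor_score


def meteor_reward(hypothesis_tokens, reference_tokens) -> float:
    """Sentence-level METEOR in [0, 1]; NLTK expects pre-tokenized input."""
    return meteor_score([reference_tokens], hypothesis_tokens)


def reward_guided_kl_loss(logits, target_ids, reward, pad_id=0, smoothing=0.1):
    """KL(target || model) per token, scaled by a sentence-level reward.

    logits:     (batch, seq_len, vocab) raw decoder outputs
    target_ids: (batch, seq_len) ground-truth token ids
    reward:     (batch,) METEOR scores of the sampled captions (detached)
    """
    vocab = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)

    # Label-smoothed one-hot target distribution.
    target = torch.full_like(log_probs, smoothing / (vocab - 1))
    target.scatter_(-1, target_ids.unsqueeze(-1), 1.0 - smoothing)

    # Per-token KL, summed over the vocabulary, masked over padding.
    kl = F.kl_div(log_probs, target, reduction="none").sum(-1)
    mask = (target_ids != pad_id).float()
    per_sentence = (kl * mask).sum(-1) / mask.sum(-1).clamp(min=1)

    # Down-weight the divergence for captions the reward already rates highly,
    # so meaning-preserving token permutations are penalised less.
    return ((1.0 - reward) * per_sentence).mean()
```

In use, the reward would be computed outside the graph, e.g. `reward = torch.tensor([meteor_reward(hyp, ref) for hyp, ref in zip(sampled_captions, references)])`, and passed in as a constant weighting factor.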
Related papers
- Self-Chained Image-Language Model for Video Localization and Question Answering [66.86740990630433]
We propose the Self-Chained Video Localization-Answering (SeViLA) framework to tackle both temporal localization and QA on videos.
SeViLA framework consists of two modules: Localizer and Answerer, where both are parameter-efficiently fine-tuned from BLIP-2.
arXiv Detail & Related papers (2023-05-11T17:23:00Z)
- Paraphrasing Is All You Need for Novel Object Captioning [126.66301869607656]
Novel object captioning (NOC) aims to describe images containing objects without observing their ground truth captions during training.
We present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC, which heuristically optimizes the output captions via paraphrasing.
arXiv Detail & Related papers (2022-09-25T22:56:04Z)
- Zero-Shot Video Captioning with Evolving Pseudo-Tokens [79.16706829968673]
We introduce a zero-shot video captioning method that employs two frozen networks: the GPT-2 language model and the CLIP image-text matching model.
The matching score is used to steer the language model toward generating a sentence that has a high average matching score to a subset of the video frames.
Our experiments show that the generated captions are coherent and display a broad range of real-world knowledge.
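The scoring mechanism described above (an average CLIP matching score between a candidate sentence and a subset of the video frames) can be illustrated with a short, hedged sketch. For simplicity the score is used here only to re-rank candidate captions rather than to steer generation as in the paper, and the checkpoint name is the standard public CLIP release, not one specified by the authors.

```python
# Minimal sketch: average CLIP matching score between candidate captions and a
# subset of video frames, used here only to re-rank candidates (illustration).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def average_clip_score(captions, frames):
    """Mean image-text similarity of each caption over the sampled frames.

    captions: list[str] candidate sentences from the language model
    frames:   list[PIL.Image.Image] sampled video frames
    """
    inputs = processor(text=captions, images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text has shape (num_captions, num_frames); average over frames.
    return outputs.logits_per_text.mean(dim=-1)


# Usage (paths are assumed placeholders):
# frames = [Image.open(p) for p in ["frame_000.jpg", "frame_016.jpg"]]
# scores = average_clip_score(["a man rides a horse", "a dog runs"], frames)
# best_caption_index = scores.argmax().item()
```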
arXiv Detail & Related papers (2022-07-22T14:19:31Z)
- Discriminative Latent Semantic Graph for Video Captioning [24.15455227330031]
Video captioning aims to automatically generate natural language sentences that describe the visual contents of a given video.
Our main contribution is to identify three key problems in a joint framework for future video summarization tasks.
arXiv Detail & Related papers (2021-08-08T15:11:20Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
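As a rough illustration of the gating idea mentioned above (gates that pass the more relevant information between the video and dense-caption streams), a generic fusion gate might look like the following; this module is an assumption for illustration, not the paper's actual design.

```python
# Generic sigmoid fusion gate between two feature streams (illustrative only).
import torch
import torch.nn as nn


class FusionGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, video_feat: torch.Tensor, caption_feat: torch.Tensor):
        # g in (0, 1) decides, per dimension, how much of each source to keep.
        g = torch.sigmoid(self.gate(torch.cat([video_feat, caption_feat], dim=-1)))
        return g * video_feat + (1.0 - g) * caption_feat
```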
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)