Boosting Video Captioning with Dynamic Loss Network
- URL: http://arxiv.org/abs/2107.11707v1
- Date: Sun, 25 Jul 2021 01:32:02 GMT
- Title: Boosting Video Captioning with Dynamic Loss Network
- Authors: Nasibullah, Partha Pratim Mohanta
- Abstract summary: This paper addresses this drawback by introducing a dynamic loss network (DLN) that provides an additional feedback signal directly reflecting the evaluation metrics.
Our results on Microsoft Research Video Description Corpus (MSVD) and MSR-Video to Text (MSRVTT) datasets outperform previous methods.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video captioning is one of the challenging problems at the intersection of
vision and language, with many real-life applications in video retrieval, video
surveillance, assistance for visually impaired people, human-machine interfaces,
and more. Recent deep learning-based methods have shown promising results but
still lag behind other vision tasks (such as image classification and object
detection). A significant drawback of existing video captioning methods is that
they are optimized with the cross-entropy loss function, which is not correlated
with the de facto evaluation metrics (BLEU, METEOR, CIDEr, ROUGE). In other
words, cross-entropy is not a proper surrogate for the true loss function of
video captioning. This paper addresses this drawback by introducing a dynamic
loss network (DLN), which provides an additional feedback signal that directly
reflects the evaluation metrics. Our results on the Microsoft Research Video
Description Corpus (MSVD) and MSR-Video to Text (MSRVTT) datasets outperform
previous methods.
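
The abstract does not detail how the DLN is built or trained, so the following is a minimal sketch of one plausible setup, not the authors' exact design: a small GRU-based scorer is assumed to regress a sentence-level metric (e.g., CIDEr) from the captioner's soft token distributions and the reference caption, and its output is added to cross-entropy as a differentiable, metric-aware term. All class names, dimensions, and the weight lambda_metric are illustrative assumptions.

```python
# Illustrative sketch only: one way a dynamic loss network (DLN) could supply a
# metric-aware training signal alongside cross-entropy. The architecture, sizes,
# and weighting below are assumptions for demonstration, not the paper's design.
import torch
import torch.nn as nn

class DynamicLossNetwork(nn.Module):
    """Scores a predicted caption against its reference with a learned,
    differentiable proxy for a sentence-level metric (e.g., CIDEr)."""

    def __init__(self, vocab_size: int, embed_dim: int = 256, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Linear(vocab_size, embed_dim)   # accepts soft token distributions
        self.encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.score_head = nn.Linear(2 * hidden_dim, 1)  # [pred_state; ref_state] -> score

    def forward(self, pred_probs: torch.Tensor, ref_onehot: torch.Tensor) -> torch.Tensor:
        # pred_probs: (B, T, V) softmax outputs of the captioner (kept differentiable)
        # ref_onehot: (B, T, V) one-hot ground-truth captions
        _, h_pred = self.encoder(self.embed(pred_probs))
        _, h_ref = self.encoder(self.embed(ref_onehot))
        return self.score_head(torch.cat([h_pred[-1], h_ref[-1]], dim=-1)).squeeze(-1)

def captioning_loss(logits, targets, dln, lambda_metric=0.5):
    """Cross-entropy plus a DLN term that rewards higher predicted metric scores."""
    B, T, V = logits.shape
    ce = nn.functional.cross_entropy(logits.reshape(B * T, V), targets.reshape(B * T))
    pred_probs = logits.softmax(dim=-1)
    ref_onehot = nn.functional.one_hot(targets, num_classes=V).float()
    metric_score = dln(pred_probs, ref_onehot).mean()  # higher is better
    return ce - lambda_metric * metric_score           # encourage a higher surrogate score

# Toy usage with random tensors (batch of 2 captions, 8 tokens, vocabulary of 1000)
if __name__ == "__main__":
    B, T, V = 2, 8, 1000
    logits = torch.randn(B, T, V, requires_grad=True)
    targets = torch.randint(0, V, (B, T))
    dln = DynamicLossNetwork(vocab_size=V)
    loss = captioning_loss(logits, targets, dln)
    loss.backward()
    print(float(loss))
```

Feeding the captioner's softmax outputs (rather than hard argmax tokens) into the scorer is what keeps the metric-aware term differentiable end to end; in a real system the scorer would first be fitted offline against actual metric values.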
Related papers
- Cross-Modal Transfer from Memes to Videos: Addressing Data Scarcity in Hateful Video Detection [8.05088621131726]
Video-based hate speech detection remains under-explored, hindered by a lack of annotated datasets and the high cost of video annotation.
We leverage meme datasets as both a substitution and an augmentation strategy for training hateful video detection models.
Our results consistently outperform state-of-the-art benchmarks.
arXiv Detail & Related papers (2025-01-26T07:50:14Z) - Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches.
To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR).
arXiv Detail & Related papers (2024-06-16T12:58:31Z) - TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at
Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tuning-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z) - MAViC: Multimodal Active Learning for Video Captioning [8.454261564411436]
In this paper, we introduce MAViC to address the challenges of active learning approaches for video captioning.
Our approach integrates semantic similarity and uncertainty of both visual and language dimensions in the acquisition function.
arXiv Detail & Related papers (2022-12-11T18:51:57Z) - RaP: Redundancy-aware Video-language Pre-training for Text-Video
Retrieval [61.77760317554826]
We propose Redundancy-aware Video-language Pre-training.
We design a redundancy measurement of video patches and text tokens by calculating the cross-modal minimum dissimilarity.
We evaluate our method on four benchmark datasets, MSRVTT, MSVD, DiDeMo, and LSMDC.
arXiv Detail & Related papers (2022-10-13T10:11:41Z) - Learning video retrieval models with relevance-aware online mining [16.548016892117083]
A typical approach is to learn a joint text-video embedding space in which the similarity between a video and its associated caption is maximized.
This approach assumes that only the video-caption pairs in the dataset are valid, but other captions (positives) may also describe a video's visual content, so some of them may be wrongly penalized.
We propose the Relevance-Aware Negatives and Positives mining (RANP) which, based on the semantics of the negatives, improves their selection while also increasing the similarity of other valid positives.
arXiv Detail & Related papers (2022-03-16T15:23:55Z) - Video Salient Object Detection via Contrastive Features and Attention
Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation and performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z) - Multi-modality Deep Restoration of Extremely Compressed Face Videos [36.83490465562509]
We develop a multi-modality deep convolutional neural network method for restoring face videos that are aggressively compressed.
The main innovation is a new DCNN architecture that incorporates known priors of multiple modalities.
Ample empirical evidence is presented to validate the superior performance of the proposed DCNN method on face videos.
arXiv Detail & Related papers (2021-07-05T16:29:02Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z) - Learning the Loss Functions in a Discriminative Space for Video
Restoration [48.104095018697556]
We propose a new framework for building effective loss functions by learning a discriminative space specific to a video restoration task.
Our framework is similar to GANs in that we iteratively train two networks - a generator and a loss network.
Experiments on video super-resolution and deblurring show that our method generates visually more pleasing videos (an illustrative sketch of this alternating scheme follows below).
arXiv Detail & Related papers (2020-03-20T06:58:27Z)
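
The last entry above, Learning the Loss Functions in a Discriminative Space for Video Restoration, describes a GAN-like scheme in which a restoration generator and a learned loss network are updated in turns. The toy sketch below illustrates such an alternating loop under assumed architectures and a simple squared-distance criterion in the learned feature space; it is a generic illustration, not the paper's exact formulation.

```python
# Generic, toy illustration of alternating generator / loss-network training for
# video restoration. Architectures and the discriminative criterion are assumed
# for demonstration and do not reproduce the cited paper's formulation.
import torch
import torch.nn as nn

generator = nn.Sequential(                      # toy frame restorer
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 3, 3, padding=1))
loss_net = nn.Sequential(                       # toy learned feature space
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(16, 16))

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-4)
l_opt = torch.optim.Adam(loss_net.parameters(), lr=1e-4)

def train_step(degraded, clean):
    # 1) Update the loss network: push restored outputs and clean frames apart
    #    in the learned feature space (a simple discriminative criterion).
    with torch.no_grad():
        fake = generator(degraded)
    gap = (loss_net(clean) - loss_net(fake)).pow(2).mean()
    l_loss = -gap                                # maximize the gap
    l_opt.zero_grad(); l_loss.backward(); l_opt.step()

    # 2) Update the generator: minimize the distance to clean frames as
    #    measured in the learned space (only the generator is stepped here).
    fake = generator(degraded)
    g_loss = (loss_net(fake) - loss_net(clean).detach()).pow(2).mean()
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return float(g_loss)

# Toy usage on random frames (batch of 2 RGB frames, 32x32)
degraded = torch.rand(2, 3, 32, 32)
clean = torch.rand(2, 3, 32, 32)
print(train_step(degraded, clean))
```

The key design point mirrored here is that the generator is never optimized against a fixed hand-crafted loss: its training signal is a distance measured in a feature space that the loss network keeps adapting to discriminate restored frames from clean ones.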