MAViC: Multimodal Active Learning for Video Captioning
- URL: http://arxiv.org/abs/2212.11109v1
- Date: Sun, 11 Dec 2022 18:51:57 GMT
- Title: MAViC: Multimodal Active Learning for Video Captioning
- Authors: Gyanendra Das, Xavier Thomas, Anant Raj, Vikram Gupta
- Abstract summary: In this paper, we introduce MAViC to address the challenges of active learning approaches for video captioning.
Our approach integrates semantic similarity and uncertainty of both visual and language dimensions in the acquisition function.
- Score: 8.454261564411436
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large number of annotated video-caption pairs are required for training
video captioning models, resulting in high annotation costs. Active learning
can be instrumental in reducing these annotation requirements. However, active
learning for video captioning is challenging because multiple semantically
similar captions are valid for a video, resulting in high entropy outputs even
for less-informative samples. Moreover, video captioning algorithms are
multimodal in nature with a visual encoder and language decoder. Further, the
sequential and combinatorial nature of the output makes the problem even more
challenging. In this paper, we introduce MAViC which leverages our proposed
Multimodal Semantics Aware Sequential Entropy (M-SASE) based acquisition
function to address the challenges of active learning approaches for video
captioning. Our approach integrates semantic similarity and uncertainty of both
visual and language dimensions in the acquisition function. Our detailed
experiments empirically demonstrate the efficacy of M-SASE for active learning
for video captioning, improving on the baselines by a large margin.
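The abstract describes M-SASE only at a high level, so its exact formulation is not reproduced here. As a rough, hypothetical sketch of the underlying idea (uncertainty coming from semantically equivalent captions should not count as informativeness), the snippet below groups sampled candidate captions by embedding similarity, computes entropy over those groups rather than over individual captions, and mixes in a visual-uncertainty term. The function names, the greedy clustering heuristic, and the weighting scheme are illustrative assumptions, not the paper's method.

```python
import numpy as np


def semantic_aware_entropy(candidate_logprobs, candidate_embeddings, sim_threshold=0.8):
    """Entropy over groups of semantically similar candidate captions.

    candidate_logprobs: length-k sequence of total log-probabilities for k
        captions sampled for one video (e.g. beam-search candidates).
    candidate_embeddings: (k, d) array of sentence embeddings for those
        captions (e.g. from any off-the-shelf sentence encoder).
    Candidates whose cosine similarity exceeds `sim_threshold` are merged into
    one group, so uncertainty among paraphrases does not inflate the score.
    """
    logp = np.asarray(candidate_logprobs, dtype=float)
    probs = np.exp(logp - np.logaddexp.reduce(logp))          # softmax over candidates
    emb = np.asarray(candidate_embeddings, dtype=float)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)    # unit-normalise rows

    # Greedy clustering of near-duplicate captions by cosine similarity.
    groups = []
    for i in range(len(probs)):
        for members in groups:
            if max(float(emb[i] @ emb[j]) for j in members) >= sim_threshold:
                members.append(i)
                break
        else:
            groups.append([i])

    group_probs = np.array([probs[members].sum() for members in groups])
    return float(-(group_probs * np.log(group_probs + 1e-12)).sum())


def msase_style_score(language_entropy, visual_uncertainty, alpha=0.5):
    """Hypothetical mix of language- and visual-side uncertainty (weights are illustrative)."""
    return alpha * language_entropy + (1.0 - alpha) * visual_uncertainty


# Toy example: three candidate captions, the first two being paraphrases.
logps = [-1.0, -1.1, -3.0]
embs = [[1.0, 0.0], [0.99, 0.14], [0.0, 1.0]]
score = msase_style_score(semantic_aware_entropy(logps, embs), visual_uncertainty=0.3)
print(score)
```

In an active-learning round, each unlabeled video would receive such a score and the highest-scoring videos would be sent for caption annotation.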
Related papers
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z) - Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
The Chain-of-Spot (CoS) method is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its spatiotemporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (a sketch of this idea follows the list below).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' capacity to distinguish event-level discrepancies.
We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events.
Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
arXiv Detail & Related papers (2023-11-21T18:43:07Z) - VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
arXiv Detail & Related papers (2023-10-16T17:05:56Z) - Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, which contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z) - Active Learning for Video Description With Cluster-Regularized Ensemble Ranking [3.5721078031625018]
We show that a cluster-regularized ensemble strategy provides the best active learning approach to efficiently gather training sets for video captioning.
We evaluate our approaches on the MSR-VTT and LSMDC datasets using both transformer and LSTM based captioning models.
arXiv Detail & Related papers (2020-07-27T23:52:41Z) - Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z) - Video Captioning with Guidance of Multimodal Latent Topics [123.5255241103578]
We propose a unified caption framework, M&M TGM, which mines multimodal topics in an unsupervised fashion from data.
Compared to pre-defined topics, the mined multimodal topics are more semantically and visually coherent.
The results from extensive experiments conducted on the MSR-VTT and Youtube2Text datasets demonstrate the effectiveness of our proposed model.
arXiv Detail & Related papers (2017-08-31T11:18:28Z)
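The VaQuitA entry above mentions replacing uniform frame sampling with selection guided by CLIP-score rankings. That paper's exact procedure is not given here; the snippet below is a minimal sketch under the assumption that frame and text embeddings have already been computed with a CLIP-style encoder, and it simply keeps the top-k frames by cosine similarity to the query text.

```python
import numpy as np


def clip_guided_frame_selection(frame_embeddings, text_embedding, k=8):
    """Pick the k frames whose CLIP-style embeddings best match the query text.

    frame_embeddings: (num_frames, d) array, one embedding per sampled frame.
    text_embedding: (d,) array, embedding of the query or instruction.
    Returns the indices of the selected frames in temporal order.
    """
    frames = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    text = text_embedding / np.linalg.norm(text_embedding)
    scores = frames @ text                      # cosine similarity per frame
    top_k = np.argsort(-scores)[:k]             # highest-scoring frames
    return np.sort(top_k)                       # keep temporal order


# Example with random stand-in embeddings (real use would pass CLIP features).
rng = np.random.default_rng(0)
idx = clip_guided_frame_selection(rng.normal(size=(64, 512)), rng.normal(size=512), k=8)
print(idx)
```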