Leveraging Generative Language Models for Weakly Supervised Sentence
Component Analysis in Video-Language Joint Learning
- URL: http://arxiv.org/abs/2312.06699v1
- Date: Sun, 10 Dec 2023 02:03:51 GMT
- Title: Leveraging Generative Language Models for Weakly Supervised Sentence
Component Analysis in Video-Language Joint Learning
- Authors: Zaber Ibn Abdul Hakim, Najibul Haque Sarker, Rahul Pratap Singh,
Bishmoy Paul, Ali Dabouei, Min Xu
- Abstract summary: A thorough comprehension of textual data is a fundamental element in multi-modal video analysis tasks.
We postulate that understanding the significance of the sentence components according to the target task can potentially enhance the performance of the models.
We propose a weakly supervised importance estimation module to compute the relative importance of the components and utilize them to improve different video-language tasks.
- Score: 10.486585276898472
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A thorough comprehension of textual data is a fundamental element in
multi-modal video analysis tasks. However, recent works have shown that the
current models do not achieve a comprehensive understanding of the textual data
during training for the target downstream tasks. Orthogonal to previous
approaches to this limitation, we postulate that understanding the significance
of the sentence components according to the target task can potentially enhance
the performance of the models. Hence, we utilize the knowledge of a pre-trained
large language model (LLM) to generate text samples from the original ones,
targeting specific sentence components. We propose a weakly supervised
importance estimation module to compute the relative importance of the
components and utilize them to improve different video-language tasks. Through
rigorous quantitative analysis, our proposed method exhibits significant
improvement across several video-language tasks. In particular, our approach
notably enhances video-text retrieval by a relative improvement of 8.3% in
video-to-text and 1.4% in text-to-video retrieval over the baselines, in terms
of R@1. Additionally, in video moment retrieval, average mAP shows a relative
improvement ranging from 2.0% to 13.7% across different baselines.
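As a concrete illustration of the importance-estimation idea summarized above, the snippet below is a minimal sketch only, not the authors' implementation. It assumes the LLM-generated caption variants (each altering one sentence component such as the subject, verb, or object) have already been encoded by a video-language encoder; the helper name component_importance and the similarity-drop heuristic are assumptions made here for illustration.

```python
import torch
import torch.nn.functional as F

def component_importance(video_emb, text_emb, variant_embs):
    """Score sentence components by how much altering each one hurts
    video-text similarity (a weakly supervised proxy for importance).

    video_emb:    (d,)   embedding of the video clip
    text_emb:     (d,)   embedding of the original caption
    variant_embs: (k, d) embeddings of k LLM-generated captions, each with
                  one component (e.g. subject, verb, object) altered
    """
    base_sim = F.cosine_similarity(video_emb, text_emb, dim=0)                       # scalar
    variant_sims = F.cosine_similarity(video_emb.unsqueeze(0), variant_embs, dim=1)  # (k,)
    drops = torch.clamp(base_sim - variant_sims, min=0.0)   # larger drop -> more important
    return F.softmax(drops, dim=0)                          # relative importance weights

# Toy usage: random tensors stand in for outputs of a video-language encoder.
d, k = 512, 3
weights = component_importance(torch.randn(d), torch.randn(d), torch.randn(k, d))
print(weights)  # e.g. tensor([0.31, 0.42, 0.27]), summing to 1
```

In the paper's setting, such weights would then modulate the training objective of the downstream video-language task; here random tensors merely stand in for real encoder outputs.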
Related papers
- Large Language Models for Video Surveillance Applications [11.297664744056735]
This paper presents an innovative proof of concept using Generative Artificial Intelligence (GenAI) in the form of Vision Language Models.
Our tool generates customized textual summaries based on user-defined queries, providing focused insights within extensive video datasets.
arXiv Detail & Related papers (2025-01-06T08:57:44Z)
- NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality [52.08735848128973]
We study the capability of Video-Language (VidL) models in understanding compositions between objects, attributes, actions and their relations.
We propose a training method called NAVERO which utilizes video-text data augmented with negative texts to enhance composition understanding.
arXiv Detail & Related papers (2024-08-18T15:27:06Z)
- T2V-CompBench: A Comprehensive Benchmark for Compositional Text-to-video Generation [55.57459883629706]
We conduct the first systematic study on compositional text-to-video generation.
We propose T2V-CompBench, the first benchmark tailored for compositional text-to-video generation.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The plain, simple text descriptions and the visual-only focus of current language-video tasks limit performance on real-world natural language video retrieval.
This paper introduces a method to automatically enhance video-language datasets, making them more modality- and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- Autoregressive Pre-Training on Pixels and Texts [35.82610192457444]
We explore the dual modality of language--both visual and textual--within an autoregressive framework, pre-trained on both document images and texts.
Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head.
We find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks.
arXiv Detail & Related papers (2024-04-16T16:36:50Z)
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story-understanding benchmarks, we publicly release the first dataset on persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training [30.16501510589718]
We propose a pre-training framework that jointly models object and action information across spatial and temporal dimensions.
We design two auxiliary tasks to better incorporate both kinds of information into the pre-training process of the video-language model.
arXiv Detail & Related papers (2023-02-20T03:13:45Z)
- Prompting Visual-Language Models for Efficient Video Understanding [28.754997650215486]
This paper presents a simple method to efficiently adapt one pre-trained visual-language model to novel tasks with minimal training.
To bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacking on top of frame-wise visual features.
arXiv Detail & Related papers (2021-12-08T18:58:16Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
- Object Relational Graph with Teacher-Recommended Learning for Video Captioning [92.48299156867664]
We propose a complete video captioning system including both a novel model and an effective training strategy.
Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed interaction features to enrich visual representation.
Meanwhile, we design a teacher-recommended learning (TRL) method to make full use of the successful external language model (ELM) to integrate the abundant linguistic knowledge into the caption model.
arXiv Detail & Related papers (2020-02-26T15:34:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed information and is not responsible for any consequences of its use.