Robustness Analysis of Video-Language Models Against Visual and Language
Perturbations
- URL: http://arxiv.org/abs/2207.02159v4
- Date: Tue, 18 Jul 2023 17:23:25 GMT
- Title: Robustness Analysis of Video-Language Models Against Visual and Language
Perturbations
- Authors: Madeline C. Schiappa, Shruti Vyas, Hamid Palangi, Yogesh S. Rawat,
Vibhav Vineet
- Abstract summary: This study is the first extensive robustness study of video-language models against various real-world perturbations.
We propose two large-scale benchmark datasets, MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different text perturbations.
- Score: 10.862722733649543
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Joint visual and language modeling on large-scale datasets has recently shown
good progress in multi-modal tasks when compared to single modal learning.
However, robustness of these approaches against real-world perturbations has
not been studied. In this work, we perform the first extensive robustness study
of video-language models against various real-world perturbations. We focus on
text-to-video retrieval and propose two large-scale benchmark datasets,
MSRVTT-P and YouCook2-P, which utilize 90 different visual and 35 different
text perturbations. The study reveals some interesting initial findings about the studied
models: 1) models are generally more susceptible when only video is perturbed
as opposed to when only text is perturbed, 2) pre-trained models are more
robust than those trained from scratch, and 3) models attend more to scene and
objects than to motion and action. We hope this study
will serve as a benchmark and guide future research in robust video-language
learning. The benchmark introduced in this study along with the code and
datasets is available at https://bit.ly/3CNOly4.
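To make the evaluation protocol concrete, below is a minimal sketch of how one might score text-to-video retrieval robustness under such perturbations. The Gaussian-noise and character-swap perturbations, the Recall@1 routine, and the relative-robustness score are illustrative assumptions for this summary, not the benchmark's actual implementation or metric definitions.
```python
# Minimal sketch (not the paper's code): measure how much text-to-video
# Recall@1 degrades when video frames or text queries are perturbed.
import random
import numpy as np

def gaussian_noise(frames: np.ndarray, sigma: float = 0.1) -> np.ndarray:
    """Example visual perturbation: additive Gaussian noise on pixels in [0, 1]."""
    noisy = frames + np.random.normal(0.0, sigma, size=frames.shape)
    return np.clip(noisy, 0.0, 1.0)

def char_swap(text: str, n_swaps: int = 2) -> str:
    """Example text perturbation: swap a few adjacent characters."""
    chars = list(text)
    for _ in range(n_swaps):
        if len(chars) < 2:
            break
        i = random.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def recall_at_1(text_emb: np.ndarray, video_emb: np.ndarray) -> float:
    """Fraction of text queries whose top-ranked video is the paired one."""
    sims = text_emb @ video_emb.T  # (num_queries, num_videos) similarity matrix
    return float(np.mean(np.argmax(sims, axis=1) == np.arange(len(text_emb))))

def relative_robustness(clean: float, perturbed: float) -> float:
    """1.0 means no degradation; lower values mean a larger relative drop."""
    return 1.0 - (clean - perturbed) / max(clean, 1e-8)
```
In practice, text_emb and video_emb would come from a video-language model's encoders, and the score would be computed once on clean inputs and once per perturbation type and severity.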
Related papers
- Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data [19.210471935816273]
We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD) and a new Feint6K dataset.
To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning.
Our approach successfully learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models.
arXiv Detail & Related papers (2024-07-18T01:55:48Z)
- Grafting Pre-trained Models for Multimodal Headline Generation [12.063053852096514]
Multimodal headline generation utilizes both video frames and transcripts to generate the natural language title of the videos.
Previous research on pre-trained language models and video-language models has achieved significant progress in related downstream tasks.
We propose a novel approach to graft the video encoder from the pre-trained video-language model on the generative pre-trained language model.
arXiv Detail & Related papers (2022-11-14T08:59:59Z)
- Learning a Grammar Inducer from Massive Uncurated Instructional Videos [118.7279072358029]
Video-aided grammar induction aims to leverage video information for finding more accurate syntactic grammars for accompanying text.
We build a new model that can better learn video-span correlation without manually designed features.
Our model yields higher F1 scores than the previous state-of-the-art systems trained on in-domain data.
arXiv Detail & Related papers (2022-10-22T00:22:55Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- An Understanding-Oriented Robust Machine Reading Comprehension Model [12.870425062204035]
We propose an understanding-oriented machine reading comprehension model to address three kinds of robustness issues.
Specifically, we first use a natural language inference module to help the model understand the accurate semantic meanings of input questions.
Third, we propose a multilanguage learning mechanism to address the issue of generalization.
arXiv Detail & Related papers (2022-07-01T03:32:02Z)
- Revealing Single Frame Bias for Video-and-Language Learning [115.01000652123882]
We show that a single-frame trained model can achieve better performance than existing methods that use multiple frames for training.
This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets.
We propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling.
arXiv Detail & Related papers (2022-06-07T16:28:30Z)
- Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval [70.30052749168013]
Multi-channel video-language retrieval requires models to understand information from different channels.
Contrastive multimodal models are shown to be highly effective at aligning entities in images/videos and text.
There is no clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources.
arXiv Detail & Related papers (2022-06-05T01:43:52Z)
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z)
- Towards Trustworthy Deception Detection: Benchmarking Model Robustness across Domains, Modalities, and Languages [10.131671217810581]
We evaluate model robustness to out-of-domain data, modality-specific features, and languages other than English.
We find that with additional image content as input, ELMo embeddings yield significantly fewer errors compared to BERT or GloVe.
arXiv Detail & Related papers (2021-04-23T18:05:52Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)