Analyzing Zero-Shot Abilities of Vision-Language Models on Video
Understanding Tasks
- URL: http://arxiv.org/abs/2310.04914v2
- Date: Fri, 24 Nov 2023 22:25:07 GMT
- Title: Analyzing Zero-Shot Abilities of Vision-Language Models on Video
Understanding Tasks
- Authors: Avinash Madasu, Anahita Bhiwandiwalla, Vasudev Lal
- Abstract summary: We propose a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting.
Our experiments show that image-text models exhibit impressive performance on video AR, video RT and video MC.
These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
- Score: 6.925770576386087
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Foundational multimodal models pre-trained on large scale image-text pairs or
video-text pairs or both have shown strong generalization abilities on
downstream tasks. However, unlike image-text models, pretraining video-text
models is not always feasible due to the difficulty in collecting large-scale,
clean and aligned data, and the exponential computational costs involved in the
pretraining phase. Therefore, the pertinent question to ask is: Can image-text
models be adapted to video tasks and is there any benefit to using these models
over pretraining directly on videos? In this work, we focus on this question by
proposing a detailed study on the generalization abilities of image-text models
when evaluated on video understanding tasks in a zero-shot setting. We
investigate 9 foundational image-text models on a diverse set of video tasks
that include video action recognition (video AR), video retrieval (video RT),
video question answering (video QA), video multiple choice (video MC) and video
captioning (video CP). Our experiments show that image-text models exhibit
impressive performance on video AR, video RT and video MC. Furthermore, they
perform moderately on video captioning and poorly on video QA. These findings
shed light on the benefits of adapting foundational image-text models to an
array of video tasks while avoiding the costly pretraining step.
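As a concrete illustration of how an image-text model can be applied to video without any video pretraining, below is a minimal sketch of zero-shot video action recognition: frames are encoded independently by a frozen image-text model and mean-pooled into a video embedding that is matched against text embeddings of class-name prompts. CLIP, the prompt template, uniform frame sampling, and mean pooling are assumptions for illustration, not the paper's exact evaluation protocol.

```python
# Hedged sketch: zero-shot video action recognition with a frozen image-text
# model. CLIP, the prompt template, and mean-pooling over frames are
# illustrative assumptions, not the paper's exact evaluation setup.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify_video(frames, class_names):
    """frames: list of PIL images sampled uniformly from the video clip."""
    prompts = [f"a video of a person {name}" for name in class_names]
    with torch.no_grad():
        image_inputs = processor(images=frames, return_tensors="pt")
        text_inputs = processor(text=prompts, return_tensors="pt", padding=True)
        frame_emb = model.get_image_features(**image_inputs)   # (num_frames, dim)
        text_emb = model.get_text_features(**text_inputs)      # (num_classes, dim)
    # Mean-pool per-frame embeddings into a single video embedding.
    video_emb = frame_emb.mean(dim=0, keepdim=True)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = (video_emb @ text_emb.T).squeeze(0)                # cosine similarities
    return class_names[int(scores.argmax())]
```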
Related papers
- Rethinking Video-Text Understanding: Retrieval from Counterfactually Augmented Data [19.210471935816273]
We propose a novel evaluation task for video-text understanding, namely retrieval from counterfactually augmented data (RCAD) and a new Feint6K dataset.
To succeed on our new evaluation task, models must derive a comprehensive understanding of the video from cross-frame reasoning.
Our approach successfully learns more discriminative action embeddings and improves results on Feint6K when applied to multiple video-text models.
arXiv Detail & Related papers (2024-07-18T01:55:48Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Harvest Video Foundation Models via Efficient Post-Pretraining [67.30842563833185]
We propose an efficient framework to harvest video foundation models from image ones.
Our method is intuitively simple by randomly dropping input video patches and masking out input text during the post-pretraining procedure.
Our method achieves state-of-the-art performance, comparable to that of heavily pretrained video foundation models.
arXiv Detail & Related papers (2023-10-30T14:06:16Z)
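The entry above mentions randomly dropping video patches and masking text during post-pretraining; below is a minimal hypothetical sketch of those two operations on batched tensors. The keep ratio, mask probability, tensor shapes, and mask id are illustrative assumptions, not the paper's settings.

```python
# Hypothetical sketch of the post-pretraining inputs described above:
# randomly drop video patch tokens and mask out text tokens.
# Ratios, shapes, and the mask id are illustrative assumptions.
import torch

def drop_video_patches(patch_tokens: torch.Tensor, keep_ratio: float = 0.5):
    """patch_tokens: (batch, num_patches, dim). Keeps a random subset of patches."""
    b, n, d = patch_tokens.shape
    num_keep = max(1, int(n * keep_ratio))
    # Independent random permutation per sample; keep the first num_keep indices.
    idx = torch.rand(b, n).argsort(dim=1)[:, :num_keep]
    return torch.gather(patch_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, d))

def mask_text_tokens(token_ids: torch.Tensor, mask_id: int, mask_prob: float = 0.15):
    """token_ids: (batch, seq_len) of token ids. Replaces a random subset with mask_id."""
    mask = torch.rand(token_ids.shape) < mask_prob
    masked_ids = torch.where(mask, torch.full_like(token_ids, mask_id), token_ids)
    return masked_ids, mask
```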
- Probabilistic Adaptation of Text-to-Video Models [181.84311524681536]
Video Adapter is capable of incorporating the broad knowledge and preserving the high fidelity of a large pretrained video model in a task-specific small video model.
Video Adapter is able to generate high-quality yet specialized videos on a variety of tasks such as animation, egocentric modeling, and modeling of simulated and real-world robotics data.
arXiv Detail & Related papers (2023-06-02T19:00:17Z)
- InternVideo: General Video Foundation Models via Generative and Discriminative Learning [52.69422763715118]
We present general video foundation models, InternVideo, for dynamic and complex video-level understanding tasks.
InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives.
InternVideo achieves state-of-the-art performance on 39 video datasets from extensive tasks including video action recognition/detection, video-language alignment, and open-world video applications.
arXiv Detail & Related papers (2022-12-06T18:09:49Z)
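The entry above names masked video modeling and video-language contrastive learning as pretraining objectives; a minimal hypothetical sketch of how such objectives are commonly combined into a single loss is given below. The loss weighting, temperature, and tensor shapes are assumptions for illustration, not InternVideo's implementation.

```python
# Hypothetical sketch: masked-patch reconstruction plus symmetric video-text
# contrastive (InfoNCE) loss. Shapes, weights, and encoders are illustrative only.
import torch
import torch.nn.functional as F

def combined_pretraining_loss(
    pred_patches: torch.Tensor,    # (batch, num_masked, dim) reconstructed patches
    target_patches: torch.Tensor,  # (batch, num_masked, dim) ground-truth patches
    video_emb: torch.Tensor,       # (batch, dim) pooled video embeddings
    text_emb: torch.Tensor,        # (batch, dim) paired text embeddings
    temperature: float = 0.07,
    recon_weight: float = 1.0,
) -> torch.Tensor:
    # Masked video modeling: regress the masked patches.
    recon_loss = F.mse_loss(pred_patches, target_patches)
    # Video-language contrastive learning: symmetric InfoNCE over the batch.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                      # (batch, batch)
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) +
                   F.cross_entropy(logits.T, labels)) / 2
    return recon_weight * recon_loss + contrastive
```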
- Revisiting the "Video" in Video-Language Understanding [56.15777956496518]
We propose the atemporal probe (ATP), a new model for video-language analysis.
We characterize the limitations and potential of current video-language benchmarks.
We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
arXiv Detail & Related papers (2022-06-03T17:57:33Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use image-language models to translate the video content into frame captions and object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
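A minimal, hypothetical sketch of the two-stage pipeline described in the entry above follows; the `caption_frame` and `complete` callables stand in for any image captioner and language-model API, and the prompt format is an assumption rather than the VidIL prompt.

```python
# Hypothetical sketch of a VidIL-style pipeline (not the authors' code):
# an image-language model captions sampled frames, the captions are composed
# into a few-shot prompt, and a language model generates the target output.
from typing import Any, Callable, List

def vidil_style_generate(
    frames: List[Any],                         # e.g. PIL images sampled from the video
    in_context_examples: List[str],            # few-shot examples as formatted text
    caption_frame: Callable[[Any], str],       # stand-in for a BLIP-style image captioner
    complete: Callable[[str], str],            # stand-in for a language-model completion call
) -> str:
    # Stage 1: translate visual content into text, frame by frame.
    frame_captions = [f"Frame {i + 1}: {caption_frame(f)}" for i, f in enumerate(frames)]
    # Stage 2: compose a few-shot prompt and let the language model produce the output.
    prompt = "\n\n".join(in_context_examples) + "\n\n" + "\n".join(frame_captions) + "\nVideo caption:"
    return complete(prompt)
```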
- FitCLIP: Refining Large-Scale Pretrained Image-Text Models for Zero-Shot Video Understanding Tasks [3.832696393393788]
Large-scale pretrained image-text models have shown incredible zero-shot performance in a handful of tasks.
We present a fine-tuning strategy to refine these models for zero-shot video understanding tasks.
arXiv Detail & Related papers (2022-03-24T22:35:00Z)