Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
- URL: http://arxiv.org/abs/2309.15915v1
- Date: Wed, 27 Sep 2023 18:00:09 GMT
- Title: Zero-Shot and Few-Shot Video Question Answering with Multi-Modal Prompts
- Authors: Deniz Engin and Yannis Avrithis
- Abstract summary: Recent vision-language models are driven by large-scale pretrained models.
We introduce a parameter-efficient method to address challenges such as overfitting, catastrophic forgetting, and the cross-modal gap between vision and language.
Our experiments on several video question answering benchmarks demonstrate the superiority of our approach in terms of performance and parameter efficiency.
- Score: 14.610244867640471
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent vision-language models are driven by large-scale pretrained models.
However, adapting pretrained models on limited data presents challenges such as
overfitting, catastrophic forgetting, and the cross-modal gap between vision
and language. We introduce a parameter-efficient method to address these
challenges, combining multimodal prompt learning and a transformer-based
mapping network, while keeping the pretrained models frozen. Our experiments on
several video question answering benchmarks demonstrate the superiority of our
approach in terms of performance and parameter efficiency in both zero-shot and
few-shot settings. Our code is available at https://engindeniz.github.io/vitis.
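As a rough illustration of the recipe the abstract describes (learnable multimodal prompts plus a transformer-based mapping network between frozen backbones), here is a minimal sketch. All module names, shapes, and design choices below are assumptions for illustration, not the authors' implementation; see their repository for the real one.

```python
# Minimal sketch: learnable prompt tokens plus a small transformer mapping
# network bridging a frozen video encoder and a frozen language model.
# Dimensions and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class PromptedMapper(nn.Module):
    def __init__(self, vis_dim=768, lm_dim=1024, n_prompts=10, n_layers=2):
        super().__init__()
        # Learnable prompt tokens injected into the (frozen) language model input.
        self.text_prompts = nn.Parameter(torch.randn(n_prompts, lm_dim) * 0.02)
        # Transformer mapping network: projects frozen video features into
        # the language model's embedding space.
        self.proj_in = nn.Linear(vis_dim, lm_dim)
        layer = nn.TransformerEncoderLayer(d_model=lm_dim, nhead=8, batch_first=True)
        self.mapper = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, video_feats, question_embeds):
        # video_feats: (B, T, vis_dim) from a frozen video encoder
        # question_embeds: (B, L, lm_dim) from the frozen LM's embedding table
        visual_tokens = self.mapper(self.proj_in(video_feats))   # (B, T, lm_dim)
        prompts = self.text_prompts.expand(video_feats.size(0), -1, -1)
        # Concatenate [prompts | mapped video | question] for the frozen LM.
        return torch.cat([prompts, visual_tokens, question_embeds], dim=1)

mapper = PromptedMapper()
inputs = mapper(torch.randn(2, 16, 768), torch.randn(2, 12, 1024))
print(inputs.shape)  # torch.Size([2, 38, 1024])
```

Only the prompts and the mapping network would receive gradients here, which is what makes this style of approach parameter-efficient.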
Related papers
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting language models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
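The eP-ALM summary above is concrete enough to sketch: freeze essentially everything, train one linear projection from the perceptual encoder into the language model, and prepend one trainable token. Dimensions and module names below are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class EPalmBridge(nn.Module):
    """Everything outside this module is assumed frozen."""
    def __init__(self, vis_dim=768, lm_dim=2048):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)                     # the only trained layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # one trainable token

    def forward(self, vis_feat, text_embeds):
        # vis_feat: (B, vis_dim) pooled output of a frozen perceptual encoder
        # text_embeds: (B, L, lm_dim) input embeddings of the frozen language model
        vis_token = self.proj(vis_feat).unsqueeze(1)               # (B, 1, lm_dim)
        prefix = self.soft_token.expand(vis_feat.size(0), -1, -1)
        return torch.cat([prefix, vis_token, text_embeds], dim=1)

def trainable_fraction(model: nn.Module) -> float:
    # With large frozen backbones attached, this fraction drops below 1%.
    total = sum(p.numel() for p in model.parameters())
    trained = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return trained / total
```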
Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z)
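One way to picture the FS/ZS "marriage" above is a class prototype built from both a few labelled support clips (the few-shot side) and the class name's text embedding (the zero-shot side). The sketch below is a hypothetical illustration of that idea, not the paper's method.

```python
import torch

def multimodal_prototype(support_feats, text_feat, alpha=0.5):
    # support_feats: (K, D) features of K labelled support clips (few-shot side)
    # text_feat: (D,) embedding of the class name (zero-shot side)
    proto = alpha * support_feats.mean(dim=0) + (1 - alpha) * text_feat
    return proto / proto.norm()

def score_segment(segment_feat, prototypes):
    # Cosine similarity between a candidate temporal segment and each class prototype.
    seg = segment_feat / segment_feat.norm()
    return torch.stack([seg @ p for p in prototypes])
```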
Cross-Modal Adapter for Text-Video Retrieval [91.9575196703281]
We present a novel Cross-Modal Adapter for parameter-efficient fine-tuning.
Inspired by adapter-based methods, we adjust the pre-trained model with a few parameterization layers.
It achieves superior or comparable performance compared to fully fine-tuned methods on MSR-VTT, MSVD, VATEX, ActivityNet, and DiDeMo datasets.
arXiv Detail & Related papers (2022-11-17T16:15:30Z)
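For readers unfamiliar with adapters, the generic pattern behind the entry above is a small bottleneck module with a residual connection, inserted into an otherwise frozen pretrained model. The sketch below shows that generic pattern; the paper's actual cross-modal weight-sharing scheme may differ.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight)  # start as an identity mapping,
        nn.init.zeros_(self.up.bias)    # so pretrained behavior is preserved

    def forward(self, x):
        # Residual bottleneck: only these few parameters are trained.
        return x + self.up(self.act(self.down(x)))
```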
Composing Ensembles of Pre-trained Models via Iterative Consensus [95.10641301155232]
We propose a unified framework for composing ensembles of different pre-trained models.
We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization.
We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer.
arXiv Detail & Related papers (2022-10-20T18:46:31Z)
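The closed-loop consensus idea above can be summarized in a few lines: a generator proposes candidates, an ensemble of scorers ranks them, and the winner seeds the next round. `generate` and the scorers below are placeholder callables standing in for pretrained models; the loop structure is a sketch, not the paper's exact optimization.

```python
def iterative_consensus(generate, scorers, prompt, rounds=5, k=8):
    best = None
    for _ in range(rounds):
        # Generator proposes k candidates, conditioned on the current best.
        candidates = [generate(prompt, seed=best) for _ in range(k)]
        # Consensus: average the scores assigned by all scorers.
        scored = [(sum(s(c) for s in scorers) / len(scorers), c) for c in candidates]
        _, best = max(scored, key=lambda t: t[0])
    return best
```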
Zero-Shot Learners for Natural Language Understanding via a Unified Multiple Choice Perspective [26.41585967095811]
Zero-shot learning aims to train a model on a given task such that it can address new learning tasks without any additional training.
Our approach converts zero-shot learning into multiple-choice tasks, avoiding problems in commonly used large-scale generative models such as FLAN.
Our approach shows state-of-the-art performance on several benchmarks and produces satisfactory results on tasks such as natural language inference and text classification.
arXiv Detail & Related papers (2022-10-16T17:24:06Z)
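Casting zero-shot tasks as multiple choice, as described above, amounts to scoring each candidate option and returning the best one. `score_option` below is a hypothetical stand-in for the trained choice scorer, used only to show the task conversion.

```python
def answer_as_multiple_choice(score_option, context, question, options):
    # e.g. options = ["entailment", "neutral", "contradiction"] for NLI,
    # or the label names of a text classification task.
    scores = {opt: score_option(context, question, opt) for opt in options}
    return max(scores, key=scores.get)
```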
Clover: Towards A Unified Video-Language Alignment and Fusion Model [154.1070559563592]
We introduce Clover, a Correlated Video-Language pre-training method.
It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task.
Clover establishes new state-of-the-art results on multiple downstream tasks.
arXiv Detail & Related papers (2022-07-16T09:38:52Z)
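A generic way to picture a tri-modal alignment objective like the one above is a symmetric contrastive loss applied across the video-text, video-fused, and text-fused pairings. This is an illustrative guess at the pattern, not Clover's exact pre-training task.

```python
import torch
import torch.nn.functional as F

def contrastive(a, b, temp=0.07):
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temp                      # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    # Symmetric InfoNCE: match a->b and b->a along the diagonal.
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def tri_modal_alignment_loss(video, text, fused):
    # Pull matched samples together across all three modality pairings.
    return contrastive(video, text) + contrastive(video, fused) + contrastive(text, fused)
```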
Zero-Shot Video Question Answering via Frozen Bidirectional Language Models [89.71617065426146]
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training.
Recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs.
We build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA.
arXiv Detail & Related papers (2022-06-16T13:18:20Z)
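The bidirectional-LM approach above can be demonstrated without any video input: place a [MASK] where the answer goes and read off the most probable fill among candidate answers. The snippet below uses BERT as a stand-in masked LM and omits the visual conditioning entirely, so it only illustrates the answer-scoring mechanism.

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

question = "Question: what is the man riding? Answer: [MASK]."
inputs = tok(question, return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits                  # (1, L, vocab)

# Score a small candidate answer vocabulary at the [MASK] position.
mask_pos = (inputs.input_ids == tok.mask_token_id).nonzero()[0, 1]
answers = ["horse", "bike", "boat"]
ids = [tok.convert_tokens_to_ids(a) for a in answers]
scores = logits[0, mask_pos, ids]
print(answers[scores.argmax()])
```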
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.