Zero-Shot Video Question Answering via Frozen Bidirectional Language
Models
- URL: http://arxiv.org/abs/2206.08155v1
- Date: Thu, 16 Jun 2022 13:18:20 GMT
- Title: Zero-Shot Video Question Answering via Frozen Bidirectional Language
Models
- Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
- Abstract summary: Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training.
Recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs.
We instead build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA.
- Score: 89.71617065426146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video question answering (VideoQA) is a complex task that requires diverse
multi-modal data for training. Manual annotation of question and answers for
videos, however, is tedious and prohibits scalability. To tackle this problem,
recent methods consider zero-shot settings with no manual annotation of visual
question-answer pairs. In particular, a promising approach adapts frozen
autoregressive language models pretrained on Web-scale text-only data to
multi-modal inputs. In contrast, we here build on frozen bidirectional language
models (BiLM) and show that such an approach provides a stronger and cheaper
alternative for zero-shot VideoQA. In particular, (i) we combine visual inputs
with the frozen BiLM using light trainable modules, (ii) we train such modules
using Web-scraped multi-modal data, and finally (iii) we perform zero-shot
VideoQA inference through masked language modeling, where the masked text is
the answer to a given question. Our proposed approach, FrozenBiLM, outperforms
the state of the art in zero-shot VideoQA by a significant margin on a variety
of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA,
TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in
the few-shot and fully-supervised setting. Our code and models will be made
publicly available at https://antoyang.github.io/frozenbilm.html.
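As a rough, text-only illustration of step (iii), the sketch below scores a small set of candidate answers at the [MASK] position of a frozen masked language model. The generic BERT checkpoint, the prompt template, and CANDIDATE_ANSWERS are placeholders; the visual prefix (frozen video encoder plus light trainable projection) used by FrozenBiLM is omitted here.
```python
# Minimal zero-shot masked-LM answering sketch (text-only, placeholder setup).
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-uncased"  # generic BiLM stand-in, not the authors' backbone
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()  # the LM stays frozen

CANDIDATE_ANSWERS = ["kitchen", "park", "office"]  # hypothetical answer vocabulary

@torch.no_grad()
def answer(question: str) -> str:
    # Zero-shot inference: the answer is predicted at the [MASK] position.
    # In FrozenBiLM, projected video features would be prepended to these text tokens.
    prompt = f"Question: {question} Answer: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    logits = model(**inputs).logits
    mask_pos = int((inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0])
    cand_ids = [tokenizer(a, add_special_tokens=False)["input_ids"][0] for a in CANDIDATE_ANSWERS]
    scores = logits[0, mask_pos, cand_ids]  # score each candidate's first sub-token
    return CANDIDATE_ANSWERS[int(scores.argmax())]

print(answer("Where is the person cooking?"))
```
In the full method, the same masked prediction is conditioned on video features combined with the frozen BiLM through light trainable modules, which are the only parameters updated during training on Web-scraped multi-modal data.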
Related papers
- Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen
Large Language Models [69.59125732317972]
We propose a simple yet effective Retrieving-to-Answer (R2A) framework for VideoQA.
R2A first retrieves a set of semantically similar texts from a generic text corpus using a pre-trained multi-modal model.
With both the question and the retrieved texts, an LLM can be directly used to yield the desired answer.
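A minimal sketch of this retrieve-then-answer pipeline is given below, assuming video and corpus-text embeddings from some pre-trained multi-modal encoder are already available; the helper names, embedding dimensions, and prompt format are illustrative, not the R2A implementation.
```python
# Retrieve corpus texts closest to the video in a shared embedding space,
# then build a prompt that pairs them with the question for an LLM.
import torch
import torch.nn.functional as F

def retrieve(video_emb: torch.Tensor, corpus_embs: torch.Tensor,
             corpus_texts: list[str], k: int = 5) -> list[str]:
    # Cosine similarity between the video embedding and every corpus text.
    sims = F.cosine_similarity(video_emb.unsqueeze(0), corpus_embs, dim=-1)
    top = sims.topk(k).indices.tolist()
    return [corpus_texts[i] for i in top]

def build_prompt(question: str, retrieved: list[str]) -> str:
    context = " ".join(retrieved)
    return f"Context: {context}\nQuestion: {question}\nAnswer:"

# Stand-in embeddings; in practice they would come from a pre-trained multi-modal model.
corpus_texts = ["a person chops vegetables", "a dog runs in a park", "someone types on a laptop"]
corpus_embs = torch.randn(len(corpus_texts), 512)
video_emb = torch.randn(512)
prompt = build_prompt("What is the person doing?", retrieve(video_emb, corpus_embs, corpus_texts, k=2))
```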
arXiv Detail & Related papers (2023-06-15T20:56:20Z)
- Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
arXiv Detail & Related papers (2022-05-10T16:34:26Z)
- Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
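The sketch below illustrates the general idea of turning a transcribed narration into a question-answer pair with a text-trained question-generation transformer; the checkpoint name, prompt format, and answer-span choice are hypothetical placeholders rather than the authors' pipeline.
```python
# Generate a question for a chosen answer span in a transcribed narration.
from transformers import pipeline

QG_MODEL = "some-org/t5-question-generation"   # hypothetical seq2seq QG checkpoint
qg = pipeline("text2text-generation", model=QG_MODEL)

def narration_to_qa(narration: str, answer_span: str) -> dict:
    # Condition the generator on both the answer span and the narration context.
    prompt = f"answer: {answer_span} context: {narration}"
    question = qg(prompt, max_new_tokens=32)[0]["generated_text"]
    return {"question": question, "answer": answer_span}

print(narration_to_qa("The chef slices the onions on a wooden board.", "onions"))
```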
arXiv Detail & Related papers (2020-12-01T12:59:20Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
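As a loose illustration of the frame-selection gating component, the sketch below applies a learned sigmoid gate per frame to downweight irrelevant frames; dimensions and module names are illustrative and do not reproduce the paper's exact design.
```python
# A learned per-frame gate: score each frame and scale its features accordingly.
import torch
import torch.nn as nn

class FrameSelectionGate(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim); gate value in [0, 1] per frame.
        gate = torch.sigmoid(self.score(frame_feats))
        return gate * frame_feats

gated = FrameSelectionGate(dim=512)(torch.randn(2, 16, 512))
```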
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
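A minimal sketch of this hierarchical encoding idea follows: one transformer fuses the tokens within each frame (local, cross-modal context), and a second transformer models the sequence of frame summaries (global, temporal context); layer sizes and shapes are illustrative, not the actual HERO architecture.
```python
# Two-level encoding: per-frame cross-modal fusion, then clip-level temporal modeling.
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        layer = lambda: nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.cross_modal = nn.TransformerEncoder(layer(), num_layers=2)  # within each frame
        self.temporal = nn.TransformerEncoder(layer(), num_layers=2)     # across frames

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, frames, tokens_per_frame, dim) of visual+text tokens per frame.
        b, f, t, d = frame_tokens.shape
        local = self.cross_modal(frame_tokens.reshape(b * f, t, d))[:, 0]  # one summary per frame
        return self.temporal(local.reshape(b, f, d))                       # clip-level context

out = HierarchicalEncoder()(torch.randn(2, 8, 10, 512))
```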
arXiv Detail & Related papers (2020-05-01T03:49:26Z)