Learning to Answer Visual Questions from Web Videos
- URL: http://arxiv.org/abs/2205.05019v2
- Date: Wed, 11 May 2022 05:31:08 GMT
- Title: Learning to Answer Visual Questions from Web Videos
- Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
- Abstract summary: We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
- Score: 89.71617065426146
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent methods for visual question answering rely on large-scale annotated
datasets. Manual annotation of questions and answers for videos, however, is
tedious, expensive and prevents scalability. In this work, we propose to avoid
manual annotation and generate a large-scale training dataset for video
question answering making use of automatic cross-modal supervision. We leverage
a question generation transformer trained on text data and use it to generate
question-answer pairs from transcribed video narrations. Given narrated videos,
we then automatically generate the HowToVQA69M dataset with 69M
video-question-answer triplets. To handle the open vocabulary of diverse
answers in this dataset, we propose a training procedure based on a contrastive
loss between a video-question multi-modal transformer and an answer
transformer. We introduce the zero-shot VideoQA task and the VideoQA feature
probe evaluation setting and show excellent results, in particular for rare
answers. Furthermore, our method achieves competitive results on MSRVTT-QA,
ActivityNet-QA, MSVD-QA and How2QA datasets. We also show that our VideoQA
dataset generation approach generalizes to another source of web video and text
data. We use our method to generate the WebVidVQA3M dataset from the WebVid
dataset, i.e., videos with alt-text annotations, and show its benefits for
training VideoQA models. Finally, for a detailed evaluation we introduce iVQA,
a new VideoQA dataset with reduced language bias and high-quality manual
annotations. Code, datasets and trained models are available at
https://antoyang.github.io/just-ask.html
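The training objective described in the abstract, a contrastive loss between a video-question multi-modal transformer and an answer transformer, can be illustrated with a short in-batch contrastive sketch. The encoder outputs below are random placeholders and the exact loss formulation and hyperparameters are assumptions, not the released implementation.

```python
# Minimal sketch of an in-batch contrastive loss between video-question
# embeddings and answer embeddings (assumed to come from two transformers).
import torch
import torch.nn.functional as F

def contrastive_vqa_loss(vq_emb: torch.Tensor, ans_emb: torch.Tensor) -> torch.Tensor:
    """Each video-question embedding should score highest against its own
    answer embedding among all answers in the batch (diagonal positives)."""
    vq_emb = F.normalize(vq_emb, dim=-1)
    ans_emb = F.normalize(ans_emb, dim=-1)
    logits = vq_emb @ ans_emb.t()                       # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Toy usage: random tensors stand in for the two transformers' outputs.
vq = torch.randn(8, 256)    # video-question multi-modal transformer outputs
ans = torch.randn(8, 256)   # answer transformer outputs
loss = contrastive_vqa_loss(vq, ans)
```

At inference, similarity scores of this kind can be computed against embeddings of an open answer vocabulary, which is what makes the zero-shot VideoQA evaluation described above possible.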
Related papers
- Contrastive Video Question Answering via Video Graph Transformer [184.3679515511028]
We propose a Video Graph Transformer model (CoVGT) to perform question answering (VideoQA) in a Contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT achieves much better performance than previous methods on video reasoning tasks.
arXiv Detail & Related papers (2023-02-27T11:09:13Z) - Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, including MSRVTT-QA, MSVD-QA, and iVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z) - Zero-Shot Video Question Answering via Frozen Bidirectional Language Models [89.71617065426146]
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training.
Recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs.
We build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA.
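A rough, text-only sketch of the masked-language-model answering mechanism behind this approach follows; the actual method additionally feeds projected video features into the frozen BiLM and trains only lightweight modules, and the model name and candidate answers here are illustrative assumptions.

```python
# Read the answer out of a [MASK] position of a bidirectional (masked) LM.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

prompt = "Question: what is the person cutting? Answer: [MASK]."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                        # (1, seq_len, vocab)

mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero()[0, 1]
candidates = ["bread", "paper", "wood"]                    # hypothetical answer vocabulary
cand_ids = [tokenizer.convert_tokens_to_ids(c) for c in candidates]
scores = logits[0, mask_pos, cand_ids]
print(candidates[scores.argmax().item()])                  # highest-scoring candidate answer
```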
arXiv Detail & Related papers (2022-06-16T13:18:20Z) - Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering [18.664991529995664]
We recast the existing multiple-choice video question answering task as open-ended video question answering.
To tackle open-ended question answering, we use the pretrained GPT2 model.
An ablation study that converts the existing DramaQA dataset to an open-ended question answering setting shows that performance improves when video metadata is used.
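A minimal sketch of conditioning a pretrained GPT-2 on video metadata for open-ended answering might look as follows; the prompt format, metadata fields, and decoding settings are assumptions rather than the paper's exact setup.

```python
# Concatenate video metadata with the question and let GPT-2 complete the answer.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

metadata = "Scene: kitchen. Subtitle: Dokyung pours water into the kettle."
question = "What is Dokyung doing?"
prompt = f"{metadata}\nQuestion: {question}\nAnswer:"

inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=20,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated continuation after the prompt.
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer.strip())
```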
arXiv Detail & Related papers (2021-08-11T11:11:43Z) - QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
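A hypothetical data structure reflecting these three annotation types is sketched below; the field names are illustrative and do not correspond to the dataset's actual JSON schema.

```python
# Illustrative record for one QVHighlights-style annotation.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class QVHighlightsAnnotation:
    video_id: str
    query: str                                   # human-written free-form NL query
    relevant_windows: List[Tuple[float, float]]  # (start, end) moments in seconds w.r.t. the query
    saliency_scores: List[int]                   # five-point scale scores for query-relevant clips

example = QVHighlightsAnnotation(
    video_id="abc123",
    query="A chef explains how to season the soup.",
    relevant_windows=[(12.0, 38.0), (70.0, 82.0)],
    saliency_scores=[4, 3, 5],
)
```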
arXiv Detail & Related papers (2021-07-20T16:42:58Z) - End-to-End Video Question-Answer Generation with Generator-Pretester Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) problem in multimedia.
Since captions neither fully represent a video nor are they always practically available, it is crucial to generate question-answer pairs directly from a video via Video Question-Answer Generation (VQAG).
We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance.
arXiv Detail & Related papers (2021-01-05T10:46:06Z) - Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
arXiv Detail & Related papers (2020-12-01T12:59:20Z)