Just Ask: Learning to Answer Questions from Millions of Narrated Videos
- URL: http://arxiv.org/abs/2012.00451v2
- Date: Tue, 30 Mar 2021 14:33:37 GMT
- Title: Just Ask: Learning to Answer Questions from Millions of Narrated Videos
- Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
- Abstract summary: We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
- Score: 97.44376735445454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent methods for visual question answering rely on large-scale annotated
datasets. Manual annotation of questions and answers for videos, however, is
tedious, expensive and prevents scalability. In this work, we propose to avoid
manual annotation and generate a large-scale training dataset for video
question answering making use of automatic cross-modal supervision. We leverage
a question generation transformer trained on text data and use it to generate
question-answer pairs from transcribed video narrations. Given narrated videos,
we then automatically generate the HowToVQA69M dataset with 69M
video-question-answer triplets. To handle the open vocabulary of diverse
answers in this dataset, we propose a training procedure based on a contrastive
loss between a video-question multi-modal transformer and an answer
transformer. We introduce the zero-shot VideoQA task and show excellent
results, in particular for rare answers. Furthermore, we demonstrate our method
to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA,
ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce a
new VideoQA dataset with reduced language biases and high-quality redundant
manual annotations. Our code and datasets will be made publicly available at
https://antoyang.github.io/just-ask.html.
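As a rough illustration of the training procedure described in the abstract, the contrastive objective between a video-question representation and an answer representation can be sketched as follows. The encoders below are toy stand-ins for the video-question multi-modal transformer and the answer transformer; the names, dimensions and temperature are illustrative, not the paper's settings.

```python
# Minimal sketch of a contrastive objective between video-question embeddings
# and answer embeddings with in-batch negatives. The encoders are illustrative
# stand-ins, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVideoQuestionEncoder(nn.Module):
    """Stand-in for the video-question multi-modal transformer."""

    def __init__(self, video_dim=512, text_dim=300, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(video_dim + text_dim, embed_dim)

    def forward(self, video_feats, question_feats):
        fused = torch.cat([video_feats, question_feats], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)


class ToyAnswerEncoder(nn.Module):
    """Stand-in for the answer transformer."""

    def __init__(self, text_dim=300, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(text_dim, embed_dim)

    def forward(self, answer_feats):
        return F.normalize(self.proj(answer_feats), dim=-1)


def contrastive_loss(vq_embed, ans_embed, temperature=0.07):
    """Pull matching (video-question, answer) pairs together and push apart
    the in-batch negatives."""
    logits = vq_embed @ ans_embed.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))           # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    batch = 8
    vq_encoder, ans_encoder = ToyVideoQuestionEncoder(), ToyAnswerEncoder()
    vq = vq_encoder(torch.randn(batch, 512), torch.randn(batch, 300))
    ans = ans_encoder(torch.randn(batch, 300))
    print(contrastive_loss(vq, ans).item())
```

With such an objective, zero-shot VideoQA can then be cast as retrieving the answer embedding that is most similar to the video-question embedding.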
Related papers
- Contrastive Video Question Answering via Video Graph Transformer [184.3679515511028]
We propose a Video Graph Transformer model (CoVGT) to perform question answering (VideoQA) in a Contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT achieves much better performance than previous methods on video reasoning tasks.
arXiv Detail & Related papers (2023-02-27T11:09:13Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, and IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
- Zero-Shot Video Question Answering via Frozen Bidirectional Language Models [89.71617065426146]
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training.
Recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs.
We build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA.
arXiv Detail & Related papers (2022-06-16T13:18:20Z)
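A minimal text-only sketch of the frozen bidirectional (masked) language model idea above for zero-shot answer selection. The actual method additionally feeds video features to the frozen language model through small trained modules, which are omitted here; the checkpoint and prompt format are illustrative assumptions.

```python
# Text-only illustration of zero-shot answer selection with a frozen
# bidirectional (masked) language model; the cited approach additionally
# conditions the frozen LM on video features via lightweight trained modules.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-uncased"  # example checkpoint, not the cited model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()  # frozen: no weight updates


def pick_answer(question, candidates):
    """Rank single-token candidate answers by the masked-LM logit at the mask position."""
    prompt = f"Question: {question} Answer: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    candidate_ids = tokenizer.convert_tokens_to_ids(candidates)
    best = max(zip(candidates, candidate_ids), key=lambda pair: logits[pair[1]].item())
    return best[0]


print(pick_answer("What animal is the person feeding?", ["dog", "cat", "horse"]))
```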
- Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
arXiv Detail & Related papers (2022-05-10T16:34:26Z)
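The narration-to-question-answer step summarized above can be sketched with a text-to-text transformer as below. Here "t5-base" is only a stand-in checkpoint and the prompt format is an assumption; in the described pipeline the generation models are transformers trained on text-only question-answering data, and the candidate answer is extracted from the narration before the question is generated.

```python
# Sketch of generating a question from a transcribed narration and a candidate
# answer with a text-to-text transformer. "t5-base" and the prompt format are
# stand-ins; a checkpoint fine-tuned for question generation is needed in practice.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")


def generate_question(narration, answer):
    """Generate a question whose answer is `answer`, conditioned on the narration."""
    prompt = f"generate question: answer: {answer} context: {narration}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


narration = "she spreads the tomato sauce evenly over the pizza dough"
print(generate_question(narration, answer="tomato sauce"))
# The resulting question-answer pair is then attached to the video clip whose
# narration produced it, yielding a video-question-answer triplet.
```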
- Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering [18.664991529995664]
We go beyond existing multiple-choice video question answering by reformulating it as open-ended video question answering.
To tackle open-ended question answering, we use the pretrained GPT2 model.
An ablation study is performed by converting the existing DramaQA dataset to an open-ended question-answering setting; it shows that performance can be improved using video metadata.
arXiv Detail & Related papers (2021-08-11T11:11:43Z)
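A rough text-only sketch of the metadata-conditioned GPT-2 idea above: video metadata is prepended to the question and the pretrained language model decodes a free-form answer. The prompt format and metadata fields are illustrative assumptions, and the cited work additionally fine-tunes on DramaQA converted to an open-ended setting.

```python
# Rough sketch of open-ended answering with a pretrained GPT-2 conditioned on
# video metadata. Prompt format and metadata fields are illustrative only.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


def answer_with_metadata(metadata, question):
    meta_text = " ".join(f"{key}: {value}." for key, value in metadata.items())
    prompt = f"{meta_text} Question: {question} Answer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
    )
    # Keep only the continuation generated after the prompt.
    return tokenizer.decode(output_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)


metadata = {"scene": "kitchen", "characters": "Haeyoung, Dokyung"}
print(answer_with_metadata(metadata, "What is Haeyoung holding?"))
```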
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
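The annotation format described above could be represented roughly as the following structure; the field names are assumptions based on this summary, not the dataset's actual keys.

```python
# Rough data-structure view of a QVHighlights annotation as described above.
# Field names are assumptions, not the dataset's actual JSON keys.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QVHighlightsAnnotation:
    video_id: str
    query: str                                    # human-written free-form NL query
    relevant_moments: List[Tuple[float, float]]   # (start, end) moments w.r.t. the query
    saliency_scores: List[int]                    # five-point scores for query-relevant clips


example = QVHighlightsAnnotation(
    video_id="abc123",
    query="A chef explains how to fold dumplings.",
    relevant_moments=[(35.0, 72.0)],
    saliency_scores=[4, 5, 3],
)
print(example)
```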
- End-to-End Video Question-Answer Generation with Generator-Pretester Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia.
As captions neither fully represent a video nor are they always practically available, it is crucial to generate question-answer pairs directly from a video.
We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance.
arXiv Detail & Related papers (2021-01-05T10:46:06Z)