Just Ask: Learning to Answer Questions from Millions of Narrated Videos
- URL: http://arxiv.org/abs/2012.00451v2
- Date: Tue, 30 Mar 2021 14:33:37 GMT
- Title: Just Ask: Learning to Answer Questions from Millions of Narrated Videos
- Authors: Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid
- Abstract summary: We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
- Score: 97.44376735445454
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent methods for visual question answering rely on large-scale annotated
datasets. Manual annotation of questions and answers for videos, however, is
tedious, expensive and prevents scalability. In this work, we propose to avoid
manual annotation and generate a large-scale training dataset for video
question answering making use of automatic cross-modal supervision. We leverage
a question generation transformer trained on text data and use it to generate
question-answer pairs from transcribed video narrations. Given narrated videos,
we then automatically generate the HowToVQA69M dataset with 69M
video-question-answer triplets. To handle the open vocabulary of diverse
answers in this dataset, we propose a training procedure based on a contrastive
loss between a video-question multi-modal transformer and an answer
transformer. We introduce the zero-shot VideoQA task and show excellent
results, in particular for rare answers. Furthermore, we demonstrate our method
to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA,
ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce a
new VideoQA dataset with reduced language biases and high-quality redundant
manual annotations. Our code and datasets will be made publicly available at
https://antoyang.github.io/just-ask.html.
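As a rough illustration of the training procedure described in the abstract, the contrastive objective between a video-question representation and an answer representation can be sketched as follows. The encoders below are toy stand-ins for the video-question multi-modal transformer and the answer transformer; the names, dimensions and temperature are illustrative, not the paper's settings.

```python
# Minimal sketch of a contrastive objective between video-question embeddings
# and answer embeddings with in-batch negatives. The encoders are illustrative
# stand-ins, not the paper's architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyVideoQuestionEncoder(nn.Module):
    """Stand-in for the video-question multi-modal transformer."""

    def __init__(self, video_dim=512, text_dim=300, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(video_dim + text_dim, embed_dim)

    def forward(self, video_feats, question_feats):
        fused = torch.cat([video_feats, question_feats], dim=-1)
        return F.normalize(self.proj(fused), dim=-1)


class ToyAnswerEncoder(nn.Module):
    """Stand-in for the answer transformer."""

    def __init__(self, text_dim=300, embed_dim=256):
        super().__init__()
        self.proj = nn.Linear(text_dim, embed_dim)

    def forward(self, answer_feats):
        return F.normalize(self.proj(answer_feats), dim=-1)


def contrastive_loss(vq_embed, ans_embed, temperature=0.07):
    """Pull matching (video-question, answer) pairs together and push apart
    the in-batch negatives."""
    logits = vq_embed @ ans_embed.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))           # positives lie on the diagonal
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    batch = 8
    vq_encoder, ans_encoder = ToyVideoQuestionEncoder(), ToyAnswerEncoder()
    vq = vq_encoder(torch.randn(batch, 512), torch.randn(batch, 300))
    ans = ans_encoder(torch.randn(batch, 300))
    print(contrastive_loss(vq, ans).item())
```

With such an objective, zero-shot VideoQA can then be cast as retrieving the answer embedding that is most similar to the video-question embedding.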
Related papers
- Contrastive Video Question Answering via Video Graph Transformer [184.3679515511028]
We propose a Video Graph Transformer model (CoVGT) to perform question answering (VideoQA) in a Contrastive manner.
CoVGT's uniqueness and superiority are three-fold.
We show that CoVGT achieves much better performance than previous methods on video reasoning tasks.
arXiv Detail & Related papers (2023-02-27T11:09:13Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, and IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
- Zero-Shot Video Question Answering via Frozen Bidirectional Language Models [89.71617065426146]
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training.
Recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs.
We build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA.
arXiv Detail & Related papers (2022-06-16T13:18:20Z)
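A minimal text-only sketch of the frozen bidirectional (masked) language model idea above for zero-shot answer selection. The actual method additionally feeds video features to the frozen language model through small trained modules, which are omitted here; the checkpoint and prompt format are illustrative assumptions.

```python
# Text-only illustration of zero-shot answer selection with a frozen
# bidirectional (masked) language model; the cited approach additionally
# conditions the frozen LM on video features via lightweight trained modules.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

MODEL_NAME = "bert-base-uncased"  # example checkpoint, not the cited model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME).eval()  # frozen: no weight updates


def pick_answer(question, candidates):
    """Rank single-token candidate answers by the masked-LM logit at the mask position."""
    prompt = f"Question: {question} Answer: {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    candidate_ids = tokenizer.convert_tokens_to_ids(candidates)
    best = max(zip(candidates, candidate_ids), key=lambda pair: logits[pair[1]].item())
    return best[0]


print(pick_answer("What animal is the person feeding?", ["dog", "cat", "horse"]))
```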
- Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
arXiv Detail & Related papers (2022-05-10T16:34:26Z)
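The narration-to-question-answer step summarized above can be sketched with a text-to-text transformer as below. Here "t5-base" is only a stand-in checkpoint and the prompt format is an assumption; in the described pipeline the generation models are transformers trained on text-only question-answering data, and the candidate answer is extracted from the narration before the question is generated.

```python
# Sketch of generating a question from a transcribed narration and a candidate
# answer with a text-to-text transformer. "t5-base" and the prompt format are
# stand-ins; a checkpoint fine-tuned for question generation is needed in practice.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-base")


def generate_question(narration, answer):
    """Generate a question whose answer is `answer`, conditioned on the narration."""
    prompt = f"generate question: answer: {answer} context: {narration}"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=32)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


narration = "she spreads the tomato sauce evenly over the pizza dough"
print(generate_question(narration, answer="tomato sauce"))
# The resulting question-answer pair is then attached to the video clip whose
# narration produced it, yielding a video-question-answer triplet.
```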
- Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering [18.664991529995664]
We go beyond existing multiple-choice video question answering by reformulating it as open-ended video question answering.
To tackle open-ended question answering, we use the pretrained GPT2 model.
An ablation study is performed by converting the existing DramaQA dataset to an open-ended question-answering setting; it shows that performance can be improved using video metadata.
arXiv Detail & Related papers (2021-08-11T11:11:43Z)
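A rough text-only sketch of the metadata-conditioned GPT-2 idea above: video metadata is prepended to the question and the pretrained language model decodes a free-form answer. The prompt format and metadata fields are illustrative assumptions, and the cited work additionally fine-tunes on DramaQA converted to an open-ended setting.

```python
# Rough sketch of open-ended answering with a pretrained GPT-2 conditioned on
# video metadata. Prompt format and metadata fields are illustrative only.
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")


def answer_with_metadata(metadata, question):
    meta_text = " ".join(f"{key}: {value}." for key, value in metadata.items())
    prompt = f"{meta_text} Question: {question} Answer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        max_new_tokens=20,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token by default
    )
    # Keep only the continuation generated after the prompt.
    return tokenizer.decode(output_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)


metadata = {"scene": "kitchen", "characters": "Haeyoung, Dokyung"}
print(answer_with_metadata(metadata, "What is Haeyoung holding?"))
```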
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
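The annotation format described above could be represented roughly as the following structure; the field names are assumptions based on this summary, not the dataset's actual keys.

```python
# Rough data-structure view of a QVHighlights annotation as described above.
# Field names are assumptions, not the dataset's actual JSON keys.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QVHighlightsAnnotation:
    video_id: str
    query: str                                    # human-written free-form NL query
    relevant_moments: List[Tuple[float, float]]   # (start, end) moments w.r.t. the query
    saliency_scores: List[int]                    # five-point scores for query-relevant clips


example = QVHighlightsAnnotation(
    video_id="abc123",
    query="A chef explains how to fold dumplings.",
    relevant_moments=[(35.0, 72.0)],
    saliency_scores=[4, 5, 3],
)
print(example)
```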
- End-to-End Video Question-Answer Generation with Generator-Pretester Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia.
As captions neither fully represent a video nor are they always practically available, it is crucial to generate question-answer pairs directly from a video.
We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance.
arXiv Detail & Related papers (2021-01-05T10:46:06Z)