End-to-End Video Question-Answer Generation with Generator-Pretester
Network
- URL: http://arxiv.org/abs/2101.01447v1
- Date: Tue, 5 Jan 2021 10:46:06 GMT
- Title: End-to-End Video Question-Answer Generation with Generator-Pretester
Network
- Authors: Hung-Ting Su, Chen-Hsi Chang, Po-Wei Shen, Yu-Siang Wang, Ya-Liang
Chang, Yu-Cheng Chang, Pu-Jen Cheng and Winston H. Hsu
- Abstract summary: We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia.
As captions neither fully represent a video nor are they always practically available, it is crucial to generate question-answer pairs from a video via Video Question-Answer Generation (VQAG).
We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance.
- Score: 27.31969951281815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a novel task, Video Question-Answer Generation (VQAG), for the
challenging Video Question Answering (Video QA) task in multimedia. Due to
expensive data annotation costs, many widely used, large-scale Video QA
datasets such as Video-QA, MSVD-QA and MSRVTT-QA are automatically annotated
using Caption Question Generation (CapQG) which inputs captions instead of the
video itself. As captions neither fully represent a video, nor are they always
practically available, it is crucial to generate question-answer pairs based on
a video via Video Question-Answer Generation (VQAG). Existing video-to-text
(V2T) approaches, despite taking a video as the input, only generate a question
alone. In this work, we propose a novel model Generator-Pretester Network that
focuses on two components: (1) The Joint Question-Answer Generator (JQAG) which
generates a question with its corresponding answer to allow Video Question
"Answering" training. (2) The Pretester (PT) verifies a generated question by
trying to answer it and checks the pretested answer with both the model's
proposed answer and the ground truth answer. We evaluate our system on the only
two available large-scale human-annotated Video QA datasets and achieve
state-of-the-art question generation performance. Furthermore, when we apply
our generated questions to Video QA applications, models trained only on our
generated QA pairs surpass some supervised baselines. As a pre-training
strategy, we outperform both CapQG and transfer learning approaches when
employing semi-supervised (20%) or fully supervised learning with annotated
data. These experimental results suggest novel perspectives for Video QA
training.
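For a concrete picture of the two components described above, the following is a
minimal, hypothetical PyTorch sketch of the Generator-Pretester idea. Module
names, feature dimensions, the single-vector video context, and the
teacher-forced decoding are illustrative assumptions, not the authors' released
architecture.

```python
# Hypothetical sketch of the Generator-Pretester idea; shapes and names are
# illustrative assumptions, not the paper's released code.
import torch
import torch.nn as nn

class GeneratorPretester(nn.Module):
    def __init__(self, vocab_size, hidden=512, video_dim=2048, num_answers=1000):
        super().__init__()
        self.video_enc = nn.GRU(video_dim, hidden, batch_first=True)
        self.embed = nn.Embedding(vocab_size, hidden)
        # JQAG: decodes a question and proposes an answer from video features.
        self.q_decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, vocab_size)
        self.a_head = nn.Linear(hidden, num_answers)
        # Pretester (PT): tries to answer the question itself.
        self.pretest_enc = nn.GRU(hidden, hidden, batch_first=True)
        self.pretest_head = nn.Linear(hidden, num_answers)

    def forward(self, video_feats, question_in):
        # Encode the video clip into a single context vector (1, B, H).
        _, ctx = self.video_enc(video_feats)
        # JQAG: decode question tokens (teacher-forced here) and propose an answer.
        q_emb = self.embed(question_in)
        q_states, _ = self.q_decoder(q_emb, ctx)
        question_logits = self.q_head(q_states)               # (B, T, vocab)
        proposed_answer_logits = self.a_head(ctx.squeeze(0))  # (B, num_answers)
        # Pretester: answer the question tokens, conditioned on the video context.
        p_states, _ = self.pretest_enc(q_emb, ctx)
        pretested_answer_logits = self.pretest_head(p_states[:, -1])
        return question_logits, proposed_answer_logits, pretested_answer_logits
```

A full training objective would, as the abstract describes, combine the question
generation loss with answer losses that check the Pretester's answer against both
JQAG's proposed answer and the ground-truth answer.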
Related papers
- Can I Trust Your Answer? Visually Grounded Video Question Answering [88.11169242115416]
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding.
We construct NExT-GQA -- an extension of NExT-QA with 10.5$K$ temporal grounding labels tied to the original QA pairs.
arXiv Detail & Related papers (2023-09-04T03:06:04Z) - Open-vocabulary Video Question Answering: A New Benchmark for Evaluating
the Generalizability of Video Question Answering Models [15.994664381976984]
We introduce a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models.
In addition, we introduce a novel GNN-based soft verbalizer that enhances the prediction on rare and unseen answers.
Our ablation studies and qualitative analyses demonstrate that our GNN-based soft verbalizer further improves the model performance.
arXiv Detail & Related papers (2023-08-18T07:45:10Z) - Locate before Answering: Answer Guided Question Localization for Video
Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z) - Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z) - Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
arXiv Detail & Related papers (2022-05-10T16:34:26Z) - Improving Unsupervised Question Answering via Summarization-Informed
Question Generation [47.96911338198302]
Question Generation (QG) is the task of generating a plausible question for a <passage, answer> pair.
We make use of freely available news summary data, transforming declarative sentences into appropriate questions using dependency parsing, named entity recognition and semantic role labeling.
The resulting questions are then combined with the original news articles to train an end-to-end neural QG model.
arXiv Detail & Related papers (2021-09-16T13:08:43Z) - NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z) - Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations (a minimal sketch of this step appears after this list).
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
arXiv Detail & Related papers (2020-12-01T12:59:20Z)