Open-vocabulary Video Question Answering: A New Benchmark for Evaluating
the Generalizability of Video Question Answering Models
- URL: http://arxiv.org/abs/2308.09363v1
- Date: Fri, 18 Aug 2023 07:45:10 GMT
- Authors: Dohwan Ko, Ji Soo Lee, Miso Choi, Jaewon Chu, Jihwan Park, Hyunwoo J.
Kim
- Abstract summary: We introduce a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models.
In addition, we introduce a novel GNN-based soft verbalizer that enhances the prediction on rare and unseen answers.
Our ablation studies and qualitative analyses demonstrate that our GNN-based soft verbalizer further improves the model performance.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Question Answering (VideoQA) is a challenging task that entails complex
multi-modal reasoning. In contrast to multiple-choice VideoQA which aims to
predict the answer given several options, the goal of open-ended VideoQA is to
answer questions without restricting candidate answers. However, the majority
of previous VideoQA models formulate open-ended VideoQA as a classification
task to classify the video-question pairs into a fixed answer set, i.e.,
closed-vocabulary, which contains only frequent answers (e.g., top-1000
answers). This biases the model toward frequent answers and causes it to fail
to generalize to out-of-vocabulary answers. We hence propose a new
benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the
generalizability of VideoQA models by considering rare and unseen answers. In
addition, in order to improve the model's generalization power, we introduce a
novel GNN-based soft verbalizer that enhances the prediction on rare and unseen
answers by aggregating the information from their similar words. For
evaluation, we introduce new baselines by modifying the existing
(closed-vocabulary) open-ended VideoQA models and improve their performance by
further taking into account rare and unseen answers. Our ablation studies and
qualitative analyses demonstrate that our GNN-based soft verbalizer further
improves the model performance, especially on rare and unseen answers. We hope
that our benchmark OVQA can serve as a guide for evaluating the
generalizability of VideoQA models and inspire future research. Code is
available at https://github.com/mlvlab/OVQA.
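As a rough illustration of the idea, the sketch below smooths answer embeddings over a k-nearest-neighbor word-similarity graph before scoring them against the video-question feature. This is a minimal PyTorch sketch of a GNN-style soft verbalizer, not the authors' implementation: the GloVe-style embeddings, the k-NN graph, and the single aggregation step are all assumptions.

```python
# Minimal sketch of a GNN-style soft verbalizer (illustrative, not the
# authors' code). Rare/unseen answer embeddings are smoothed by aggregating
# the embeddings of semantically similar answer words, so the classifier
# head produces usable scores beyond frequent answers.
import torch
import torch.nn.functional as F

class SoftVerbalizer(torch.nn.Module):
    def __init__(self, answer_emb: torch.Tensor, k: int = 5, alpha: float = 0.5):
        super().__init__()
        # answer_emb: (num_answers, dim) word vectors (e.g., GloVe) for every
        # candidate answer, including rare and unseen ones.
        self.register_buffer("answer_emb", answer_emb)
        sim = F.normalize(answer_emb, dim=-1) @ F.normalize(answer_emb, dim=-1).T
        # Assumed k-nearest-neighbor similarity graph over answers.
        topk = sim.topk(k + 1, dim=-1).indices[:, 1:]      # drop the self-loop
        adj = torch.zeros_like(sim).scatter_(1, topk, 1.0)
        adj = adj / adj.sum(dim=-1, keepdim=True)          # row-normalize
        self.register_buffer("adj", adj)
        self.alpha = alpha                                 # blend factor

    def forward(self, vq_feat: torch.Tensor) -> torch.Tensor:
        # One graph-aggregation step: each answer mixes in its neighbors.
        smoothed = self.alpha * self.answer_emb \
            + (1 - self.alpha) * self.adj @ self.answer_emb
        # Logits = similarity between the video-question feature and every
        # smoothed answer embedding.
        return vq_feat @ smoothed.T

emb = torch.randn(1000, 300)              # stand-in for pretrained vectors
verbalizer = SoftVerbalizer(emb)
logits = verbalizer(torch.randn(4, 300))  # (batch=4, num_answers=1000)
```

Because unseen answers borrow signal from their neighbors' embeddings, they receive meaningful logits even when no training example ever carried that label.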
Related papers
- GSQA: An End-to-End Model for Generative Spoken Question Answering [54.418723701886115]
We introduce the first end-to-end Generative Spoken Question Answering (GSQA) model that empowers the system to engage in abstractive reasoning.
Our model surpasses the previous extractive model by 3% on extractive QA datasets.
Our GSQA model shows the potential to generalize to a broad spectrum of questions, thus extending spoken question answering from extractive to abstractive QA.
arXiv Detail & Related papers (2023-12-15T13:33:18Z)
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
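As a toy sketch of the locate-before-answering pattern (the soft temporal attention and module sizes below are assumptions, not the actual LocAns architecture):

```python
# Toy sketch: a question locator scores frames, and the answer predictor
# reads only the question-relevant portion of the video.
import torch

class LocateThenAnswer(torch.nn.Module):
    def __init__(self, dim: int = 256, num_answers: int = 1000):
        super().__init__()
        self.locator = torch.nn.Linear(dim * 2, 1)      # scores each frame
        self.predictor = torch.nn.Linear(dim * 2, num_answers)

    def forward(self, frames: torch.Tensor, question: torch.Tensor):
        # frames: (B, T, dim) frame features; question: (B, dim).
        q = question.unsqueeze(1).expand(-1, frames.size(1), -1)
        scores = self.locator(torch.cat([frames, q], dim=-1)).squeeze(-1)
        attn = scores.softmax(dim=-1)                    # soft localization
        located = (attn.unsqueeze(-1) * frames).sum(1)   # relevant segment
        return self.predictor(torch.cat([located, question], dim=-1)), attn

model = LocateThenAnswer()
logits, attn = model(torch.randn(2, 32, 256), torch.randn(2, 256))
```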
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, including MSRVTT-QA, MSVD-QA, and IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
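A rough sketch of the co-tokenization idea, assuming a small set of learned tokens that iteratively cross-attend over concatenated video and text features; the layer sizes and iteration count are illustrative, not the paper's configuration:

```python
# Rough sketch of iterative video-text co-tokenization: a compact set of
# learned tokens repeatedly attends over (video + text) features, so later
# layers process a few tokens instead of the full frame grid.
import torch

class CoTokenizer(torch.nn.Module):
    def __init__(self, dim: int = 256, num_tokens: int = 8, iters: int = 2):
        super().__init__()
        self.tokens = torch.nn.Parameter(torch.randn(num_tokens, dim))
        self.attn = torch.nn.MultiheadAttention(dim, num_heads=4,
                                                batch_first=True)
        self.iters = iters

    def forward(self, video: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # video: (B, N_v, dim) dense features; text: (B, N_t, dim) tokens.
        context = torch.cat([video, text], dim=1)
        tok = self.tokens.unsqueeze(0).expand(video.size(0), -1, -1)
        for _ in range(self.iters):        # iteratively refine the tokens
            tok, _ = self.attn(tok, context, context)
        return tok                          # (B, num_tokens, dim)

tok = CoTokenizer()(torch.randn(2, 512, 256), torch.randn(2, 16, 256))
```

Compute drops because everything downstream operates on the handful of fused tokens, which is the intuition behind the reported GFLOP reduction.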
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
- Invariant Grounding for Video Question Answering [72.87173324555846]
Video Question Answering (VideoQA) is the task of answering questions about a video.
In leading VideoQA models, the typical learning objective, empirical risk minimization (ERM), latches onto superficial correlations between video-question pairs and answers.
We propose a new learning framework, Invariant Grounding for VideoQA (IGV), to ground the question-critical scene.
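A loss-level sketch of the invariant-grounding intuition, with an assumed soft grounding mask and a KL-based invariance penalty (not the paper's exact objective): swapping the non-causal "environment" between videos should leave the prediction unchanged.

```python
# Loss-level sketch of invariant grounding (the soft mask and the KL
# penalty are assumptions, not the paper's exact loss).
import torch
import torch.nn.functional as F

def igv_loss(frames, question, labels, grounder, answerer, beta=0.5):
    # frames: (B, T, D) frame features; question: (B, D); labels: (B,).
    mask = torch.sigmoid(grounder(frames, question))   # (B, T, 1)
    causal = mask * frames                             # question-critical scene
    env = (1.0 - mask) * frames                        # residual "environment"
    logits = answerer(causal + env, question)
    # Intervention: splice in another sample's environment; the answer
    # grounded on the causal scene should not change.
    logits_iv = answerer(causal + env[torch.randperm(env.size(0))], question)
    task = F.cross_entropy(logits, labels)
    invariance = F.kl_div(logits_iv.log_softmax(-1), logits.softmax(-1),
                          reduction="batchmean")
    return task + beta * invariance

# Stand-in modules, just to make the sketch executable:
W = torch.randn(256, 1000)
grounder = lambda f, q: f.mean(-1, keepdim=True)       # frame scorer
answerer = lambda f, q: f.mean(1) @ W                  # answer head
loss = igv_loss(torch.randn(4, 32, 256), torch.randn(4, 256),
                torch.randint(0, 1000, (4,)), grounder, answerer)
```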
arXiv Detail & Related papers (2022-06-06T04:37:52Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multiple-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
- Video Question Answering with Phrases via Semantic Roles [40.72894813542082]
Video Question Answering (VidQA) evaluation metrics have been limited to single-word answers or to selecting a phrase from a fixed set of phrases.
We leverage semantic roles derived from video descriptions to mask out certain phrases and introduce VidQAP, which poses VidQA as a fill-in-the-phrase task.
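A tiny self-contained illustration of the fill-in-the-phrase construction; the hand-written role spans below stand in for the output of a semantic-role labeler.

```python
# Mask one semantic-role argument of a caption to form the query, keeping
# the masked span as the target phrase (role spans are hand-written here
# as a stand-in for an SRL system's output).
def make_fill_in_the_phrase(caption: str, roles: dict, masked_role: str):
    span = roles[masked_role]
    query = caption.replace(span, f"[{masked_role}]")
    return query, span

caption = "a man throws a frisbee to his dog in the park"
roles = {"ARG0": "a man", "V": "throws",
         "ARG1": "a frisbee", "ARGM-LOC": "in the park"}

print(make_fill_in_the_phrase(caption, roles, "ARG1"))
# ('a man throws [ARG1] to his dog in the park', 'a frisbee')
```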
arXiv Detail & Related papers (2021-04-08T13:27:43Z)
- End-to-End Video Question-Answer Generation with Generator-Pretester Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia.
As captions neither fully represent a video nor are always practically available, it is crucial to generate question-answer pairs directly from a video.
We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance.
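The generator-pretester loop can be summarized by the following control-flow sketch, where both callables are placeholders rather than APIs from the paper's code: a generated pair is kept only if the pretester recovers the generated answer.

```python
# Control-flow sketch of the generator-pretester idea; `generate_qa_pairs`
# and `pretester_answer` are hypothetical placeholders.
def filter_qa_pairs(video_feat, generate_qa_pairs, pretester_answer):
    kept = []
    for question, answer in generate_qa_pairs(video_feat):
        # Keep a pair only if the pretester (a QA model) can actually
        # answer the generated question with the generated answer.
        if pretester_answer(video_feat, question) == answer:
            kept.append((question, answer))
    return kept

# Dummy usage with stand-in callables:
pairs = filter_qa_pairs(
    video_feat=None,
    generate_qa_pairs=lambda v: [("what is the man holding?", "frisbee")],
    pretester_answer=lambda v, q: "frisbee",
)
```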
arXiv Detail & Related papers (2021-01-05T10:46:06Z)
- Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show that our method significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
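A hypothetical sketch of the narration-to-QA mining step using an off-the-shelf text-to-text question-generation checkpoint; the model name and the highlight prompt format are assumptions, not the paper's setup (the paper trains its own question-generation transformer on text QA data).

```python
# Hypothetical sketch of generating weak QA supervision from transcribed
# narrations. The checkpoint name and <hl> prompt format are assumptions;
# any text-to-text question-generation model can play this role.
from transformers import pipeline

qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")

def narration_to_qa(narration: str, answer_span: str):
    # Highlight the chosen answer span so the model asks about it; the
    # resulting (question, answer) pair becomes a weak VideoQA example.
    prompt = "generate question: " + narration.replace(
        answer_span, f"<hl> {answer_span} <hl>")
    question = qg(prompt)[0]["generated_text"]
    return question, answer_span

print(narration_to_qa("the chef slices the onions before frying them",
                      "the onions"))
```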
arXiv Detail & Related papers (2020-12-01T12:59:20Z)