WildQA: In-the-Wild Video Question Answering
- URL: http://arxiv.org/abs/2209.06650v1
- Date: Wed, 14 Sep 2022 13:54:07 GMT
- Title: WildQA: In-the-Wild Video Question Answering
- Authors: Santiago Castro, Naihao Deng, Pingxuan Huang, Mihai Burzo, Rada Mihalcea
- Abstract summary: We propose WILDQA, a video understanding dataset of videos recorded in outdoor settings.
We also introduce the new task of identifying visual support for a given question and answer.
- Score: 22.065516207195323
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Existing video understanding datasets mostly focus on human interactions,
with little attention being paid to the "in the wild" settings, where the
videos are recorded outdoors. We propose WILDQA, a video understanding dataset
of videos recorded in outdoor settings. In addition to video question answering
(Video QA), we also introduce the new task of identifying visual support for a
given question and answer (Video Evidence Selection). Through evaluations using
a wide range of baseline models, we show that WILDQA poses new challenges to
the vision and language research communities. The dataset is available at
https://lit.eecs.umich.edu/wildqa/.
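The abstract names the two tasks but not the data layout. As a loose illustration only, the sketch below assumes each example pairs a free-form question and answer with one or more (start, end) evidence spans in seconds; the field names, the sample record, and the temporal-IoU helper for scoring Video Evidence Selection are assumptions made here, not the released WILDQA schema or evaluation code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical record layout for the two WILDQA tasks (Video QA and
# Video Evidence Selection). The real schema at
# https://lit.eecs.umich.edu/wildqa/ may differ; this is illustration only.
@dataclass
class WildQAExample:
    video_id: str
    question: str
    answer: str  # free-form answer text (Video QA)
    # (start, end) timestamps in seconds (Video Evidence Selection)
    evidence: List[Tuple[float, float]] = field(default_factory=list)

def span_iou(pred: Tuple[float, float], gold: Tuple[float, float]) -> float:
    """Temporal IoU between a predicted and a gold evidence span."""
    inter = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = max(pred[1], gold[1]) - min(pred[0], gold[0])
    return inter / union if union > 0 else 0.0

# Invented example, not drawn from the dataset.
example = WildQAExample(
    video_id="wild_0001",
    question="What weather conditions are visible in the video?",
    answer="It is snowing heavily over a forested hillside.",
    evidence=[(12.0, 27.5)],
)
print(round(span_iou((10.0, 25.0), example.evidence[0]), 2))  # -> 0.74
```

Free-form answers would more likely be scored with text metrics such as token-level F1 or ROUGE; the IoU helper only conveys how predicted evidence spans could be compared against gold spans.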
Related papers
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, and iVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
- Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
arXiv Detail & Related papers (2022-05-10T16:34:26Z)
- Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analysis of VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z)
- NEWSKVQA: Knowledge-Aware News Video Question Answering [5.720640816755851]
We explore a new frontier in video question answering: answering knowledge-based questions in the context of news videos.
We curate a new dataset of 12K news videos spanning 156 hours, with 1M multiple-choice question-answer pairs covering 8,263 unique entities.
We propose a novel approach, NEWSKVQA, which performs multi-modal inference over textual multiple-choice questions, videos, their transcripts, and a knowledge base.
arXiv Detail & Related papers (2022-02-08T17:31:31Z)
- QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries [89.24431389933703]
We present the Query-based Video Highlights (QVHighlights) dataset.
It consists of over 10,000 YouTube videos, covering a wide range of topics.
Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
arXiv Detail & Related papers (2021-07-20T16:42:58Z)
- Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations; a rough illustrative sketch of this idea appears after this list.
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
arXiv Detail & Related papers (2020-12-01T12:59:20Z)
- Video Question Answering on Screencast Tutorials [43.00474548031818]
We introduce a dataset of question, answer, and context triples drawn from tutorial videos for a software product.
A one-shot recognition algorithm is designed to extract visual cues, which helps improve video question answering performance.
arXiv Detail & Related papers (2020-08-02T19:27:42Z)
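The Just Ask and iVQA entries above describe generating weakly supervised VideoQA training pairs from transcribed narrations with a question-generation transformer. Those papers train a dedicated model for this; as a very rough stand-in, the sketch below prompts an off-the-shelf instruction-tuned model (google/flan-t5-base) through the Hugging Face pipeline API, with an invented narration snippet and a hand-picked answer span.

```python
# Rough stand-in for the narration-to-QA-pair idea; not the papers' actual
# question-generation model or training setup.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

# Invented ASR-style narration snippet from a hypothetical instructional video.
narration = (
    "Now I add two cups of flour to the bowl and mix it with the melted "
    "butter before putting the tray in the oven for twenty minutes."
)
answer = "two cups"  # candidate answer span picked from the narration

prompt = (
    f'Generate a question about the following passage whose answer is "{answer}".\n'
    f"Passage: {narration}"
)
question = generator(prompt, max_new_tokens=32)[0]["generated_text"]

# The (video clip, question, answer) triple would then serve as one
# weakly supervised VideoQA training example.
print(question, "->", answer)
```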