YTCommentQA: Video Question Answerability in Instructional Videos
- URL: http://arxiv.org/abs/2401.17343v1
- Date: Tue, 30 Jan 2024 14:18:37 GMT
- Title: YTCommentQA: Video Question Answerability in Instructional Videos
- Authors: Saelyne Yang, Sunghyun Park, Yunseok Jang, Moontae Lee
- Abstract summary: We present the YTCommentQA dataset, which contains naturally-generated questions from YouTube.
The questions are categorized by their answerability and the modality required to answer them -- visual, script, or both.
- Score: 22.673000779017595
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instructional videos provide detailed how-to guides for various tasks, with
viewers often posing questions regarding the content. Addressing these
questions is vital for comprehending the content, yet receiving immediate
answers is difficult. While numerous computational models have been developed
for Video Question Answering (Video QA) tasks, they are primarily trained on
questions generated based on video content, aiming to produce answers from
within the content. However, in real-world situations, users may pose questions
that go beyond the video's informational boundaries, highlighting the necessity
to determine if a video can provide the answer. Discerning whether a question
can be answered by video content is challenging due to the multi-modal nature
of videos, where visual and verbal information are intertwined. To bridge this
gap, we present the YTCommentQA dataset, which contains naturally-generated
questions from YouTube, categorized by their answerability and required
modality to answer -- visual, script, or both. Experiments with answerability
classification tasks demonstrate the complexity of YTCommentQA and emphasize
the need to comprehend the combined role of visual and script information in
video reasoning. The dataset is available at
https://github.com/lgresearch/YTCommentQA.
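The abstract describes a label scheme in which each question is either unanswerable or answerable via the visual track, the script (transcript), or only by combining both. As a minimal illustrative sketch (the class names, fields, and tie-breaking convention below are assumptions, not the dataset's actual schema), the scheme could be modeled like this:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Modality(Enum):
    """Modality required to answer a question, per the paper's categorization."""
    VISUAL = "visual"
    SCRIPT = "script"
    BOTH = "both"


@dataclass
class CommentQuestion:
    """Hypothetical record for one YouTube comment question."""
    video_id: str
    question: str
    answerable: bool
    modality: Optional[Modality]  # None when the question is unanswerable


def required_modality(visual_suffices: bool,
                      script_suffices: bool,
                      combined_suffices: bool) -> Optional[Modality]:
    """Map three answerability judgments onto the dataset-style label.

    - visual_suffices: answerable from visual content alone
    - script_suffices: answerable from the script alone
    - combined_suffices: answerable when both modalities are combined

    Preferring VISUAL when either single modality suffices is an assumed
    tie-break for this sketch, not a rule stated in the paper.
    """
    if visual_suffices:
        return Modality.VISUAL
    if script_suffices:
        return Modality.SCRIPT
    if combined_suffices:
        return Modality.BOTH
    return None  # unanswerable from the video
```

The BOTH label captures the case the abstract highlights: neither modality alone answers the question, so a model must reason over visual and script information jointly.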
Related papers
- MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding [69.04413943858584]
We introduce MoVQA, a long-form movie question-answering dataset.
We also benchmark to assess the diverse cognitive capabilities of multimodal systems.
arXiv Detail & Related papers (2023-12-08T03:33:38Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
- Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
arXiv Detail & Related papers (2022-05-10T16:34:26Z)
- Video Question Answering: Datasets, Algorithms and Challenges [99.9179674610955]
Video Question Answering (VideoQA) aims to answer natural language questions according to the given videos.
This paper provides a clear taxonomy and comprehensive analyses to VideoQA, focusing on the datasets, algorithms, and unique challenges.
arXiv Detail & Related papers (2022-03-02T16:34:09Z)
- NEWSKVQA: Knowledge-Aware News Video Question Answering [5.720640816755851]
We explore a new frontier in video question answering: answering knowledge-based questions in the context of news videos.
We curate a new dataset of 12K news videos spanning 156 hours, with 1M multiple-choice question-answer pairs covering 8,263 unique entities.
We propose a novel approach, NEWSKVQA which performs multi-modal inferencing over textual multiple-choice questions, videos, their transcripts and knowledge base.
arXiv Detail & Related papers (2022-02-08T17:31:31Z)
- NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z)
- End-to-End Video Question-Answer Generation with Generator-Pretester Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia.
Since captions neither fully represent a video nor are always practically available, it is crucial to generate question-answer pairs directly from a video via VQAG.
We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance.
arXiv Detail & Related papers (2021-01-05T10:46:06Z) - Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
arXiv Detail & Related papers (2020-12-01T12:59:20Z) - Video Question Answering on Screencast Tutorials [43.00474548031818]
We introduce a dataset of question, answer, and context triples drawn from tutorial videos for a software product.
A one-shot recognition algorithm is designed to extract visual cues, which helps enhance video question answering performance.
arXiv Detail & Related papers (2020-08-02T19:27:42Z)
- Knowledge-Based Visual Question Answering in Videos [36.23723122336639]
We introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom.
The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions.
Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy.
arXiv Detail & Related papers (2020-04-17T02:06:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.