Video Question Answering on Screencast Tutorials
- URL: http://arxiv.org/abs/2008.00544v1
- Date: Sun, 2 Aug 2020 19:27:42 GMT
- Title: Video Question Answering on Screencast Tutorials
- Authors: Wentian Zhao, Seokhwan Kim, Ning Xu, Hailin Jin
- Abstract summary: We introduce a dataset of question, answer, and context triples collected from tutorial videos for a software product.
A one-shot recognition algorithm is designed to extract visual cues, which helps enhance the performance of video question answering.
- Score: 43.00474548031818
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a new video question answering task on screencast
tutorials. We introduce a dataset of question, answer, and context
triples collected from tutorial videos for a software product. Unlike other video question
answering work, all the answers in our dataset are grounded to the domain
knowledge base. A one-shot recognition algorithm is designed to extract the
visual cues, which helps enhance the performance of video question answering.
We also propose several baseline neural network architectures based on various
aspects of video contexts from the dataset. The experimental results
demonstrate that our proposed models significantly improve question
answering performance by incorporating multi-modal contexts and domain
knowledge.
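As a rough illustration of the multi-modal baselines described above, the sketch below fuses question, visual-cue, and knowledge-base features before scoring answer candidates drawn from the domain knowledge base. All module names, dimensions, and the fusion scheme are assumptions for illustration, not the paper's actual architecture.
```python
# Minimal sketch (assumed architecture, not the paper's exact model): fuse
# question text, visual cues from the screencast, and a knowledge-base
# embedding, then score candidate answers grounded in the domain KB.
import torch
import torch.nn as nn

class MultiModalQABaseline(nn.Module):
    def __init__(self, text_dim=300, visual_dim=512, kb_dim=128,
                 hidden=256, num_answers=1000):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.visual_proj = nn.Linear(visual_dim, hidden)
        self.kb_proj = nn.Linear(kb_dim, hidden)
        # Answer candidates come from the domain knowledge base.
        self.classifier = nn.Linear(hidden * 3, num_answers)

    def forward(self, question_emb, visual_cues, kb_emb):
        # Each input is a pooled feature vector per example.
        fused = torch.cat([
            torch.relu(self.text_proj(question_emb)),
            torch.relu(self.visual_proj(visual_cues)),
            torch.relu(self.kb_proj(kb_emb)),
        ], dim=-1)
        return self.classifier(fused)  # logits over KB-grounded answers

# Usage with random features for a batch of two questions:
model = MultiModalQABaseline()
logits = model(torch.randn(2, 300), torch.randn(2, 512), torch.randn(2, 128))
print(logits.shape)  # torch.Size([2, 1000])
```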
Related papers
- YTCommentQA: Video Question Answerability in Instructional Videos [22.673000779017595]
We present the YTCommentQA dataset, which contains naturally generated questions from YouTube.
Questions are categorized by their answerability and the modality required to answer them -- visual, script, or both.
arXiv Detail & Related papers (2024-01-30T14:18:37Z)
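To make the answerability taxonomy above concrete, here is a hypothetical record schema; the class and field names are illustrative assumptions, not YTCommentQA's actual format.
```python
# Hypothetical record schema for an answerability-annotated question
# (field names are illustrative assumptions, not the dataset's real format).
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Modality(Enum):
    VISUAL = "visual"  # answerable from the video frames alone
    SCRIPT = "script"  # answerable from the transcript alone
    BOTH = "both"      # requires visual and script information together

@dataclass
class CommentQuestion:
    video_id: str
    question: str
    answerable: bool
    modality: Optional[Modality]  # None when the question is unanswerable

example = CommentQuestion("abc123", "What shortcut key was pressed?",
                          True, Modality.VISUAL)
print(example.modality)  # Modality.VISUAL
```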
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, including MSRVTT-QA, MSVD-QA, and iVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
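A minimal sketch of the iterative co-tokenization idea above, under assumed shapes: a small set of learned tokens alternately cross-attends to video and text features, which compresses the heavy video streams and keeps compute low. This is a simplification, not the authors' model.
```python
# Sketch of iterative video-text co-tokenization (assumed design: a few
# learned tokens cross-attend to video and text features in alternation,
# so the multi-stream video features are compressed per iteration).
import torch
import torch.nn as nn

class CoTokenizer(nn.Module):
    def __init__(self, dim=256, num_tokens=8, iterations=3):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(1, num_tokens, dim))
        self.video_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.text_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.iterations = iterations

    def forward(self, video_feats, text_feats):
        # video_feats: (B, Tv, dim) concatenated multi-stream features
        # text_feats:  (B, Tt, dim) question token embeddings
        tokens = self.tokens.expand(video_feats.size(0), -1, -1)
        for _ in range(self.iterations):
            tokens, _ = self.video_attn(tokens, video_feats, video_feats)
            tokens, _ = self.text_attn(tokens, text_feats, text_feats)
        return tokens  # (B, num_tokens, dim) compact joint representation

fused = CoTokenizer()(torch.randn(2, 100, 256), torch.randn(2, 12, 256))
print(fused.shape)  # torch.Size([2, 8, 256])
```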
- Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
arXiv Detail & Related papers (2022-05-10T16:34:26Z)
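The generation pipeline above can be sketched with an off-the-shelf text-to-text model; the checkpoint below is a placeholder assumption, not the authors' trained generator, which they train themselves on text QA data.
```python
# Sketch of generating QA pairs from transcribed narration with a
# text-to-text transformer. The checkpoint is a placeholder assumption.
from transformers import pipeline

qg = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")  # hypothetical choice

# Highlight the intended answer span so the generator produces a
# question whose answer is that span of the narration.
prompt = ("generate question: First we open the layers panel "
          "and then lower the opacity to <hl> 40 percent <hl>.")
print(qg(prompt)[0]["generated_text"])
```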
- NEWSKVQA: Knowledge-Aware News Video Question Answering [5.720640816755851]
We explore a new frontier in video question answering: answering knowledge-based questions in the context of news videos.
We curate a new dataset of 12K news videos spanning 156 hours, with 1M multiple-choice question-answer pairs covering 8263 unique entities.
We propose a novel approach, NEWSKVQA, which performs multi-modal inference over textual multiple-choice questions, videos, their transcripts, and a knowledge base.
arXiv Detail & Related papers (2022-02-08T17:31:31Z)
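One plausible way to realize the multi-modal inference above is to score each answer choice against a fused context vector; the encoders and dimensions below are assumptions, not the actual NEWSKVQA architecture.
```python
# Hedged sketch of multiple-choice scoring over fused modalities (assumed
# fusion scheme and dimensions; not the actual NEWSKVQA model).
import torch
import torch.nn as nn

class MultipleChoiceScorer(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        # Fuse video, transcript, question, and knowledge-base features.
        self.fuse = nn.Linear(4 * dim, dim)
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, video, transcript, question, kb, choices):
        # choices: (B, num_choices, dim); all other inputs: (B, dim)
        context = torch.relu(self.fuse(
            torch.cat([video, transcript, question, kb], dim=-1)))
        context = context.unsqueeze(1).expand(-1, choices.size(1), -1)
        # One logit per answer choice.
        return self.score(torch.cat([context, choices], dim=-1)).squeeze(-1)

scorer = MultipleChoiceScorer()
logits = scorer(*(torch.randn(2, 256) for _ in range(4)), torch.randn(2, 4, 256))
print(logits.argmax(dim=-1))  # predicted choice index per example
```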
- A Survey on Deep Learning Technique for Video Segmentation [147.0767454918527]
Video segmentation plays a critical role in a broad range of practical applications.
Deep learning based approaches have been widely applied to video segmentation and have delivered compelling performance.
arXiv Detail & Related papers (2021-07-02T15:51:07Z)
- Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show that our method significantly outperforms the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
arXiv Detail & Related papers (2020-12-01T12:59:20Z)
- Convolutional Hierarchical Attention Network for Query-Focused Video Summarization [74.48782934264094]
This paper addresses the task of query-focused video summarization, which takes a user's query and a long video as inputs.
We propose a method, named Convolutional Hierarchical Attention Network (CHAN), which consists of two parts: a feature encoding network and a query-relevance computing module.
In the encoding network, we employ a convolutional network with a local self-attention mechanism and a query-aware global attention mechanism to learn the visual information of each shot.
arXiv Detail & Related papers (2020-01-31T04:30:14Z)
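A minimal sketch of the CHAN encoding idea above, under assumed shapes: a convolution for local context, self-attention over shots, then query-aware attention. CHAN's actual encoder is more structured; this is only an approximation.
```python
# Sketch of shot encoding with convolutional local context, self-attention,
# and query-aware global attention (assumed shapes; a simplification of CHAN).
import torch
import torch.nn as nn

class ShotEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # local context
        self.self_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.query_attn = nn.MultiheadAttention(dim, 4, batch_first=True)

    def forward(self, shots, query):
        # shots: (B, num_shots, dim) per-shot visual features
        # query: (B, num_words, dim) embedded user query
        x = self.conv(shots.transpose(1, 2)).transpose(1, 2)
        x, _ = self.self_attn(x, x, x)           # shots attend to all shots
                                                 # (CHAN restricts this locally)
        x, _ = self.query_attn(x, query, query)  # shots attend to the query
        return x  # query-relevant shot representations for summarization

enc = ShotEncoder()
out = enc(torch.randn(2, 20, 256), torch.randn(2, 6, 256))
print(out.shape)  # torch.Size([2, 20, 256])
```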