Knowledge-Based Visual Question Answering in Videos
- URL: http://arxiv.org/abs/2004.08385v1
- Date: Fri, 17 Apr 2020 02:06:26 GMT
- Title: Knowledge-Based Visual Question Answering in Videos
- Authors: Noa Garcia, Mayu Otani, Chenhui Chu, Yuta Nakashima
- Abstract summary: We introduce KnowIT VQA, a video dataset with 24,282 human-generated question-answer pairs about a popular sitcom.
The dataset combines visual, textual and temporal coherence reasoning together with knowledge-based questions.
Our main findings are: (i) the incorporation of knowledge produces outstanding improvements for VQA in video, and (ii) the performance on KnowIT VQA still lags well behind human accuracy.
- Score: 36.23723122336639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel video understanding task by fusing knowledge-based and
video question answering. First, we introduce KnowIT VQA, a video dataset with
24,282 human-generated question-answer pairs about a popular sitcom. The
dataset combines visual, textual and temporal coherence reasoning together with
knowledge-based questions, which require experience gained from watching the
series to answer. Second, we propose a video understanding
model by combining the visual and textual video content with specific knowledge
about the show. Our main findings are: (i) the incorporation of knowledge
produces outstanding improvements for VQA in video, and (ii) the performance on
KnowIT VQA still lags well behind human accuracy, indicating its usefulness for
studying current video modelling limitations.
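
As a rough illustration of the proposed model, the sketch below scores each candidate answer by encoding subtitles, retrieved show knowledge, the question, and the candidate with a BERT-style encoder. This is a minimal sketch of the fusion format only: the class name, the example texts, and the untrained scoring head are invented for illustration, not the authors' implementation.

```python
# Hedged sketch: knowledge-augmented multiple-choice VideoQA scoring.
# The scoring head is untrained here, so the printed distribution is
# meaningless until fine-tuned; only the input fusion format is the point.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class KnowledgeVQAScorer(nn.Module):
    def __init__(self, model_name: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        # One scalar score per (question, candidate answer) pair.
        self.score_head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, token_type_ids):
        # input_ids: (num_candidates, seq_len) -- one row per candidate.
        out = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask,
                           token_type_ids=token_type_ids)
        cls = out.last_hidden_state[:, 0]          # [CLS] embedding per row
        return self.score_head(cls).squeeze(-1)    # (num_candidates,) logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = KnowledgeVQAScorer()

# Invented example inputs (not from the dataset).
subtitles = "A: That's my spot."
knowledge = "This character always sits in the same spot on the couch."
question = "Why is he upset?"
candidates = ["Someone sat in his spot.", "He lost a game.",
              "His friend is late.", "The food is cold."]

# Segment A: video text + retrieved knowledge; segment B: question + answer.
first = [f"{subtitles} {knowledge}"] * len(candidates)
second = [f"{question} {c}" for c in candidates]
batch = tokenizer(first, second, padding=True, truncation=True,
                  max_length=256, return_tensors="pt")

with torch.no_grad():
    logits = model(**batch)
print(logits.softmax(dim=-1))  # distribution over the four candidates
```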
Related papers
- VQA$^2$: Visual Question Answering for Video Quality Assessment [76.81110038738699]
Video Quality Assessment (VQA) is a classic field in low-level visual perception.
Recent studies in the image domain have demonstrated that Visual Question Answering (VQA) can markedly enhance low-level visual quality evaluation.
We introduce the VQA2 Instruction dataset - the first visual question answering instruction dataset that focuses on video quality assessment.
The VQA2 series models interleave visual and motion tokens to enhance the perception of spatial-temporal quality details in videos.
arXiv Detail & Related papers (2024-11-06T09:39:52Z)
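
The interleaving idea in the entry above can be pictured with a few lines of tensor code. The shapes, token counts, and per-frame alternation order below are assumptions for illustration, not details from the paper.

```python
# Hedged sketch: interleaving visual (spatial) and motion tokens per frame.
import torch

def interleave_tokens(visual: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
    """visual: (frames, v_tokens, dim); motion: (frames, m_tokens, dim).
    Alternates each frame's visual tokens with its motion tokens:
    [V_1, M_1, V_2, M_2, ...] so the transformer sees spatial detail and
    temporal change side by side."""
    assert visual.shape[0] == motion.shape[0], "one token group per frame"
    chunks = []
    for f in range(visual.shape[0]):
        chunks.append(visual[f])  # spatial detail for frame f
        chunks.append(motion[f])  # temporal change around frame f
    return torch.cat(chunks, dim=0)

visual = torch.randn(8, 16, 768)  # 8 frames, 16 patch tokens per frame
motion = torch.randn(8, 4, 768)   # 4 motion tokens per frame
print(interleave_tokens(visual, motion).shape)  # torch.Size([160, 768])
```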
- Knowledge Condensation and Reasoning for Knowledge-based VQA [20.808840633377343]
Recent studies retrieve knowledge passages from external knowledge bases and then use them to answer questions.
We propose two synergistic models: Knowledge Condensation model and Knowledge Reasoning model.
Our method achieves state-of-the-art performance on knowledge-based VQA datasets.
arXiv Detail & Related papers (2024-03-15T06:06:06Z)
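
To make the retrieve-then-condense pipeline above concrete, here is a toy version using TF-IDF retrieval as a stand-in for the paper's learned condensation model; the passages and question are invented.

```python
# Hedged sketch: retrieve knowledge passages, then condense to the top-k.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

passages = [
    "The Eiffel Tower is located in Paris, France.",
    "Bananas are rich in potassium.",
    "Paris is the capital of France.",
    "The Great Wall of China is visible from low orbit.",
]
question = "In which country is the Eiffel Tower?"

vectorizer = TfidfVectorizer().fit(passages + [question])
p_vecs = vectorizer.transform(passages)
q_vec = vectorizer.transform([question])

# Retrieval: rank all passages by similarity to the question.
scores = cosine_similarity(q_vec, p_vecs).ravel()
top_k = np.argsort(scores)[::-1][:2]

# Condensation: keep only the most relevant passages and join them into a
# short knowledge context for the answering model.
condensed = " ".join(passages[i] for i in top_k)
print(condensed)
```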
- YTCommentQA: Video Question Answerability in Instructional Videos [22.673000779017595]
We present the YTCommentQA dataset, which contains naturally generated questions from YouTube.
Each question is categorized by its answerability and by the modality required to answer it: visual, script, or both.
arXiv Detail & Related papers (2024-01-30T14:18:37Z)
- A Unified Model for Video Understanding and Knowledge Embedding with Heterogeneous Knowledge Graph Dataset [47.805378137676605]
We propose a heterogeneous dataset that contains multi-modal video entities and rich common-sense relations.
Experiments indicate that combining video understanding embeddings with factual knowledge benefits content-based video retrieval performance.
It also helps the model generate better knowledge graph embeddings, which outperform traditional KGE-based methods on the VRT and VRV tasks.
arXiv Detail & Related papers (2022-11-19T09:00:45Z)
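
The KGE side of the entry above can be illustrated with a classic TransE scorer. The entity counts and index triples below are invented, and the paper's unified video/knowledge model is not reproduced here.

```python
# Hedged sketch: TransE-style knowledge graph embedding scoring.
import torch
import torch.nn as nn

class TransE(nn.Module):
    def __init__(self, num_entities: int, num_relations: int, dim: int = 128):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)

    def score(self, head, relation, tail):
        # TransE: a true triple should satisfy head + relation ~ tail,
        # so a lower L2 distance means a more plausible triple.
        h, r, t = self.ent(head), self.rel(relation), self.ent(tail)
        return (h + r - t).norm(p=2, dim=-1)

model = TransE(num_entities=1000, num_relations=50)
# e.g. (video_entity, "features_actor", person_entity) as index triples
heads = torch.tensor([3, 7])
rels = torch.tensor([1, 4])
tails = torch.tensor([42, 9])
print(model.score(heads, rels, tails))  # lower = more plausible
```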
- VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge [48.457788853408616]
We propose a method to generate, select, and encode external commonsense knowledge alongside visual and textual cues.
We show that VLC-BERT outperforms existing models that utilize static knowledge bases.
arXiv Detail & Related papers (2022-10-24T22:01:17Z)
- REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering [75.53187719777812]
This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose REVIVE, a knowledge-based VQA method that exploits the explicit information in object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing the previous state-of-the-art method by a large margin.
arXiv Detail & Related papers (2022-06-02T17:59:56Z)
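
A loose sketch of obtaining explicit object-region information with an off-the-shelf detector follows; REVIVE's actual regional features come from its own pipeline, so treat this purely as an approximation of the idea.

```python
# Hedged sketch: explicit object regions via a stock Faster R-CNN detector.
import torch
import torchvision

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = torch.rand(3, 480, 640)  # stand-in for a real RGB image in [0, 1]
with torch.no_grad():
    (detection,) = detector([image])

# Keep confident regions; their boxes (and pooled region features, in the
# real method) become explicit region-level inputs to the VQA model.
# A random image will usually yield zero confident regions.
keep = detection["scores"] > 0.8
boxes = detection["boxes"][keep]   # (num_regions, 4) in xyxy pixel coords
labels = detection["labels"][keep]
print(boxes.shape, labels.tolist())
```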
- Transferring Domain-Agnostic Knowledge in Video Question Answering [27.948768254771537]
Video question answering (VideoQA) aims to answer a given question based on a relevant video clip.
In this paper, we investigate a transfer learning method that introduces both domain-agnostic and domain-specific knowledge.
Our experiments show that: (i) domain-agnostic knowledge is transferable, and (ii) our proposed transfer learning framework can boost VideoQA performance effectively.
arXiv Detail & Related papers (2021-10-26T03:58:31Z)
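
A generic sketch of the transfer recipe implied above: reuse a source-domain encoder, re-initialize the task head, and fine-tune on the target domain. The architecture and layer sizes are placeholders, not the paper's setup.

```python
# Hedged sketch: transfer a source-domain encoder to a new VideoQA task.
import torch
import torch.nn as nn

def make_model(num_answers: int) -> nn.Sequential:
    # Encoder (domain-agnostic layers) followed by a task-specific head.
    return nn.Sequential(
        nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, num_answers))

source_model = make_model(num_answers=5)  # stand-in for a trained source model
target_model = make_model(num_answers=4)  # new target-domain task

# Transfer only parameters whose shapes match, i.e. the shared encoder;
# the mismatched answer head stays freshly initialized.
tgt_state = target_model.state_dict()
for name, tensor in source_model.state_dict().items():
    if name in tgt_state and tensor.shape == tgt_state[name].shape:
        tgt_state[name] = tensor
target_model.load_state_dict(tgt_state)

# Freeze the transferred encoder and fine-tune only the new head.
for param in target_model[0].parameters():
    param.requires_grad = False
optimizer = torch.optim.Adam(
    (p for p in target_model.parameters() if p.requires_grad), lr=1e-4)
```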
- KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA [107.7091094498848]
One of the most challenging question types in VQA is one whose answer requires outside knowledge not present in the image.
In this work we study open-domain knowledge: the setting in which the knowledge required to answer a question is not given or annotated at either training or test time.
We tap into two types of knowledge representations and reasoning. The first is implicit knowledge, which can be learned effectively from unsupervised language pre-training and supervised training data with transformer-based models.
arXiv Detail & Related papers (2020-12-20T20:13:02Z)
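
The implicit/symbolic split above can be mimicked with a toy late fusion: a transformer-like score per answer plus a boost from a tiny knowledge graph. All scores, triples, and the fusion weight are invented, and KRISP's actual architecture is more involved.

```python
# Hedged sketch: late-fusing implicit and symbolic answer scores.
from collections import defaultdict

# Implicit branch: scores a transformer-based model might assign to answers.
implicit_scores = {"umbrella": 0.4, "raincoat": 0.3, "sunhat": 0.3}

# Symbolic branch: a tiny knowledge graph of (subject, relation, object).
knowledge_graph = [
    ("umbrella", "used_for", "rain"),
    ("raincoat", "used_for", "rain"),
    ("sunhat", "used_for", "sun"),
]

def symbolic_scores(query_object: str) -> dict:
    """Boost answers connected to the query concept in the graph."""
    scores = defaultdict(float)
    for subj, rel, obj in knowledge_graph:
        if obj == query_object and rel == "used_for":
            scores[subj] += 1.0
    return scores

sym = symbolic_scores("rain")  # the question mentions rain
fused = {a: implicit_scores[a] + 0.5 * sym.get(a, 0.0) for a in implicit_scores}
print(max(fused, key=fused.get))  # "umbrella"
```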
- Knowledge-Routed Visual Question Reasoning: Challenges for Deep Representation Embedding [140.5911760063681]
We propose a novel dataset, Knowledge-Routed Visual Question Reasoning, for VQA model evaluation.
Question-answer pairs are generated from both the Visual Genome scene graph and an external knowledge base with controlled programs.
arXiv Detail & Related papers (2020-12-14T00:33:44Z)
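
A toy version of program-controlled QA generation from a scene graph plus an external knowledge base, in the spirit of this entry; the graph, facts, and question template are all invented.

```python
# Hedged sketch: generate knowledge-routed QA pairs with a controlled program.
scene_graph = [("dog", "on", "sofa"), ("ball", "near", "dog")]
knowledge_base = {"dog": "Dogs are domesticated descendants of wolves."}

def generate_qa(graph, kb):
    """Program: pick a relation, ask about the subject, and require an
    external fact so the answer cannot be read off the image alone."""
    for subj, rel, obj in graph:
        if subj in kb:
            question = f"What is known about the animal {rel} the {obj}?"
            yield question, kb[subj]

for q, a in generate_qa(scene_graph, knowledge_base):
    print("Q:", q)
    print("A:", a)
```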
This list is automatically generated from the titles and abstracts of the papers on this site.