Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions
- URL: http://arxiv.org/abs/2007.08751v1
- Date: Fri, 17 Jul 2020 04:26:38 GMT
- Title: Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions
- Authors: Noa Garcia and Yuta Nakashima
- Abstract summary: We design ROLL, a model for knowledge-based video story question answering.
In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion.
To answer a given question correctly, the information generated by each cognition-inspired task is encoded via Transformers and fused through a modality weighting mechanism.
- Score: 27.63022376316052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To understand movies, humans constantly reason over the dialogues and actions shown in specific scenes and relate them to the overall storyline already seen. Inspired by this behaviour, we design ROLL, a model for knowledge-based video story question answering that leverages three crucial aspects of movie understanding: dialog comprehension, scene reasoning, and storyline recalling. In ROLL, each of these tasks is in charge of extracting rich and diverse information by 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion. To answer a given question correctly, the information generated by each cognition-inspired task is encoded via Transformers and fused through a modality weighting mechanism, which balances the information from the different sources. Exhaustive evaluation demonstrates the effectiveness of our approach, which yields a new state-of-the-art on two challenging video question answering datasets: KnowIT VQA and TVQA+.
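The abstract states that the three branch outputs are encoded via Transformers and fused through a modality weighting mechanism that balances the sources, but it gives no implementation details here. Below is a minimal PyTorch sketch of one way such a weighting could work; the class name ModalityWeightedFusion, the linear scoring heads standing in for the per-branch Transformer encoders, the globally learned softmax weights, and all dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a modality weighting fusion (not the ROLL code).
from typing import List

import torch
import torch.nn as nn


class ModalityWeightedFusion(nn.Module):
    """Fuse answer scores from several modality branches (e.g. dialog,
    unsupervised scene descriptions, external knowledge) with learned
    softmax weights. All names and sizes are assumptions for illustration."""

    def __init__(self, hidden_dim: int = 768, num_modalities: int = 3, num_answers: int = 4):
        super().__init__()
        # One scoring head per modality; in the paper each branch is a
        # Transformer encoder, which a single Linear merely stands in for.
        self.scorers = nn.ModuleList(
            [nn.Linear(hidden_dim, num_answers) for _ in range(num_modalities)]
        )
        # Learnable logits controlling how much each modality contributes.
        self.modality_logits = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, branch_embeddings: List[torch.Tensor]) -> torch.Tensor:
        # branch_embeddings: one (batch, hidden_dim) tensor per modality.
        scores = torch.stack(
            [scorer(h) for scorer, h in zip(self.scorers, branch_embeddings)],
            dim=1,
        )  # (batch, num_modalities, num_answers)
        weights = torch.softmax(self.modality_logits, dim=0)  # (num_modalities,)
        return (weights.view(1, -1, 1) * scores).sum(dim=1)  # (batch, num_answers)


if __name__ == "__main__":
    fusion = ModalityWeightedFusion()
    dialog, scene, knowledge = (torch.randn(2, 768) for _ in range(3))
    print(fusion([dialog, scene, knowledge]).shape)  # torch.Size([2, 4])
```

In ROLL the weighting is likely conditioned on the question and answer candidates rather than being a single global vector; the fixed logits above are used only for brevity.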
Related papers
- SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge [60.76719375410635] (2024-05-15)
We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos.
The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving.
We generate associated question-answer pairs and reasoning processes, followed by manual review for quality assurance.
- Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog [83.63849872250651] (2023-10-11)
Video-grounded dialog requires profound understanding of both dialog history and video content for accurate response generation.
We present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator.
- MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning [54.73173491543553] (2023-06-04)
MoviePuzzle is a novel challenge that targets visual narrative reasoning and holistic movie understanding.
To tackle this quandary, we put forth the MoviePuzzle task, which amplifies the temporal feature learning and structure learning of video models.
Our approach outperforms existing state-of-the-art methods on the MoviePuzzle benchmark.
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873] (2023-02-16)
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
- Saying the Unseen: Video Descriptions via Dialog Agents [37.16726118481626] (2021-06-26)
We introduce a novel task that aims to describe a video using the natural language dialog between two agents.
Q-BOT is given two semantically segmented frames, one from the beginning and one from the end of the video.
A-BOT, the other agent, who has access to the entire video, assists Q-BOT in accomplishing the goal by answering its questions.
- MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452] (2021-06-04)
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
- On the hidden treasure of dialog in video question answering [18.16094563534453] (2021-03-26)
We present a new approach to understand the whole story without external sources.
Unlike any prior work, we treat dialog as a noisy source to be converted into a text description.
Our model outperforms the state of the art on the KnowIT VQA dataset by a large margin.
- Co-attentional Transformers for Story-Based Video Understanding [24.211255523490692] (2020-10-27)
We propose a novel co-attentional transformer model to better capture long-term dependencies seen in visual stories such as dramas. (A generic co-attention sketch illustrating the idea appears after this list.)
We evaluate our approach on the recently introduced DramaQA dataset which features character-centered video story understanding questions.
- DramaQA: Character-Centered Video Story Understanding with Hierarchical QA [24.910132013543947] (2020-05-07)
We propose a novel video question answering (Video QA) task, DramaQA, for a comprehensive understanding of the video story.
Our dataset is built upon the TV drama "Another Miss Oh" and contains 17,983 QA pairs from 23,928 video clips of various lengths.
We provide 217,308 annotated images with rich character-centered annotations, including visual bounding boxes, behaviors and emotions of main characters.
- Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning [56.97139024595429] (2020-03-11)
In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene.
Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent.
We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes.
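The "Co-attentional Transformers for Story-Based Video Understanding" entry above mentions a co-attentional transformer for capturing long-term cross-modal dependencies. As a rough, generic illustration of co-attention rather than that paper's actual architecture, here is a minimal PyTorch block in which a textual stream and a visual stream each attend over the other; the class name CoAttentionBlock, the two-stream layout, and all dimensions are assumptions.

```python
# Generic co-attention sketch (not the architecture of the cited paper).
import torch
import torch.nn as nn


class CoAttentionBlock(nn.Module):
    """Two token streams (e.g. subtitles and frame features) cross-attend
    over each other, with residual connections and per-stream feed-forwards."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.text_attends_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_attends_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_text = nn.LayerNorm(dim)
        self.norm_vis = nn.LayerNorm(dim)
        self.ff_text = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.ff_vis = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, text: torch.Tensor, vis: torch.Tensor):
        # text: (batch, n_text_tokens, dim); vis: (batch, n_frames, dim)
        t2v, _ = self.text_attends_vis(query=text, key=vis, value=vis)
        v2t, _ = self.vis_attends_text(query=vis, key=text, value=text)
        text = self.norm_text(text + t2v)
        vis = self.norm_vis(vis + v2t)
        return text + self.ff_text(text), vis + self.ff_vis(vis)


if __name__ == "__main__":
    block = CoAttentionBlock()
    text, vis = torch.randn(2, 30, 512), torch.randn(2, 16, 512)
    out_text, out_vis = block(text, vis)
    print(out_text.shape, out_vis.shape)  # (2, 30, 512) and (2, 16, 512)
```

Whether the cited model stacks such blocks, shares weights across them, or conditions the attention on the question is not stated in the summary above.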
This list is automatically generated from the titles and abstracts of the papers on this site.