On the hidden treasure of dialog in video question answering
- URL: http://arxiv.org/abs/2103.14517v1
- Date: Fri, 26 Mar 2021 15:17:01 GMT
- Title: On the hidden treasure of dialog in video question answering
- Authors: Deniz Engin, Yannis Avrithis, Ngoc Q. K. Duong, François Schnitzler
- Abstract summary: We present a new approach to understand the whole story without external sources.
Unlike any prior work, we treat dialog as a noisy source to be converted into text description.
Our model outperforms the state of the art on the KnowIT VQA dataset by a large margin.
- Score: 18.16094563534453
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: High-level understanding of stories in video such as movies and TV shows from
raw data is extremely challenging. Modern video question answering (VideoQA)
systems often use additional human-made sources like plot synopses, scripts,
video descriptions or knowledge bases. In this work, we present a new approach
to understand the whole story without such external sources. The secret lies in
the dialog: unlike any prior work, we treat dialog as a noisy source to be
converted into text description via dialog summarization, much like recent
methods treat video. The input of each modality is encoded by transformers
independently, and a simple fusion method combines all modalities, using soft
temporal attention for localization over long inputs. Our model outperforms the
state of the art on the KnowIT VQA dataset by a large margin, without using
question-specific human annotation or human-made plot summaries. It even
outperforms human evaluators who have never watched any whole episode before.
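As a rough illustration of the fusion scheme described in the abstract (not the authors' released code), the sketch below assumes each modality (e.g. dialog summaries and video descriptions) has already been encoded into per-segment features by independent transformers; a question-conditioned soft temporal attention then pools each long sequence before a simple concatenate-and-score fusion. Module names and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftTemporalFusion(nn.Module):
    """Illustrative fusion: question-conditioned soft attention over long
    per-modality segment sequences, followed by simple concatenation."""

    def __init__(self, dim=768, num_modalities=2):
        super().__init__()
        self.attn_score = nn.Linear(2 * dim, 1)                    # scores one (segment, question) pair
        self.classifier = nn.Linear((num_modalities + 1) * dim, 1)

    def temporal_pool(self, segments, question):
        # segments: (T, dim) encoded segments of one modality (e.g. dialog summary)
        # question: (dim,) encoding of the question plus one answer candidate
        q = question.expand(segments.size(0), -1)                  # (T, dim)
        scores = self.attn_score(torch.cat([segments, q], -1))     # (T, 1)
        weights = F.softmax(scores, dim=0)                         # soft temporal attention
        return (weights * segments).sum(dim=0)                     # (dim,)

    def forward(self, modality_segments, question):
        # modality_segments: list of (T_i, dim) tensors, one per modality
        pooled = [self.temporal_pool(m, question) for m in modality_segments]
        fused = torch.cat(pooled + [question], dim=-1)
        return self.classifier(fused)                              # score for this candidate

# toy usage: two modalities with 20 and 50 segments, scoring one answer candidate
model = SoftTemporalFusion(dim=768, num_modalities=2)
video_desc, dialog_sum, qa = torch.randn(20, 768), torch.randn(50, 768), torch.randn(768)
score = model([video_desc, dialog_sum], qa)
```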
Related papers
- SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses [58.488812405557]
Video grounding aims to localize specific natural language queries in an untrimmed video.
We present a large-scale video grounding dataset named SynopGround.
We introduce a more complex setting of video grounding, dubbed Multi-Paragraph Video Grounding (MPVG).
arXiv Detail & Related papers (2024-08-03T05:35:13Z)
- Learning Video Context as Interleaved Multimodal Sequences [40.15446453928028]
MovieSeq is a multimodal language model developed to address the wide range of challenges in understanding video contexts.
Our core idea is to represent videos as interleaved multimodal sequences, either by linking external knowledge databases or using offline models.
To demonstrate its effectiveness, we validate MovieSeq's performance on six datasets.
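A minimal sketch of the interleaving idea, assuming per-modality items (frame captions from an offline model, subtitles, linked external facts) come with timestamps; the tags and merging policy are illustrative, not MovieSeq's actual format.

```python
from typing import List, Tuple

def interleave_context(
    frames: List[Tuple[float, str]],      # (timestamp, caption from an offline captioner)
    subtitles: List[Tuple[float, str]],   # (timestamp, dialog line)
    knowledge: List[Tuple[float, str]],   # (timestamp, linked external fact), may be empty
) -> str:
    """Sketch: merge per-modality items by time into one interleaved text
    sequence that a language model can consume as video context."""
    tagged = (
        [(t, f"<frame> {c}") for t, c in frames]
        + [(t, f"<subtitle> {s}") for t, s in subtitles]
        + [(t, f"<knowledge> {k}") for t, k in knowledge]
    )
    tagged.sort(key=lambda item: item[0])            # chronological interleaving
    return "\n".join(text for _, text in tagged)
```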
arXiv Detail & Related papers (2024-07-31T17:23:57Z)
- Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog [83.63849872250651]
Video-grounded dialog requires profound understanding of both dialog history and video content for accurate response generation.
We present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator.
arXiv Detail & Related papers (2023-10-11T07:37:13Z)
- TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World [97.58623810402563]
We introduce a new video-based multi-modal dialogue dataset, called TikTalk.
We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them.
Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context.
arXiv Detail & Related papers (2023-01-14T10:18:22Z)
- Learning to Retrieve Videos by Asking Questions [29.046045230398708]
We propose a novel framework for Video Retrieval using Dialog (ViReD), which enables the user to interact with an AI agent via multiple rounds of dialog.
The key contribution of our framework is a novel multimodal question generator that learns to ask questions that maximize the subsequent video retrieval performance.
We validate the effectiveness of our interactive ViReD framework on the AVSD dataset, showing that our interactive method performs significantly better than traditional non-interactive video retrieval systems.
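A schematic of the interactive loop described for ViReD, where `generate_question`, `answer_question`, and `rank_videos` stand in for the learned question generator, the (simulated) user, and the text-to-video ranker; these names and the round-based control flow are assumptions for illustration, not the authors' code.

```python
from typing import Callable, List, Tuple

def interactive_retrieval(
    initial_query: str,
    generate_question: Callable[[str], str],                 # hypothetical learned question generator
    answer_question: Callable[[str], str],                   # user (or simulated user) answering
    rank_videos: Callable[[str], List[Tuple[str, float]]],   # text-to-video ranker
    num_rounds: int = 3,
) -> List[Tuple[str, float]]:
    """Sketch of dialog-based video retrieval: each round appends a
    question/answer pair to the dialog context and re-ranks the videos."""
    dialog = initial_query
    ranking = rank_videos(dialog)
    for _ in range(num_rounds):
        question = generate_question(dialog)       # ask what would most help retrieval
        answer = answer_question(question)         # user responds about the target video
        dialog = f"{dialog} Q: {question} A: {answer}"
        ranking = rank_videos(dialog)              # retrieval conditioned on the full dialog
    return ranking
```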
arXiv Detail & Related papers (2022-05-11T19:14:39Z)
- Synopses of Movie Narratives: a Video-Language Dataset for Story Understanding [13.52545041750095]
We release a video-language story dataset, Synopses of Movie Narratives (SyMoN), containing 5,193 video summaries of popular movies and TV series with a total length of 869 hours.
SyMoN captures naturalistic storytelling videos made by human creators and intended for a human audience.
arXiv Detail & Related papers (2022-03-11T01:45:33Z)
- Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
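At a very high level, the generation pipeline could look like the sketch below, where `extract_answer` and `generate_question` are hypothetical stand-ins for the text-trained answer extractor and question-generation transformer; the data layout is illustrative rather than the paper's exact format.

```python
from typing import Callable, List, Tuple

def build_videoqa_dataset(
    narrations: List[Tuple[str, float, float]],      # (transcribed sentence, start, end)
    extract_answer: Callable[[str], str],            # hypothetical answer extractor
    generate_question: Callable[[str, str], str],    # hypothetical question generator
) -> List[dict]:
    """Sketch of large-scale VideoQA data generation from narrated videos:
    every narration sentence yields a (question, answer, clip) example."""
    dataset = []
    for sentence, start, end in narrations:
        answer = extract_answer(sentence)                 # e.g. a salient noun phrase
        question = generate_question(sentence, answer)    # question whose answer is `answer`
        dataset.append({
            "question": question,
            "answer": answer,
            "video_clip": (start, end),                   # temporal grounding of the narration
        })
    return dataset
```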
arXiv Detail & Related papers (2020-12-01T12:59:20Z)
- Co-attentional Transformers for Story-Based Video Understanding [24.211255523490692]
We propose a novel co-attentional transformer model to better capture long-term dependencies seen in visual stories such as dramas.
We evaluate our approach on the recently introduced DramaQA dataset which features character-centered video story understanding questions.
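A generic co-attention block, sketched here as a simplified stand-in for the model in the paper: each stream (e.g. subtitle tokens and frame features) attends to the other with standard multi-head cross-attention. Layer sizes and the residual/norm layout are assumptions.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Generic co-attention: each stream (e.g. subtitles and video frames)
    attends to the other; a simplified stand-in, not the paper's exact model."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.text_to_video = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text, video):
        # text: (B, Lt, dim) token features; video: (B, Lv, dim) frame features
        t_attended, _ = self.text_to_video(text, video, video)   # text queries video
        v_attended, _ = self.video_to_text(video, text, text)    # video queries text
        return self.norm_t(text + t_attended), self.norm_v(video + v_attended)

# toy usage
block = CoAttentionBlock()
text, video = torch.randn(2, 30, 512), torch.randn(2, 100, 512)
text_out, video_out = block(text, video)
```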
arXiv Detail & Related papers (2020-10-27T07:17:09Z)
- Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization [72.54873655114844]
Text summarization is one of the most challenging and interesting problems in NLP.
This work proposes a multi-view sequence-to-sequence model by first extracting conversational structures of unstructured daily chats from different views to represent conversations.
Experiments on a large-scale dialogue summarization corpus demonstrated that our methods significantly outperformed previous state-of-the-art models via both automatic evaluations and human judgment.
arXiv Detail & Related papers (2020-10-04T20:12:44Z)
- Knowledge-Based Video Question Answering with Unsupervised Scene Descriptions [27.63022376316052]
We design ROLL, a model for knowledge-based video story question answering.
ROLL comprises three tasks, each in charge of extracting rich and diverse information: 1) processing scene dialogues, 2) generating unsupervised video scene descriptions, and 3) obtaining external knowledge in a weakly supervised fashion.
To answer a given question correctly, the information generated by each of these cognitively inspired tasks is encoded via Transformers and fused through a modality weighting mechanism.
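One plausible reading of that modality weighting step, sketched below: each branch (dialog, scene description, external knowledge) produces its own answer scores, and learned softmax weights combine them. The exact mechanism in ROLL may differ; this is only illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityWeighting(nn.Module):
    """Illustrative modality weighting: per-branch answer scores are combined
    with learned softmax weights (a simplified reading of the fusion step)."""

    def __init__(self, num_modalities=3):
        super().__init__()
        # one learnable weight per branch: dialog, scene description, knowledge
        self.logits = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, branch_scores):
        # branch_scores: (B, num_modalities, num_answers) scores from each branch
        weights = F.softmax(self.logits, dim=0)                      # (num_modalities,)
        return (weights[None, :, None] * branch_scores).sum(dim=1)   # (B, num_answers)

# toy usage: 4 answer candidates scored by 3 branches for a batch of 2 questions
fusion = ModalityWeighting()
scores = fusion(torch.randn(2, 3, 4))
answer = scores.argmax(dim=-1)
```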
arXiv Detail & Related papers (2020-07-17T04:26:38Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self- and cross-integration for the different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
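A minimal sketch of frame-selection gating in the spirit described above, assuming a question-conditioned sigmoid gate that softly suppresses irrelevant frames; the layer shapes and gating form are assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class FrameSelectionGate(nn.Module):
    """Illustrative frame-selection gating: a question-conditioned sigmoid gate
    down-weights frames that look irrelevant before answer prediction."""

    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, frames, question):
        # frames: (B, T, dim) frame features; question: (B, dim) question encoding
        q = question.unsqueeze(1).expand(-1, frames.size(1), -1)      # (B, T, dim)
        g = torch.sigmoid(self.gate(torch.cat([frames, q], dim=-1)))  # (B, T, 1)
        return g * frames                                             # gated (soft-selected) frames

# toy usage: 60 frames, batch of 2 questions
gate = FrameSelectionGate()
gated = gate(torch.randn(2, 60, 512), torch.randn(2, 512))
```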
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.