Co-attentional Transformers for Story-Based Video Understanding
- URL: http://arxiv.org/abs/2010.14104v1
- Date: Tue, 27 Oct 2020 07:17:09 GMT
- Title: Co-attentional Transformers for Story-Based Video Understanding
- Authors: Björn Bebensee, Byoung-Tak Zhang
- Abstract summary: We propose a novel co-attentional transformer model to better capture long-term dependencies seen in visual stories such as dramas.
We evaluate our approach on the recently introduced DramaQA dataset which features character-centered video story understanding questions.
- Score: 24.211255523490692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by recent trends in vision and language learning, we explore
applications of attention mechanisms for visio-lingual fusion within an
application to story-based video understanding. Like other video-based QA
tasks, video story understanding requires agents to grasp complex temporal
dependencies. However, as it focuses on the narrative aspect of video, it also
requires understanding of the interactions between different characters, as
well as their actions and their motivations. We propose a novel co-attentional
transformer model to better capture long-term dependencies seen in visual
stories such as dramas and measure its performance on the video question
answering task. We evaluate our approach on the recently introduced DramaQA
dataset which features character-centered video story understanding questions.
Our model outperforms the baseline model by 8 percentage points overall, by at
least 4.95 and up to 12.8 percentage points across all difficulty levels, and
surpasses the winner of the DramaQA challenge.
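Below is a minimal, illustrative sketch of a single co-attention block in the spirit of the model described above: two transformer streams (visual and language) that attend to each other's tokens. It assumes PyTorch; the dimensions, layer choices, and use of a single block are placeholders, not the paper's exact configuration.
```python
# Sketch of one co-attention block: each stream cross-attends to the other
# stream's tokens, followed by residual connections and feed-forward layers.
# Dimensions and layer choices are illustrative only.
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    def __init__(self, dim=768, heads=8, ff_mult=4, dropout=0.1):
        super().__init__()
        # Each stream queries the *other* stream's features (cross-attention).
        self.vis_attends_txt = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.txt_attends_vis = nn.MultiheadAttention(dim, heads, dropout=dropout, batch_first=True)
        self.norm_v1, self.norm_v2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm_t1, self.norm_t2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ff_v = nn.Sequential(nn.Linear(dim, ff_mult * dim), nn.GELU(), nn.Linear(ff_mult * dim, dim))
        self.ff_t = nn.Sequential(nn.Linear(dim, ff_mult * dim), nn.GELU(), nn.Linear(ff_mult * dim, dim))

    def forward(self, vis, txt):
        # Visual tokens query language tokens and vice versa; residual + layer norm.
        v = self.norm_v1(vis + self.vis_attends_txt(vis, txt, txt)[0])
        t = self.norm_t1(txt + self.txt_attends_vis(txt, vis, vis)[0])
        v = self.norm_v2(v + self.ff_v(v))
        t = self.norm_t2(t + self.ff_t(t))
        return v, t

# Example: 40 video-frame features and 30 subtitle/question tokens, both projected to 768-d.
vis = torch.randn(2, 40, 768)
txt = torch.randn(2, 30, 768)
v_out, t_out = CoAttentionBlock()(vis, txt)
```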
Related papers
- Hawk: Learning to Understand Open-World Video Anomalies [76.9631436818573]
Video Anomaly Detection (VAD) systems can autonomously monitor and identify disturbances, reducing the need for manual labor and associated costs.
We introduce Hawk, a novel framework that leverages interactive large Visual Language Models (VLM) to interpret video anomalies precisely.
We have annotated over 8,000 anomaly videos with language descriptions, enabling effective training across diverse open-world scenarios, and also created 8,000 question-answering pairs for users' open-world questions.
arXiv Detail & Related papers (2024-05-27T07:08:58Z)
- Long Story Short: a Summarize-then-Search Method for Long Video Question Answering [23.094728230459125]
We investigate if language models can extend their zero-shot reasoning abilities to long multimodal narratives in multimedia content.
We propose Long Story Short, a framework for narrative video QA that first summarizes the narrative of the video to a short plot and then searches parts of the video relevant to the question.
Our model outperforms state-of-the-art supervised models by a large margin, highlighting the potential of zero-shot QA for long videos.
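A hedged sketch of the summarize-then-search idea described above; `llm`, the prompts, and the clip-selection heuristic are hypothetical placeholders standing in for language-model calls, not the paper's actual pipeline or API.
```python
# Illustrative summarize-then-search QA pipeline over a long video.
# clip_captions: per-clip captions/transcripts extracted beforehand.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in a language-model call here")

def long_story_short(clip_captions: list[str], question: str, top_k: int = 3) -> str:
    # 1) Summarize the whole narrative into a short plot.
    plot = llm("Summarize this video as a short plot:\n" + "\n".join(clip_captions))
    # 2) Ask which clips are relevant to the question, then re-read only those.
    idx_str = llm(f"Plot: {plot}\nQuestion: {question}\n"
                  f"List up to {top_k} clip indices (0-{len(clip_captions)-1}) to re-examine:")
    relevant = [clip_captions[int(i)] for i in idx_str.split(",") if i.strip().isdigit()]
    # 3) Answer using the short plot plus the retrieved clip details.
    return llm(f"Plot: {plot}\nRelevant clips: {relevant}\nQuestion: {question}\nAnswer:")
```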
arXiv Detail & Related papers (2023-11-02T13:36:11Z)
- FunQA: Towards Surprising Video Comprehension [64.58663825184958]
We introduce FunQA, a challenging video question-answering dataset.
FunQA covers three previously unexplored types of surprising videos: HumorQA, CreativeQA, and MagicQA.
In total, the FunQA benchmark consists of 312K free-text QA pairs derived from 4.3K video clips.
arXiv Detail & Related papers (2023-06-26T17:59:55Z)
- A Video Is Worth 4096 Tokens: Verbalize Videos To Understand Them In Zero Shot [67.00455874279383]
We propose verbalizing long videos to generate descriptions in natural language, then performing video-understanding tasks on the generated story as opposed to the original video.
Our method, despite being zero-shot, achieves significantly better results than supervised baselines for video understanding.
To alleviate the lack of story understanding benchmarks, we publicly release the first dataset for persuasion strategy identification, a crucial task in computational social science.
arXiv Detail & Related papers (2023-05-16T19:13:11Z)
- NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory [92.98552727430483]
Narrations-as-Queries (NaQ) is a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model.
NaQ improves multiple top models by substantial margins (even doubling their accuracy).
We also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories.
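A minimal sketch of the narrations-as-queries idea: timestamped narrations become (query, temporal window) training pairs for natural-language query (NLQ) localization. The field names and the fixed-window heuristic are illustrative assumptions, not the paper's exact recipe.
```python
# Illustrative conversion of timestamped narrations into NLQ-style training
# samples (query text + temporal window). Field names and the window heuristic
# are assumptions for this sketch.
from dataclasses import dataclass

@dataclass
class NLQSample:
    video_id: str
    query: str
    start_s: float
    end_s: float

def narrations_to_queries(video_id, narrations, window_s=4.0):
    """narrations: list of (timestamp_seconds, text) pairs."""
    samples = []
    for t, text in narrations:
        # Treat the narration text as the query and a window around its
        # timestamp as the (pseudo) ground-truth temporal answer.
        samples.append(NLQSample(video_id, text, max(0.0, t - window_s / 2), t + window_s / 2))
    return samples

samples = narrations_to_queries("vid_001", [(12.3, "C opens the fridge"), (47.9, "C pours water")])
```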
arXiv Detail & Related papers (2023-01-02T16:40:15Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
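A small, hypothetical sketch of the frame-selection gating idea mentioned above: a learned sigmoid gate scores each frame's fused representation so that question-relevant frames contribute more to the pooled answer feature. It assumes PyTorch; dimensions and layer choices are placeholders, not the paper's configuration.
```python
# Sketch of frame-selection gating: a sigmoid gate weights each frame's fused
# feature before pooling, so question-relevant frames dominate the summary.
import torch
import torch.nn as nn

class FrameSelectionGate(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, frame_feats, question_feat):
        # frame_feats: (B, T, D) fused frame/caption features; question_feat: (B, D)
        q = question_feat.unsqueeze(1).expand(-1, frame_feats.size(1), -1)
        scores = torch.sigmoid(self.gate(torch.cat([frame_feats, q], dim=-1)))  # (B, T, 1)
        # Gated average pooling over frames.
        return (scores * frame_feats).sum(dim=1) / scores.sum(dim=1).clamp(min=1e-6)

pooled = FrameSelectionGate()(torch.randn(2, 16, 512), torch.randn(2, 512))
```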
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- Character Matters: Video Story Understanding with Character-Aware Relations [47.69347058141917]
Video Story Question Answering (VSQA) offers an effective way to benchmark higher-level comprehension abilities of a model.
Current VSQA methods merely extract generic visual features from a scene.
We propose a novel model that continuously refines character-aware relations.
arXiv Detail & Related papers (2020-05-09T06:51:13Z)
- DramaQA: Character-Centered Video Story Understanding with Hierarchical QA [24.910132013543947]
We propose a novel video question answering (Video QA) task, DramaQA, for a comprehensive understanding of the video story.
Our dataset is built upon the TV drama "Another Miss Oh" and it contains 17,983 QA pairs from 23,928 video clips of various lengths.
We provide 217,308 annotated images with rich character-centered annotations, including visual bounding boxes, behaviors and emotions of main characters.
arXiv Detail & Related papers (2020-05-07T09:44:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.