Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA
- URL: http://arxiv.org/abs/2005.06409v1
- Date: Wed, 13 May 2020 16:35:27 GMT
- Title: Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA
- Authors: Hyounghun Kim, Zineng Tang, Mohit Bansal
- Abstract summary: We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates which pass more relevant information to the classifier.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
- Score: 96.10612095576333
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Videos convey rich information. Dynamic spatio-temporal relationships between
people/objects, and diverse multimodal events are present in a video clip.
Hence, it is important to develop automated models that can accurately extract
such information from videos. Answering questions on videos is one of the tasks
which can evaluate such AI abilities. In this paper, we propose a video
question answering model which effectively integrates multi-modal input sources
and finds the temporally relevant information to answer questions.
Specifically, we first employ dense image captions to help identify objects and
their detailed salient regions and actions, and hence give the model useful
extra information (in explicit textual format to allow easier matching) for
answering questions. Moreover, our model is also comprised of dual-level
attention (word/object and frame level), multi-head self/cross-integration for
different sources (video and dense captions), and gates which pass more
relevant information to the classifier. Finally, we also cast the frame
selection problem as a multi-label classification task and introduce two loss
functions, In-and-Out Frame Score Margin (IOFSM) and Balanced Binary
Cross-Entropy (BBCE), to better supervise the model with human importance
annotations. We evaluate our model on the challenging TVQA dataset, where each
of our model components provides significant gains, and our overall model
outperforms the state-of-the-art by a large margin (74.09% versus 70.52%). We
also present several word, object, and frame level visualization studies. Our
code is publicly available at:
https://github.com/hyounghk/VideoQADenseCapFrameGate-ACL2020
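The abstract names the two frame-selection losses but does not give their formulas. The following is a minimal sketch of one plausible reading, not the authors' implementation: it assumes each frame has a sigmoid score in (0, 1) and a binary label marking whether it lies inside the human-annotated moment. The function names `balanced_bce` and `in_out_margin`, the `eps` smoothing term, and the exact normalization are assumptions made for illustration; the paper and the linked repository are the authoritative definitions.

```python
import math


def _split(scores, labels):
    """Separate frame scores into in-moment (label 1) and out-of-moment (label 0)."""
    inside = [s for s, y in zip(scores, labels) if y == 1]
    outside = [s for s, y in zip(scores, labels) if y == 0]
    return inside, outside


def balanced_bce(scores, labels, eps=1e-8):
    """Balanced binary cross-entropy (assumed form).

    The positive and negative terms are averaged over their own frame
    counts, so a clip with few in-moment frames is not dominated by the
    many out-of-moment frames.
    """
    inside, outside = _split(scores, labels)
    pos_term = sum(math.log(s + eps) for s in inside) / max(len(inside), 1)
    neg_term = sum(math.log(1.0 - s + eps) for s in outside) / max(len(outside), 1)
    return -(pos_term + neg_term)


def in_out_margin(scores, labels):
    """In-and-out frame score margin (assumed form).

    Computes 1 + mean(out-frame scores) - mean(in-frame scores), which
    reaches 0 when every in-moment frame scores 1 and every
    out-of-moment frame scores 0.
    """
    inside, outside = _split(scores, labels)
    mean_in = sum(inside) / max(len(inside), 1)
    mean_out = sum(outside) / max(len(outside), 1)
    return 1.0 + mean_out - mean_in


# Hypothetical scores for a four-frame clip; the first two frames are in the moment.
frame_scores = [0.9, 0.8, 0.1, 0.2]
frame_labels = [1, 1, 0, 0]
total_loss = balanced_bce(frame_scores, frame_labels) + in_out_margin(frame_scores, frame_labels)
```

Summing the two terms with equal weight is also an assumption; in practice they would be added to the answer-classification loss with weights chosen on validation data.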
Related papers
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering [71.62961521518731]
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z) - Multi-object event graph representation learning for Video Question Answering [4.236280446793381]
We propose a contrastive language event graph representation learning method called CLanG to address this limitation.
Our method outperforms a strong baseline, achieving up to 2.2% higher accuracy on two challenging VideoQA datasets, NExT-QA and TGIF-QA-R.
arXiv Detail & Related papers (2024-09-12T04:42:51Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - Zero-Shot Video Question Answering via Frozen Bidirectional Language
Models [89.71617065426146]
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training.
Recent methods consider zero-shot settings with no manual annotation of visual question-answer.
We build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA.
arXiv Detail & Related papers (2022-06-16T13:18:20Z) - Leveraging Local Temporal Information for Multimodal Scene
Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to get contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z) - MERLOT: Multimodal Neural Script Knowledge Models [74.05631672657452]
We introduce MERLOT, a model that learns multimodal script knowledge by watching millions of YouTube videos with transcribed speech.
MERLOT exhibits strong out-of-the-box representations of temporal commonsense, and achieves state-of-the-art performance on 12 different video QA datasets.
On Visual Commonsense Reasoning, MERLOT answers questions correctly with 80.6% accuracy, outperforming state-of-the-art models of similar size by over 3%.
arXiv Detail & Related papers (2021-06-04T17:57:39Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z) - HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)