Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering
- URL: http://arxiv.org/abs/2109.04735v1
- Date: Fri, 10 Sep 2021 08:31:58 GMT
- Title: Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering
- Authors: Min Peng, Chongyang Wang, Yuan Gao, Yu Shi, Xiang-Dong Zhou
- Abstract summary: Video question answering (VideoQA) is challenging given its multimodal combination of visual understanding and natural language understanding.
This paper proposes a novel Temporal Pyramid Transformer (TPT) model with multimodal interaction for VideoQA.
- Score: 13.805714443766236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video question answering (VideoQA) is challenging given its multimodal
combination of visual understanding and natural language understanding. Existing approaches seldom leverage the appearance-motion information in the video at multiple temporal scales, and the interaction between the question and the visual information for extracting textual semantics is frequently ignored.
Targeting these issues, this paper proposes a novel Temporal Pyramid
Transformer (TPT) model with multimodal interaction for VideoQA. The TPT model
comprises two modules, namely Question-specific Transformer (QT) and Visual
Inference (VI). Given the temporal pyramid constructed from a video, QT builds
the question semantics from the coarse-to-fine multimodal co-occurrence between
each word and the visual content. Under the guidance of such question-specific
semantics, VI infers the visual clues from the local-to-global multi-level
interactions between the question and the video. Within each module, we
introduce a multimodal attention mechanism to aid the extraction of
question-video interactions, with residual connections adopted for the
information passing across different levels. Through extensive experiments on
three VideoQA datasets, we demonstrate that the proposed method outperforms state-of-the-art approaches.
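As a rough illustration of the two mechanisms the abstract describes, the sketch below builds a temporal pyramid by average-pooling frame features to coarser scales and applies a multimodal attention block in which question tokens attend to the visual tokens of each level, with a residual connection carrying information across levels. This is not the authors' released code; the dimensions, pooling strategy, module names, and the single shared attention block are illustrative assumptions.

```python
# Minimal PyTorch sketch of a temporal pyramid plus a question-to-visual
# multimodal attention block with a residual connection.  All names,
# dimensions, and design choices are illustrative assumptions, not the
# TPT implementation.
import torch
import torch.nn as nn


def temporal_pyramid(frames: torch.Tensor, levels: int = 3):
    """frames: (B, T, D) appearance-motion features; returns fine-to-coarse scales."""
    pyramid = [frames]
    for _ in range(levels - 1):
        b, t, d = pyramid[-1].shape
        t = t - (t % 2)  # drop a trailing odd frame so pairs pool cleanly
        coarser = pyramid[-1][:, :t].reshape(b, t // 2, 2, d).mean(dim=2)
        pyramid.append(coarser)  # halve the temporal resolution at each level
    return pyramid


class MultimodalAttentionBlock(nn.Module):
    """Question tokens (queries) attend to visual tokens (keys/values)."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, question: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(question, visual, visual)
        # Residual connection passes the question state across pyramid levels.
        return self.norm(question + attended)


# Toy usage: 16 frames, 8 question tokens, 256-d features.
frames = torch.randn(2, 16, 256)
question = torch.randn(2, 8, 256)
block = MultimodalAttentionBlock()
for level in reversed(temporal_pyramid(frames)):  # coarse-to-fine refinement
    question = block(question, level)
print(question.shape)  # torch.Size([2, 8, 256])
```

Calling the block once per pyramid level mirrors, in spirit, how QT accumulates question semantics from coarse to fine before VI reuses the question-guided state; the actual model uses separate modules and richer attention than this toy loop.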
Related papers
- Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose MUTR, a Multi-modal Unified Temporal transformer for Referring video object segmentation.
For the first time in a unified framework, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Multilevel Hierarchical Network with Multiscale Sampling for Video Question Answering [16.449212284367366]
We propose a novel Multilevel Hierarchical Network (MHN) with multiscale sampling for VideoQA.
MHN comprises two modules, namely Recurrent Multimodal Interaction (RMI) and Parallel Visual Reasoning (PVR).
With multiscale sampling, RMI iterates the interaction between the appearance-motion information at each scale and the question embeddings to build multilevel question-guided visual representations.
PVR infers the visual cues at each level in parallel to support answering different question types that may rely on the visual information at the relevant levels.
arXiv Detail & Related papers (2022-05-09T06:28:56Z)
- Multi-Modal Interaction Graph Convolutional Network for Temporal Language Localization in Videos [55.52369116870822]
This paper focuses on tackling the problem of temporal language localization in videos.
It aims to identify the start and end points of a moment described by a natural language sentence in an untrimmed video.
arXiv Detail & Related papers (2021-10-12T14:59:25Z)
- Relation-aware Hierarchical Attention Framework for Video Question Answering [6.312182279855817]
We propose a novel Relation-aware Hierarchical Attention (RHA) framework to learn both the static and dynamic relations of the objects in videos.
In particular, videos and questions are first embedded by pre-trained models to obtain the visual and textual features.
We consider the temporal, spatial, and semantic relations, and fuse the multimodal features by hierarchical attention mechanism to predict the answer.
arXiv Detail & Related papers (2021-05-13T09:35:42Z)
- Multi-modal Transformer for Video Retrieval [67.86763073161012]
We present a multi-modal transformer to jointly encode the different modalities in video.
On the natural language side, we investigate the best practices to jointly optimize the language embedding together with the multi-modal transformer.
This novel framework allows us to establish state-of-the-art results for video retrieval on three datasets.
arXiv Detail & Related papers (2020-07-21T07:38:46Z)
- Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers [89.00926092864368]
We present a semantics-controlled multi-modal shuffled Transformer reasoning framework for the audio-visual scene aware dialog task.
We also present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing semantic graph representations for every frame.
Our results demonstrate state-of-the-art performances on all evaluation metrics.
arXiv Detail & Related papers (2020-07-08T02:00:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.