Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer
- URL: http://arxiv.org/abs/2302.02136v1
- Date: Sat, 4 Feb 2023 09:14:18 GMT
- Title: Efficient End-to-End Video Question Answering with Pyramidal Multimodal Transformer
- Authors: Min Peng, Chongyang Wang, Yu Shi, Xiang-Dong Zhou
- Abstract summary: We present a new method for end-to-end Video Question Answering (VideoQA).
We achieve this with a pyramidal multimodal transformer (PMT) model, which simply incorporates a learnable word embedding layer and a few convolutional and transformer layers.
We demonstrate better or on-par performance with high computational efficiency against state-of-the-art methods on five VideoQA benchmarks.
- Score: 13.71165050314854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a new method for end-to-end Video Question Answering
(VideoQA), aside from the current popularity of using large-scale pre-training
with huge feature extractors. We achieve this with a pyramidal multimodal
transformer (PMT) model, which simply incorporates a learnable word embedding
layer, a few convolutional and transformer layers. We use the anisotropic
pyramid to fulfill video-language interactions across different spatio-temporal
scales. In addition to the canonical pyramid, which includes both bottom-up and
top-down pathways with lateral connections, novel strategies are proposed to
decompose the visual feature stream into spatial and temporal sub-streams at
different scales and implement their interactions with the linguistic semantics
while preserving the integrity of local and global semantics. We demonstrate
better or on-par performances with high computational efficiency against
state-of-the-art methods on five VideoQA benchmarks. Our ablation study shows
the scalability of our model that achieves competitive results for
text-to-video retrieval by leveraging feature extractors with reusable
pre-trained weights, and also the effectiveness of the pyramid.
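To make the architecture described in the abstract more concrete, below is a minimal PyTorch sketch of how such a pyramidal multimodal transformer could be wired together: a learnable word embedding layer for the question, a small convolutional bottom-up pathway over video features, a top-down pathway with lateral connections, and spatial/temporal sub-streams that cross-attend to the question at each scale. All layer sizes, the pooling choices, the isotropic strides (the paper's pyramid is anisotropic), and the answer head are assumptions for illustration, not the actual PMT implementation.

```python
# Minimal, illustrative sketch of a pyramidal multimodal VideoQA model (assumptions noted in comments).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PMTSketch(nn.Module):
    def __init__(self, vocab_size=8000, dim=256, num_scales=3, num_answers=1000):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, dim)        # learnable word embedding layer
        self.text_enc = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        # Bottom-up pathway: a few 3D convolutions produce progressively coarser scales.
        self.bottom_up = nn.ModuleList(
            [nn.Conv3d(dim, dim, kernel_size=3, stride=2, padding=1) for _ in range(num_scales - 1)]
        )
        # Lateral 1x1x1 convolutions used by the top-down pathway.
        self.lateral = nn.ModuleList([nn.Conv3d(dim, dim, kernel_size=1) for _ in range(num_scales)])
        # Cross-modal attention for the spatial and temporal sub-streams at each scale.
        self.spatial_xattn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.temporal_xattn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim, num_answers)          # hypothetical answer head

    def forward(self, video_feats, question_ids):
        # video_feats: (B, dim, T, H, W) pre-extracted clip features; question_ids: (B, L) token ids
        q = self.text_enc(self.word_embed(question_ids))       # (B, L, dim) question semantics
        # Bottom-up pathway over spatio-temporal scales.
        pyramid = [video_feats]
        for conv in self.bottom_up:
            pyramid.append(conv(pyramid[-1]))
        # Top-down pathway with lateral connections (coarse-to-fine refinement).
        fused, prev = [], None
        for feat in reversed(pyramid):
            lat = self.lateral[len(fused)](feat)
            if prev is not None:
                lat = lat + F.interpolate(prev, size=lat.shape[2:], mode="trilinear")
            prev = lat
            fused.append(lat)
        # Decompose each scale into spatial / temporal sub-streams and attend to the question.
        pooled = []
        for feat in fused:
            spatial = feat.mean(dim=2).flatten(2).transpose(1, 2)   # (B, H*W, dim), time-averaged
            temporal = feat.mean(dim=(3, 4)).transpose(1, 2)        # (B, T, dim), space-averaged
            s, _ = self.spatial_xattn(spatial, q, q)
            t, _ = self.temporal_xattn(temporal, q, q)
            pooled.append(s.mean(dim=1) + t.mean(dim=1))
        return self.classifier(torch.stack(pooled).mean(dim=0))    # (B, num_answers) logits
```

With these illustrative defaults, `PMTSketch()(torch.randn(2, 256, 8, 28, 28), torch.randint(0, 8000, (2, 12)))` returns a (2, 1000) tensor of answer logits.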
Related papers
- Pyramidal Flow Matching for Efficient Video Generative Modeling [67.03504440964564]
This work introduces a unified pyramidal flow matching algorithm.
It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution.
The entire framework can be optimized in an end-to-end manner with a single unified Diffusion Transformer (DiT).
arXiv Detail & Related papers (2024-10-08T12:10:37Z)
- Pyramid Hierarchical Transformer for Hyperspectral Image Classification [1.9427851979929982]
We propose a pyramid-based hierarchical transformer (PyFormer)
This innovative approach organizes input data hierarchically into segments, each representing distinct abstraction levels.
Results underscore the superiority of the proposed method over traditional approaches.
arXiv Detail & Related papers (2024-04-23T11:41:19Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings (a minimal sketch of this idea follows the related-papers list below).
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- UMMAFormer: A Universal Multimodal-adaptive Transformer Framework for Temporal Forgery Localization [16.963092523737593]
We propose a novel framework for temporal forgery localization (TFL) that predicts forgery segments with multimodal adaptation.
Our approach achieves state-of-the-art performance on benchmark datasets, including Lav-DF, TVIL, and Psynd.
arXiv Detail & Related papers (2023-08-28T08:20:30Z)
- EgoViT: Pyramid Video Transformer for Egocentric Action Recognition [18.05706639179499]
Capturing the interaction of hands with objects is important for autonomously detecting human actions from egocentric videos.
We present a pyramid video transformer with a dynamic class token generator for egocentric action recognition.
arXiv Detail & Related papers (2023-03-15T20:33:50Z)
- Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation [72.74191015833397]
We propose TransFusion, a multimodal transformer-based architecture.
It exploits the representational power of language by summarizing the action context.
Our model enables more efficient end-to-end learning.
arXiv Detail & Related papers (2023-01-22T21:30:12Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- TubeDETR: Spatio-Temporal Video Grounding with Transformers [89.71617065426146]
We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query.
To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection.
arXiv Detail & Related papers (2022-03-30T16:31:49Z)
- Exploring Intra- and Inter-Video Relation for Surgical Semantic Scene Segmentation [58.74791043631219]
We propose a novel framework STswinCL that explores the complementary intra- and inter-video relations to boost segmentation performance.
We extensively validate our approach on two public surgical video benchmarks, including EndoVis18 Challenge and CaDIS dataset.
Experimental results demonstrate the promising performance of our method, which consistently exceeds previous state-of-the-art approaches.
arXiv Detail & Related papers (2022-03-29T05:52:23Z)
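The VaQuitA entry above mentions replacing uniform frame sampling with a CLIP-score-guided sampling step. Below is a minimal sketch of that idea in PyTorch, assuming the per-frame CLIP image embeddings and the query's CLIP text embedding have already been computed; the function name and the simple top-k selection are assumptions for illustration, not VaQuitA's actual procedure.

```python
# Minimal sketch of CLIP-score-guided frame selection (illustrative assumptions, not VaQuitA's exact method).
import torch
import torch.nn.functional as F

def sample_frames_by_clip_score(frame_embs: torch.Tensor, text_emb: torch.Tensor, k: int) -> torch.Tensor:
    """Pick the k frames whose (precomputed) CLIP embeddings best match the query embedding.

    frame_embs: (N, D) one CLIP image embedding per candidate frame
    text_emb:   (D,)   CLIP text embedding of the question / instruction
    Returns the selected frame indices in temporal order.
    """
    scores = F.normalize(frame_embs, dim=-1) @ F.normalize(text_emb, dim=-1)  # cosine similarity per frame
    topk = torch.topk(scores, k).indices
    return torch.sort(topk).values
```

Compared with uniform sampling, this biases the retained frames toward those most relevant to the query, at the cost of one CLIP embedding pass per candidate frame.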