Redundancy-aware Transformer for Video Question Answering
- URL: http://arxiv.org/abs/2308.03267v1
- Date: Mon, 7 Aug 2023 03:16:24 GMT
- Title: Redundancy-aware Transformer for Video Question Answering
- Authors: Yicong Li, Xun Yang, An Zhang, Chun Feng, Xiang Wang, Tat-Seng Chua
- Abstract summary: We propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner.
To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in neighboring frames.
As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling, which explicitly differentiates the vision-language interactions.
- Score: 71.98116071679065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper identifies two kinds of redundancy in the current VideoQA
paradigm. Specifically, the current video encoders tend to holistically embed
all video clues at different granularities in a hierarchical manner, which
inevitably introduces neighboring-frame redundancy that can overwhelm
detailed visual clues at the object level. Subsequently, prevailing
vision-language fusion designs introduce cross-modal redundancy by
exhaustively fusing all visual elements with question tokens without explicitly
differentiating their pairwise vision-language interactions, which harms
answer prediction.
To this end, we propose a novel transformer-based architecture that aims to
model VideoQA in a redundancy-aware manner. To address the neighboring-frame
redundancy, we introduce a video encoder structure that emphasizes the
object-level change in neighboring frames, while adopting an out-of-neighboring
message-passing scheme that imposes attention only on distant frames. As for
the cross-modal redundancy, we equip our fusion module with a novel adaptive
sampling, which explicitly differentiates the vision-language interactions by
identifying a small subset of visual elements that exclusively support the
answer. With these designs, we find that this Redundancy-aware transformer
(RaFormer) achieves state-of-the-art results on multiple VideoQA benchmarks.
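To make the out-of-neighboring message passing concrete, here is a minimal PyTorch sketch (an illustration under stated assumptions, not the authors' released code): frame tokens attend only to frames outside a local temporal window, so near-duplicate neighboring frames cannot dominate the attention. The single-head formulation, tensor shapes, and the `radius` hyper-parameter are illustrative choices, not values taken from the paper.

```python
import torch
import torch.nn.functional as F

def out_of_neighboring_attention(frame_tokens: torch.Tensor, radius: int = 2) -> torch.Tensor:
    """frame_tokens: (T, D), one token per frame; returns updated (T, D) tokens.

    Requires T > 2 * radius + 1 so every frame has at least one distant frame to attend to.
    """
    T, D = frame_tokens.shape
    q = k = v = frame_tokens                              # single-head self-attention for brevity
    scores = (q @ k.t()) / (D ** 0.5)                     # (T, T) pairwise frame affinities
    idx = torch.arange(T)
    near = (idx[:, None] - idx[None, :]).abs() <= radius  # True for a frame and its close neighbors
    scores = scores.masked_fill(near, float("-inf"))      # block attention to neighboring frames
    return F.softmax(scores, dim=-1) @ v                  # aggregate messages from distant frames only

# Example: 16 frames with 256-d features; each frame attends only to frames
# more than 2 steps away in time.
updated = out_of_neighboring_attention(torch.randn(16, 256), radius=2)
```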
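The cross-modal adaptive sampling can likewise be pictured as a relevance-scored top-k selection over visual elements before fusion. The sketch below is one possible interpretation, not the paper's exact module; the cosine-similarity scoring rule, the mean-pooled question context, and the value of `k` are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def adaptive_sample_and_fuse(visual: torch.Tensor,    # (N, D) visual elements (e.g., object/frame tokens)
                             question: torch.Tensor,  # (L, D) question tokens
                             k: int = 8) -> torch.Tensor:
    """Keep only the k visual elements most relevant to the question, then fuse them."""
    q_ctx = question.mean(dim=0)                                         # pooled question representation
    relevance = F.cosine_similarity(visual, q_ctx.unsqueeze(0), dim=-1)  # (N,) relevance of each element
    topk = relevance.topk(k=min(k, visual.size(0))).indices
    selected = visual[topk]                                              # (k, D) answer-supporting subset
    # Concatenate the question tokens with only the selected visual elements;
    # a standard transformer fusion block would then operate on this short sequence.
    return torch.cat([question, selected], dim=0)

fused_tokens = adaptive_sample_and_fuse(torch.randn(64, 256), torch.randn(12, 256), k=8)
```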
Related papers
- DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder [21.405442790474268]
We propose DiffDub: Diffusion-based dubbing.
We first craft the Diffusion auto-encoder with an inpainting renderer that incorporates a mask to delineate editable zones and unaltered regions.
To tackle the remaining challenges, we employ versatile strategies, including data augmentation and supplementary eye guidance.
arXiv Detail & Related papers (2023-11-03T09:41:51Z) - Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z) - Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation.
For the first time in a unified framework, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference.
For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z) - Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z) - Siamese Network with Interactive Transformer for Video Object Segmentation [34.202137199782804]
We propose a network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames.
We employ the same backbone to extract features of both past and current frames, which enables feature reuse and is more efficient than existing methods.
arXiv Detail & Related papers (2021-12-28T03:38:17Z) - TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers while achieving higher performance.
arXiv Detail & Related papers (2021-04-17T13:35:24Z) - VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z) - Hierarchical Conditional Relation Networks for Multimodal Video Question Answering [67.85579756590478]
Video QA adds at least two more layers of complexity, such as selecting relevant content for each channel in the context of a linguistic query.
Conditional Relation Network (CRN) takes as input a set of tensorial objects and translates them into a new set of objects that encode relations among the inputs.
CRN is then applied for Video QA in two forms, short-form where answers are reasoned solely from the visual content, and long-form where associated information, such as subtitles, is presented.
arXiv Detail & Related papers (2020-10-18T02:31:06Z) - Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.