Redundancy-aware Transformer for Video Question Answering
- URL: http://arxiv.org/abs/2308.03267v1
- Date: Mon, 7 Aug 2023 03:16:24 GMT
- Title: Redundancy-aware Transformer for Video Question Answering
- Authors: Yicong Li, Xun Yang, An Zhang, Chun Feng, Xiang Wang, Tat-Seng Chua
- Abstract summary: We propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner.
To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in neighboring frames.
As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling, which explicitly differentiates the vision-language interactions.
- Score: 71.98116071679065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper identifies two kinds of redundancy in the current VideoQA
paradigm. Specifically, current video encoders tend to holistically embed
all video clues at different granularities in a hierarchical manner, which
inevitably introduces *neighboring-frame redundancy* that can overwhelm
detailed visual clues at the object level. In addition, prevailing
vision-language fusion designs introduce *cross-modal redundancy* by
exhaustively fusing all visual elements with question tokens without explicitly
differentiating their pairwise vision-language interactions, which degrades
answering.
To this end, we propose a novel transformer-based architecture that aims to
model VideoQA in a redundancy-aware manner. To address the neighboring-frame
redundancy, we introduce a video encoder structure that emphasizes the
object-level change in neighboring frames, while adopting an out-of-neighboring
message-passing scheme that imposes attention only on distant frames. As for
the cross-modal redundancy, we equip our fusion module with a novel adaptive
sampling, which explicitly differentiates the vision-language interactions by
identifying a small subset of visual elements that exclusively support the
answer. With these two designs in place, the resulting Redundancy-aware
transformer (RaFormer) achieves state-of-the-art results on multiple VideoQA
benchmarks.
Related papers
- RepVideo: Rethinking Cross-Layer Representation for Video Generation [53.701548524818534]
We propose RepVideo, an enhanced representation framework for text-to-video diffusion models.
By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information.
Our experiments demonstrate that our RepVideo not only significantly enhances the ability to generate accurate spatial appearances, but also improves temporal consistency in video generation.
arXiv Detail & Related papers (2025-01-15T18:20:37Z) - Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering [13.294004180200496]
We introduce Local-Global Question Aware Video Embedding (LGQAVE), which incorporates three major innovations to integrate multi-modal knowledge better.
LGQAVE moves beyond traditional ad-hoc frame sampling by using a cross-attention mechanism that identifies the frames most relevant to the question (a minimal sketch of this kind of question-conditioned frame selection appears after this list).
An additional cross-attention module integrates these local and global embeddings to generate the final video embeddings, which a language model uses to generate answers.
arXiv Detail & Related papers (2024-12-12T12:39:07Z) - DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with
Diffusion Auto-encoder [21.405442790474268]
We propose DiffDub: Diffusion-based dubbing.
We first craft the diffusion auto-encoder with an inpainting renderer that incorporates a mask to delineate editable zones and unaltered regions.
To tackle these issues, we employ versatile strategies, including data augmentation and supplementary eye guidance.
arXiv Detail & Related papers (2023-11-03T09:41:51Z) - Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures.
We introduce an adaptive frame selection scheme to reduce the number of required incoming frames.
We estimate semantic concepts relevant for video captioning by aggregating all ground-truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z) - Siamese Network with Interactive Transformer for Video Object
Segmentation [34.202137199782804]
We propose a network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames.
We employ a shared backbone to extract features of both past and current frames, which enables feature reuse and is more efficient than existing methods.
arXiv Detail & Related papers (2021-12-28T03:38:17Z) - TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image.
We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
arXiv Detail & Related papers (2021-04-17T13:35:24Z) - VX2TEXT: End-to-End Learning of Video-Based Text Generation From
Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio.
Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z) - Hierarchical Conditional Relation Networks for Multimodal Video Question
Answering [67.85579756590478]
Video QA adds at least two more layers of complexity - selecting relevant content for each channel in the context of a linguistic query.
Conditional Relation Network (CRN) takes as input a set of tensorial objects and translates them into a new set of objects that encode relations of the inputs.
CRN is then applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content, and long-form, where associated information, such as subtitles, is presented.
arXiv Detail & Related papers (2020-10-18T02:31:06Z) - Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
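Several of the related papers above (LGQAVE, the adaptive spatio-temporal attention captioner, and, on the fusion side, TransVG) rely on text-conditioned selection of frames or visual tokens. As referenced in the LGQAVE entry, here is a minimal, hypothetical sketch of question-conditioned frame selection via cross-attention; the names, dimensions, and scoring rule are illustrative and not taken from any of these papers.

```python
# Illustrative sketch only: score frames by the attention mass they receive
# from question tokens, then keep the top-scoring frames in temporal order.
import torch
import torch.nn.functional as F

def select_frames(frame_feats, question_tokens, num_keep=4):
    """frame_feats: (T, d), question_tokens: (L, d)."""
    d = frame_feats.size(1)
    scores = question_tokens @ frame_feats.t() / d ** 0.5   # (L, T)
    attn = F.softmax(scores, dim=-1)                        # each question token over frames
    frame_score = attn.sum(dim=0)                           # (T,) mass received per frame
    keep = torch.topk(frame_score, k=min(num_keep, frame_feats.size(0))).indices
    keep, _ = torch.sort(keep)                              # preserve temporal order
    return frame_feats[keep], keep

frames = torch.randn(32, 512)        # 32 candidate frames
q_tokens = torch.randn(12, 512)      # 12 question-token embeddings
selected, kept_idx = select_frames(frames, q_tokens, num_keep=4)
```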