Related papers: Redundancy-aware Transformer for Video Question Answering

URL: http://arxiv.org/abs/2308.03267v1
Date: Mon, 7 Aug 2023 03:16:24 GMT
Title: Redundancy-aware Transformer for Video Question Answering
Authors: Yicong Li, Xun Yang, An Zhang, Chun Feng, Xiang Wang, Tat-Seng Chua
Abstract summary: We propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner. To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in neighboring frames. As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling, which explicitly differentiates the vision-language interactions.
Score: 71.98116071679065
License: http://creativecommons.org/licenses/by/4.0/
Abstract: This paper identifies two kinds of redundancy in the current VideoQA paradigm. Specifically, current video encoders tend to holistically embed all video clues at different granularities in a hierarchical manner, which inevitably introduces neighboring-frame redundancy that can overwhelm detailed visual clues at the object level. In addition, prevailing vision-language fusion designs introduce cross-modal redundancy by exhaustively fusing all visual elements with question tokens without explicitly differentiating their pairwise vision-language interactions, which harms answering. To this end, we propose a novel transformer-based architecture that aims to model VideoQA in a redundancy-aware manner. To address the neighboring-frame redundancy, we introduce a video encoder structure that emphasizes the object-level change in neighboring frames, while adopting an out-of-neighboring message-passing scheme that imposes attention only on distant frames. As for the cross-modal redundancy, we equip our fusion module with a novel adaptive sampling, which explicitly differentiates the vision-language interactions by identifying a small subset of visual elements that exclusively support the answer. Building on these advancements, we find that this Redundancy-aware transformer (RaFormer) can achieve state-of-the-art results on multiple VideoQA benchmarks.
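The two mechanisms named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `window` parameter, and the toy scores are all hypothetical, and the fixed top-k selection is a plain stand-in for the paper's adaptive sampling, whose actual design is not described here.

```python
import math

def out_of_neighboring_mask(num_frames, window):
    """mask[i][j] is True when frame i may attend to frame j.

    Frames within `window` steps of i (including i itself) are blocked,
    so attention falls only on distant frames. `window` is an assumed knob.
    """
    return [[abs(i - j) > window for j in range(num_frames)]
            for i in range(num_frames)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, restricted to allowed positions."""
    kept = [s for s, ok in zip(scores, allowed) if ok]
    if not kept:
        return [0.0] * len(scores)
    peak = max(kept)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_elements(relevance, k):
    """Indices of the k visual elements with the highest relevance scores.

    A fixed-budget stand-in for adaptive sampling (assumption).
    """
    order = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(order[:k])

# Toy example: 5 frames, frame 0 may not attend to itself or to frame 1.
mask = out_of_neighboring_mask(num_frames=5, window=1)
weights = masked_softmax([0.2, 1.0, 0.5, 0.1, 0.9], mask[0])

# Toy example: keep the 2 most question-relevant of 4 visual elements.
selected = top_k_elements([0.1, 0.7, 0.3, 0.9], k=2)
```

In this sketch, the mask zeroes out attention weights for nearby frames before normalization, and the selection step passes only a small subset of visual elements on to fusion.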

Related papers

RepVideo: Rethinking Cross-Layer Representation for Video Generation [53.701548524818534]
We propose RepVideo, an enhanced representation framework for text-to-video diffusion models. By accumulating features from neighboring layers to form enriched representations, this approach captures more stable semantic information. Our experiments demonstrate that RepVideo not only significantly enhances the ability to generate accurate spatial appearances, but also improves temporal consistency in video generation.
arXiv Detail & Related papers (2025-01-15T18:20:37Z)
Foundation Models and Adaptive Feature Selection: A Synergistic Approach to Video Question Answering [13.294004180200496]
We introduce Local-Global Question Aware Video Embedding (LGQAVE), which incorporates three major innovations to integrate multi-modal knowledge better. LGQAVE moves beyond traditional ad-hoc frame sampling by utilizing a cross-attention mechanism that precisely identifies the most relevant frames concerning the questions. An additional cross-attention module integrates these local and global embeddings to generate the final video embeddings, which a language model uses to generate answers.
arXiv Detail & Related papers (2024-12-12T12:39:07Z)
DiffDub: Person-generic Visual Dubbing Using Inpainting Renderer with Diffusion Auto-encoder [21.405442790474268]
We propose DiffDub: Diffusion-based dubbing. We first craft the Diffusion auto-encoder with an inpainting renderer incorporating a mask to delineate editable zones and unaltered regions. To tackle these issues, we employ versatile strategies, including data augmentation and supplementary eye guidance.
arXiv Detail & Related papers (2023-11-03T09:41:51Z)
Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts. We obtain state-of-the-art performance on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z)
Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation [54.58405154065508]
We propose a Multi-modal Unified Temporal transformer for Referring video object segmentation. With a unified framework for the first time, MUTR adopts a DETR-style transformer and is capable of segmenting video objects designated by either text or audio reference. For high-level temporal interaction after the transformer, we conduct inter-frame feature communication for different object embeddings, contributing to better object-wise correspondence for tracking along the video.
arXiv Detail & Related papers (2023-05-25T17:59:47Z)
Diverse Video Captioning by Adaptive Spatio-temporal Attention [7.96569366755701]
Our end-to-end encoder-decoder video captioning framework incorporates two transformer-based architectures. We introduce an adaptive frame selection scheme to reduce the number of required incoming frames. We estimate semantic concepts relevant for video captioning by aggregating all ground truth captions of each sample.
arXiv Detail & Related papers (2022-08-19T11:21:59Z)
Siamese Network with Interactive Transformer for Video Object Segmentation [34.202137199782804]
We propose a network with a specifically designed interactive transformer, called SITVOS, to enable effective context propagation from historical to current frames. We employ the backbone architecture to extract backbone features of both past and current frames, which enables feature reuse and is more efficient than existing methods.
arXiv Detail & Related papers (2021-12-28T03:38:17Z)
TransVG: End-to-End Visual Grounding with Transformers [102.11922622103613]
We present a transformer-based framework for visual grounding, namely TransVG, to address the task of grounding a language query to an image. We show that the complex fusion modules can be replaced by a simple stack of transformer encoder layers with higher performance.
arXiv Detail & Related papers (2021-04-17T13:35:24Z)
VX2TEXT: End-to-End Learning of Video-Based Text Generation From Multimodal Inputs [103.99315770490163]
We present a framework for text generation from multimodal inputs consisting of video plus text, speech, or audio. Experiments demonstrate that our approach based on a single architecture outperforms the state-of-the-art on three video-based text-generation tasks.
arXiv Detail & Related papers (2021-01-28T15:22:36Z)
Hierarchical Conditional Relation Networks for Multimodal Video Question Answering [67.85579756590478]
Video QA adds at least two more layers of complexity - selecting relevant content for each channel in the context of a linguistic query. Conditional Relation Network (CRN) takes as input a set of tensorial objects and translates them into a new set of objects that encode relations of the inputs. CRN is then applied for Video QA in two forms, short-form where answers are reasoned solely from the visual content, and long-form where associated information, such as subtitles, is presented.
arXiv Detail & Related papers (2020-10-18T02:31:06Z)
Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation. An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder. In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.