Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion
- URL: http://arxiv.org/abs/2306.08889v3
- Date: Fri, 7 Jun 2024 05:45:02 GMT
- Title: Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion
- Authors: Ishaan Singh Rawal, Alexander Matyasko, Shantanu Jaiswal, Basura Fernando, Cheston Tan,
- Abstract summary: VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
- Score: 54.33764537135906
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While VideoQA Transformer models demonstrate competitive performance on standard benchmarks, the reasons behind their success are not fully understood. Do these models capture the rich multimodal structures and dynamics from video and text jointly? Or are they achieving high scores by exploiting biases and spurious features? Hence, to provide insights, we design $\textit{QUAG}$ (QUadrant AveraGe), a lightweight and non-parametric probe, to conduct dataset-model combined representation analysis by impairing modality fusion. We find that the models achieve high performance on many datasets without leveraging multimodal representations. To validate QUAG further, we design $\textit{QUAG-attention}$, a less-expressive replacement of self-attention with restricted token interactions. Models with QUAG-attention achieve similar performance with significantly fewer multiplication operations without any finetuning. Our findings raise doubts about the current models' abilities to learn highly-coupled multimodal representations. Hence, we design the $\textit{CLAVI}$ (Complements in LAnguage and VIdeo) dataset, a stress-test dataset curated by augmenting real-world videos to have high modality coupling. Consistent with the findings of QUAG, we find that most of the models achieve near-trivial performance on CLAVI. This reasserts the limitations of current models for learning highly-coupled multimodal representations, that is not evaluated by the current datasets (project page: https://dissect-videoqa.github.io ).
Related papers
- Data-Juicer Sandbox: A Comprehensive Suite for Multimodal Data-Model Co-development [67.55944651679864]
We present a novel sandbox suite tailored for integrated data-model co-development.
This sandbox provides a comprehensive experimental platform, enabling rapid iteration and insight-driven refinement of both data and models.
We also uncover fruitful insights gleaned from exhaustive benchmarks, shedding light on the critical interplay between data quality, diversity, and model behavior.
arXiv Detail & Related papers (2024-07-16T14:40:07Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized
Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - UnIVAL: Unified Model for Image, Video, Audio and Language Tasks [105.77733287326308]
UnIVAL model goes beyond two modalities and unifies text, images, video, and audio into a single model.
Our model is efficiently pretrained on many tasks, based on task balancing and multimodal curriculum learning.
Thanks to the unified model, we propose a novel study on multimodal model merging via weight generalization.
arXiv Detail & Related papers (2023-07-30T09:48:36Z) - Multimodal Distillation for Egocentric Action Recognition [41.821485757189656]
egocentric video understanding involves modelling hand-object interactions.
Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well.
But their performance improves further by employing additional input modalities that provide complementary cues.
The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time.
arXiv Detail & Related papers (2023-07-14T17:07:32Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer based architecture that uses fusion' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z) - Does my multimodal model learn cross-modal interactions? It's harder to
tell than you might think! [26.215781778606168]
Cross-modal modeling seems crucial in multimodal tasks, such as visual question answering.
We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task.
For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation.
arXiv Detail & Related papers (2020-10-13T17:45:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.