MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual
Question Answering
- URL: http://arxiv.org/abs/2010.14095v1
- Date: Tue, 27 Oct 2020 06:34:14 GMT
- Title: MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual
Question Answering
- Authors: Aisha Urooj Khan, Amir Mazaheri, Niels da Vitoria Lobo, Mubarak Shah
- Abstract summary: We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA).
Our approach processes each modality with its own BERT encoder and uses a novel transformer-based fusion method to fuse the resulting encodings together.
- Score: 68.40719618351429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA) while ensuring both individual and combined processing of multiple input modalities. Our approach processes multimodal data (video and text) by adopting BERT encodings for each modality individually and using a novel transformer-based fusion method to fuse them together. Our method decomposes the different modality sources into separate BERT instances with similar architectures but variable weights. This achieves SOTA results on the TVQA dataset. Additionally, we provide TVQA-Visual, an isolated diagnostic subset of TVQA that strictly requires knowledge of the visual (V) modality, based on a human annotator's judgment. This set of questions helps us study the model's behavior and the challenges TVQA poses that prevent the achievement of super-human performance. Extensive experiments show the effectiveness and superiority of our method.
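Below is a minimal, hypothetical sketch of the two-stream idea in the abstract: each modality is encoded by its own BERT instance, and a small transformer fuses the pooled encodings before answer classification. The layer sizes, the learnable FUSE-style token, and the classifier head are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MMFTSketch(nn.Module):
    """Sketch only: one BERT instance per modality, fused by a small transformer."""
    def __init__(self, num_answers: int = 5, hidden: int = 768):
        super().__init__()
        # Separate BERT instances: same architecture, independent weights.
        self.q_bert = BertModel.from_pretrained("bert-base-uncased")
        self.v_bert = BertModel.from_pretrained("bert-base-uncased")
        # Hypothetical learnable token acting as a joint [CLS] for fusion.
        self.fuse_token = nn.Parameter(torch.randn(1, 1, hidden))
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, q_ids, q_mask, v_ids, v_mask):
        # Encode each modality individually with its own BERT.
        q = self.q_bert(input_ids=q_ids, attention_mask=q_mask).pooler_output
        v = self.v_bert(input_ids=v_ids, attention_mask=v_mask).pooler_output
        b = q.size(0)
        # Fuse the per-modality encodings with a small transformer.
        seq = torch.cat([self.fuse_token.expand(b, -1, -1),
                         torch.stack([q, v], dim=1)], dim=1)   # (B, 3, H)
        fused = self.fusion(seq)[:, 0]                          # read the fuse token
        return self.classifier(fused)
```

Reading the answer logits from the fuse-token position mirrors how a [CLS] token summarizes a sequence; here it summarizes the set of modality encodings instead.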
Related papers
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
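A hedged sketch of the modular idea described above, under assumed structure (the module shapes, modality names, and summation-based fusion are illustrative, not CREMA's actual code): one lightweight fusion module per modality, trained one modality at a time.

```python
import torch.nn as nn

class ModularFusionSketch(nn.Module):
    """Sketch: lightweight per-modality fusion modules on a shared feature space."""
    def __init__(self, hidden: int = 256,
                 modalities=("video", "audio", "depth", "touch", "thermal")):
        super().__init__()
        # One bottlenecked (hence lightweight) module per modality.
        self.fusers = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(hidden, hidden // 4),
                             nn.GELU(),
                             nn.Linear(hidden // 4, hidden))
            for m in modalities
        })

    def forward(self, features):
        # features: {modality_name: (B, hidden) tensor}; sum the adapted streams.
        return sum(self.fusers[m](x) for m, x in features.items())

    def set_trainable_modality(self, modality: str):
        # Modality-sequential training: unfreeze one module at a time.
        for name, module in self.fusers.items():
            module.requires_grad_(name == modality)
```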
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - Incorporating Probing Signals into Multimodal Machine Translation via
Visual Question-Answering Pairs [45.41083125321069]
Multimodal machine translation (MMT) systems exhibit decreased sensitivity to visual information when text inputs are complete.
A novel approach is proposed to generate parallel Visual Question-Answering (VQA) style pairs from the source text.
An MMT-VQA multitask learning framework is introduced to incorporate explicit probing signals from the dataset into the MMT training process.
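A minimal sketch of how such a multitask objective could be combined, assuming a simple weighted sum (the weighting scheme is a hypothetical placeholder, not the paper's exact objective):

```python
import torch

def mmt_vqa_objective(loss_mmt: torch.Tensor,
                      loss_vqa: torch.Tensor,
                      vqa_weight: float = 0.5) -> torch.Tensor:
    # Hypothetical weighting: the VQA head supplies an explicit probing
    # signal that keeps the translation model sensitive to the image.
    return loss_mmt + vqa_weight * loss_vqa
```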
arXiv Detail & Related papers (2023-10-26T04:13:49Z) - Exchanging-based Multimodal Fusion with Transformer [19.398692598523454]
We study the problem of multimodal fusion in this paper.
Recent exchanging-based methods have been proposed for vision-vision fusion, which aim to exchange embeddings learned from one modality to the other.
We propose a novel exchanging-based multimodal fusion model MuSE for text-vision fusion based on Transformer.
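A hedged sketch of the exchanging idea: low-importance channels in one modality's embedding are replaced by the other modality's channels, so information crosses streams without extra fusion parameters. The gate tensors and threshold rule are illustrative assumptions, not MuSE's exact criterion.

```python
import torch

def exchange_channels(text_emb: torch.Tensor,
                      vision_emb: torch.Tensor,
                      text_gate: torch.Tensor,
                      vision_gate: torch.Tensor,
                      threshold: float = 0.02):
    """Sketch of exchanging-based fusion for aligned (B, L, C) embeddings.

    text_gate, vision_gate: (C,) learned per-channel importance scores
    (hypothetical; e.g. BatchNorm scaling factors in prior exchanging work).
    """
    swap_t = text_gate.abs() < threshold    # text channels deemed unimportant
    swap_v = vision_gate.abs() < threshold
    # Replace unimportant channels with the other modality's channels.
    fused_text = torch.where(swap_t, vision_emb, text_emb)
    fused_vision = torch.where(swap_v, text_emb, vision_emb)
    return fused_text, fused_vision
```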
arXiv Detail & Related papers (2023-09-05T12:48:25Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
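A minimal sketch of the cascaded-selection idea under assumed shapes (the top-k values, scoring rule, and feature dimensions are hypothetical, not the authors' exact modules): first select the most question-relevant segments, then the most relevant regions inside them, and attend only over what survives.

```python
import torch
import torch.nn as nn

class CascadedSelectionSketch(nn.Module):
    """Sketch: segment selection, then region selection, then attention."""
    def __init__(self, hidden: int = 256, k_seg: int = 2, k_reg: int = 4):
        super().__init__()
        self.k_seg, self.k_reg = k_seg, k_reg
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, q, segs, regs):
        # q: (B, H) question; segs: (B, S, H) segments; regs: (B, S, R, H) regions
        B = q.size(0)
        seg_scores = (segs * q.unsqueeze(1)).sum(-1)            # (B, S)
        top_s = seg_scores.topk(self.k_seg, dim=1).indices      # keep k_seg segments
        sel = regs[torch.arange(B).unsqueeze(1), top_s]         # (B, k_seg, R, H)
        reg_scores = (sel * q[:, None, None, :]).sum(-1)        # (B, k_seg, R)
        flat = sel.flatten(1, 2)                                # (B, k_seg*R, H)
        top_r = reg_scores.flatten(1).topk(self.k_reg, dim=1).indices
        picked = flat[torch.arange(B).unsqueeze(1), top_r]      # (B, k_reg, H)
        # Attend only over the selected regions instead of dense attention.
        out, _ = self.attn(q.unsqueeze(1), picked, picked)
        return out.squeeze(1)
```

The cost saving comes from attending over k_reg selected regions rather than all S*R of them.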
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
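A hedged sketch of one such adapter-plus-fusion block, with hypothetical sizes and a sigmoid gate standing in for the fusion layer (not the authors' exact design): a bottleneck adapter tunes the frozen BERT layer's output, then a gate mixes in the audio-visual features for that layer.

```python
import torch
import torch.nn as nn

class AdapterFusionSketch(nn.Module):
    """Sketch: residual bottleneck adapter followed by gated layer-wise fusion."""
    def __init__(self, hidden: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)   # adapter down-projection
        self.up = nn.Linear(bottleneck, hidden)     # adapter up-projection
        self.gate = nn.Linear(2 * hidden, hidden)   # fusion gate

    def forward(self, text_h: torch.Tensor, av_h: torch.Tensor) -> torch.Tensor:
        # text_h: (B, L, H) frozen BERT layer output; av_h: (B, L, H) audio-visual.
        adapted = text_h + self.up(torch.relu(self.down(text_h)))  # residual adapter
        g = torch.sigmoid(self.gate(torch.cat([adapted, av_h], dim=-1)))
        return adapted + g * av_h                                  # layer-wise fusion
```

Because only the adapter and gate are trained, the block stays small relative to full fine-tuning, which is where the efficiency claim comes from.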
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge
Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer obtains SOTA performance on four datasets across multimodal link prediction, multimodal relation extraction (RE), and multimodal named entity recognition (NER).
arXiv Detail & Related papers (2022-05-04T23:40:04Z) - MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One
More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to choose those with the best prior knowledge.
arXiv Detail & Related papers (2022-03-14T13:15:09Z) - schuBERT: Optimizing Elements of BERT [22.463154358632472]
We revisit the architecture choices of BERT in an effort to obtain a lighter model.
We show that substantially more efficient, lighter BERT models can be obtained by reducing algorithmically chosen architecture design dimensions.
In particular, our schuBERT gives 6.6% higher average accuracy on the GLUE and SQuAD datasets compared to a BERT with three encoder layers.
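A hedged illustration of shrinking selected design dimensions via a standard BERT configuration (the specific values are hypothetical placeholders, not schuBERT's algorithmically chosen ones):

```python
from transformers import BertConfig, BertForSequenceClassification

# Illustrative only: reduce a few design dimensions of BERT.
config = BertConfig(
    num_hidden_layers=3,      # fewer encoder layers, as in the comparison above
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=1536,   # reduced feed-forward width (hypothetical)
)
model = BertForSequenceClassification(config)
print(sum(p.numel() for p in model.parameters()))  # inspect the parameter budget
```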
arXiv Detail & Related papers (2020-05-09T21:56:04Z)