MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual
Question Answering
- URL: http://arxiv.org/abs/2010.14095v1
- Date: Tue, 27 Oct 2020 06:34:14 GMT
- Title: MMFT-BERT: Multimodal Fusion Transformer with BERT Encodings for Visual
Question Answering
- Authors: Aisha Urooj Khan, Amir Mazaheri, Niels da Vitoria Lobo, Mubarak Shah
- Abstract summary: We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA).
Our approach processes each modality with its own BERT encoder and uses a novel transformer-based fusion method to fuse the resulting encodings together.
- Score: 68.40719618351429
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present MMFT-BERT (MultiModal Fusion Transformer with BERT encodings) to solve Visual Question Answering (VQA) while ensuring both individual and combined processing of multiple input modalities. Our approach processes multimodal data (video and text) by adopting BERT encodings for each modality individually and using a novel transformer-based fusion method to fuse them together. Our method decomposes the different modality sources into separate BERT instances with similar architectures but variable weights. This achieves SOTA results on the TVQA dataset. Additionally, we provide TVQA-Visual, an isolated diagnostic subset of TVQA that strictly requires knowledge of the visual (V) modality, based on a human annotator's judgment. This set of questions helps us study the model's behavior and the challenges TVQA poses that prevent the achievement of super-human performance. Extensive experiments show the effectiveness and superiority of our method.
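Below is a minimal, hypothetical sketch of the two-stream idea in the abstract: each modality is encoded by its own BERT instance, and a small transformer fuses the pooled encodings before answer classification. The layer sizes, the learnable FUSE-style token, and the classifier head are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class MMFTSketch(nn.Module):
    """Sketch only: one BERT instance per modality, fused by a small transformer."""
    def __init__(self, num_answers: int = 5, hidden: int = 768):
        super().__init__()
        # Separate BERT instances: same architecture, independent weights.
        self.q_bert = BertModel.from_pretrained("bert-base-uncased")
        self.v_bert = BertModel.from_pretrained("bert-base-uncased")
        # Hypothetical learnable token acting as a joint [CLS] for fusion.
        self.fuse_token = nn.Parameter(torch.randn(1, 1, hidden))
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, q_ids, q_mask, v_ids, v_mask):
        # Encode each modality individually with its own BERT.
        q = self.q_bert(input_ids=q_ids, attention_mask=q_mask).pooler_output
        v = self.v_bert(input_ids=v_ids, attention_mask=v_mask).pooler_output
        b = q.size(0)
        # Fuse the per-modality encodings with a small transformer.
        seq = torch.cat([self.fuse_token.expand(b, -1, -1),
                         torch.stack([q, v], dim=1)], dim=1)   # (B, 3, H)
        fused = self.fusion(seq)[:, 0]                          # read the fuse token
        return self.classifier(fused)
```

Reading the answer logits from the fuse-token position mirrors how a [CLS] token summarizes a sequence; here it summarizes the set of modality encodings instead.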
Related papers
- CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
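A hedged sketch of the modular idea described above, under assumed structure (the module shapes, modality names, and summation-based fusion are illustrative, not CREMA's actual code): one lightweight fusion module per modality, trained one modality at a time.

```python
import torch.nn as nn

class ModularFusionSketch(nn.Module):
    """Sketch: lightweight per-modality fusion modules on a shared feature space."""
    def __init__(self, hidden: int = 256,
                 modalities=("video", "audio", "depth", "touch", "thermal")):
        super().__init__()
        # One bottlenecked (hence lightweight) module per modality.
        self.fusers = nn.ModuleDict({
            m: nn.Sequential(nn.Linear(hidden, hidden // 4),
                             nn.GELU(),
                             nn.Linear(hidden // 4, hidden))
            for m in modalities
        })

    def forward(self, features):
        # features: {modality_name: (B, hidden) tensor}; sum the adapted streams.
        return sum(self.fusers[m](x) for m, x in features.items())

    def set_trainable_modality(self, modality: str):
        # Modality-sequential training: unfreeze one module at a time.
        for name, module in self.fusers.items():
            module.requires_grad_(name == modality)
```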
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - Incorporating Probing Signals into Multimodal Machine Translation via
Visual Question-Answering Pairs [45.41083125321069]
Multimodal machine translation (MMT) systems exhibit decreased sensitivity to visual information when text inputs are complete.
A novel approach is proposed to generate parallel Visual Question-Answering (VQA) style pairs from the source text.
An MMT-VQA multitask learning framework is introduced to incorporate explicit probing signals from the dataset into the MMT training process.
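A minimal sketch of how such a multitask objective could be combined, assuming a simple weighted sum (the weighting scheme is a hypothetical placeholder, not the paper's exact objective):

```python
import torch

def mmt_vqa_objective(loss_mmt: torch.Tensor,
                      loss_vqa: torch.Tensor,
                      vqa_weight: float = 0.5) -> torch.Tensor:
    # Hypothetical weighting: the VQA head supplies an explicit probing
    # signal that keeps the translation model sensitive to the image.
    return loss_mmt + vqa_weight * loss_vqa
```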
arXiv Detail & Related papers (2023-10-26T04:13:49Z) - Exchanging-based Multimodal Fusion with Transformer [19.398692598523454]
We study the problem of multimodal fusion in this paper.
Recent exchanging-based methods have been proposed for vision-vision fusion, which aim to exchange embeddings learned from one modality to the other.
We propose a novel exchanging-based multimodal fusion model MuSE for text-vision fusion based on Transformer.
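A hedged sketch of the exchanging idea: low-importance channels in one modality's embedding are replaced by the other modality's channels, so information crosses streams without extra fusion parameters. The gate tensors and threshold rule are illustrative assumptions, not MuSE's exact criterion.

```python
import torch

def exchange_channels(text_emb: torch.Tensor,
                      vision_emb: torch.Tensor,
                      text_gate: torch.Tensor,
                      vision_gate: torch.Tensor,
                      threshold: float = 0.02):
    """Sketch of exchanging-based fusion for aligned (B, L, C) embeddings.

    text_gate, vision_gate: (C,) learned per-channel importance scores
    (hypothetical; e.g. BatchNorm scaling factors in prior exchanging work).
    """
    swap_t = text_gate.abs() < threshold    # text channels deemed unimportant
    swap_v = vision_gate.abs() < threshold
    # Replace unimportant channels with the other modality's channels.
    fused_text = torch.where(swap_t, vision_emb, text_emb)
    fused_vision = torch.where(swap_v, text_emb, vision_emb)
    return fused_text, fused_vision
```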
arXiv Detail & Related papers (2023-09-05T12:48:25Z) - MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form
Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
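A minimal sketch of the cascaded-selection idea under assumed shapes (the top-k values, scoring rule, and feature dimensions are hypothetical, not the authors' exact modules): first select the most question-relevant segments, then the most relevant regions inside them, and attend only over what survives.

```python
import torch
import torch.nn as nn

class CascadedSelectionSketch(nn.Module):
    """Sketch: segment selection, then region selection, then attention."""
    def __init__(self, hidden: int = 256, k_seg: int = 2, k_reg: int = 4):
        super().__init__()
        self.k_seg, self.k_reg = k_seg, k_reg
        self.attn = nn.MultiheadAttention(hidden, num_heads=4, batch_first=True)

    def forward(self, q, segs, regs):
        # q: (B, H) question; segs: (B, S, H) segments; regs: (B, S, R, H) regions
        B = q.size(0)
        seg_scores = (segs * q.unsqueeze(1)).sum(-1)            # (B, S)
        top_s = seg_scores.topk(self.k_seg, dim=1).indices      # keep k_seg segments
        sel = regs[torch.arange(B).unsqueeze(1), top_s]         # (B, k_seg, R, H)
        reg_scores = (sel * q[:, None, None, :]).sum(-1)        # (B, k_seg, R)
        flat = sel.flatten(1, 2)                                # (B, k_seg*R, H)
        top_r = reg_scores.flatten(1).topk(self.k_reg, dim=1).indices
        picked = flat[torch.arange(B).unsqueeze(1), top_r]      # (B, k_reg, H)
        # Attend only over the selected regions instead of dense attention.
        out, _ = self.attn(q.unsqueeze(1), picked, picked)
        return out.squeeze(1)
```

The cost saving comes from attending over k_reg selected regions rather than all S*R of them.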
arXiv Detail & Related papers (2022-12-19T15:05:40Z) - Adapted Multimodal BERT with Layer-wise Fusion for Sentiment Analysis [84.12658971655253]
We propose Adapted Multimodal BERT, a BERT-based architecture for multimodal tasks.
The adapter adjusts the pretrained language model for the task at hand, while the fusion layers perform task-specific, layer-wise fusion of audio-visual information with textual BERT representations.
In our ablations, we see that this approach leads to efficient models that can outperform their fine-tuned counterparts and are robust to input noise.
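A hedged sketch of one such adapter-plus-fusion block, with hypothetical sizes and a sigmoid gate standing in for the fusion layer (not the authors' exact design): a bottleneck adapter tunes the frozen BERT layer's output, then a gate mixes in the audio-visual features for that layer.

```python
import torch
import torch.nn as nn

class AdapterFusionSketch(nn.Module):
    """Sketch: residual bottleneck adapter followed by gated layer-wise fusion."""
    def __init__(self, hidden: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden, bottleneck)   # adapter down-projection
        self.up = nn.Linear(bottleneck, hidden)     # adapter up-projection
        self.gate = nn.Linear(2 * hidden, hidden)   # fusion gate

    def forward(self, text_h: torch.Tensor, av_h: torch.Tensor) -> torch.Tensor:
        # text_h: (B, L, H) frozen BERT layer output; av_h: (B, L, H) audio-visual.
        adapted = text_h + self.up(torch.relu(self.down(text_h)))  # residual adapter
        g = torch.sigmoid(self.gate(torch.cat([adapted, av_h], dim=-1)))
        return adapted + g * av_h                                  # layer-wise fusion
```

Because only the adapter and gate are trained, the block stays small relative to full fine-tuning, which is where the efficiency claim comes from.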
arXiv Detail & Related papers (2022-12-01T17:31:42Z) - Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge
Graph Completion [112.27103169303184]
Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer obtains SOTA performance on four datasets across multimodal link prediction, multimodal relation extraction (RE), and multimodal named entity recognition (NER).
arXiv Detail & Related papers (2022-05-04T23:40:04Z) - MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One
More Step Towards Generalization [65.09758931804478]
Three different data sources are combined: weakly-supervised videos, crowd-labeled text-image pairs and text-video pairs.
A careful analysis of available pre-trained networks helps to choose those with the best prior knowledge.
arXiv Detail & Related papers (2022-03-14T13:15:09Z) - schuBERT: Optimizing Elements of BERT [22.463154358632472]
We revisit the architecture choices of BERT in an effort to obtain a lighter model.
We show that substantially more efficient, lighter BERT models can be obtained by reducing algorithmically chosen architecture design dimensions.
In particular, our schuBERT gives 6.6% higher average accuracy on the GLUE and SQuAD datasets compared to a BERT with three encoder layers.
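A hedged illustration of shrinking selected design dimensions via a standard BERT configuration (the specific values are hypothetical placeholders, not schuBERT's algorithmically chosen ones):

```python
from transformers import BertConfig, BertForSequenceClassification

# Illustrative only: reduce a few design dimensions of BERT.
config = BertConfig(
    num_hidden_layers=3,      # fewer encoder layers, as in the comparison above
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=1536,   # reduced feed-forward width (hypothetical)
)
model = BertForSequenceClassification(config)
print(sum(p.numel() for p in model.parameters()))  # inspect the parameter budget
```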
arXiv Detail & Related papers (2020-05-09T21:56:04Z)