Multimodal Graph Transformer for Multimodal Question Answering
- URL: http://arxiv.org/abs/2305.00581v1
- Date: Sun, 30 Apr 2023 21:22:35 GMT
- Title: Multimodal Graph Transformer for Multimodal Question Answering
- Authors: Xuehai He, Xin Eric Wang
- Abstract summary: We propose a novel Multimodal Graph Transformer for question answering tasks that require reasoning across multiple modalities.
We introduce a graph-involved plug-and-play quasi-attention mechanism to incorporate multimodal graph information.
We validate the effectiveness of Multimodal Graph Transformer over its Transformer baselines on GQA, VQAv2, and MultiModalQA datasets.
- Score: 9.292566397511763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the success of Transformer models in vision and language tasks, they
often learn knowledge from enormous data implicitly and cannot utilize
structured input data directly. On the other hand, structured learning
approaches such as graph neural networks (GNNs) that integrate prior
information can barely compete with Transformer models. In this work, we aim to
benefit from both worlds and propose a novel Multimodal Graph Transformer for
question answering tasks that require reasoning across multiple
modalities. We introduce a graph-involved plug-and-play quasi-attention
mechanism to incorporate multimodal graph information, acquired from text and
visual data, into the vanilla self-attention as an effective prior. In particular,
we construct the text graph, dense region graph, and semantic graph to generate
adjacency matrices, and then compose them with input vision and language
features to perform downstream reasoning. Such a way of regularizing
self-attention with graph information significantly improves the model's inference
ability and helps align features from different modalities. We validate the
effectiveness of Multimodal Graph Transformer over its Transformer baselines on
GQA, VQAv2, and MultiModalQA datasets.
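To make the quasi-attention mechanism concrete, here is a minimal PyTorch sketch of graph-regularized self-attention, assuming the merged multimodal graph enters as a mask on the attention logits. The function name, tensor shapes, and the hard-masking scheme are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def graph_masked_attention(q, k, v, adj):
    # q, k, v: (batch, seq, dim) projections of the concatenated
    # vision-and-language token sequence.
    # adj: (batch, seq, seq) adjacency matrix merged from the text,
    # dense-region, and semantic graphs; assumed to include self-loops
    # so every token can attend to at least itself.
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5           # vanilla attention scores
    logits = logits.masked_fill(adj == 0, float("-inf"))  # graph as a prior
    return F.softmax(logits, dim=-1) @ v

# Toy usage: six vision-and-language tokens in one sequence.
b, n, dim = 1, 6, 16
x = torch.randn(b, n, dim)
adj = torch.eye(n).unsqueeze(0)  # self-loops only, purely for illustration
out = graph_masked_attention(x, x, x, adj)
print(out.shape)  # torch.Size([1, 6, 16])
```

Masking the logits before the softmax zeroes attention between tokens the graph treats as unrelated, which is one simple way such a prior can regularize vanilla self-attention.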
Related papers
- Task-Oriented Communication for Graph Data: A Graph Information Bottleneck Approach [12.451324619122405]
This paper introduces a method to extract a smaller, task-focused subgraph that maintains key information while reducing communication overhead.
Our approach utilizes graph neural networks (GNNs) and the graph information bottleneck (GIB) principle to create a compact, informative, and robust graph representation suitable for transmission.
arXiv Detail & Related papers (2024-09-04T14:01:56Z)
- A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
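As a rough illustration of random-walk context sampling, a short, self-contained sketch follows; the adjacency-list encoding, walk length, and function name are assumptions for illustration, not GSPT's actual implementation.

```python
import random

def sample_context(neighbors, start, walk_len=8):
    # neighbors: dict mapping node id -> list of adjacent node ids.
    # Returns a node sequence that can stand in for the node's context
    # when feeding a structure-free Transformer.
    walk = [start]
    for _ in range(walk_len):
        nbrs = neighbors[walk[-1]]
        if not nbrs:          # dead end: stop the walk early
            break
        walk.append(random.choice(nbrs))
    return walk

# Toy graph: a path 0-1-2 with an extra node 3 attached to 1.
graph = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
print(sample_context(graph, start=0))
```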
arXiv Detail & Related papers (2024-06-19T22:30:08Z)
- GraSAME: Injecting Token-Level Structural Information to Pretrained Language Models via Graph-guided Self-Attention Mechanism [10.573861741540853]
We propose a graph-guided self-attention mechanism, GraSAME, for pretrained language models.
GraSAME seamlessly incorporates token-level structural information into PLMs without necessitating additional alignment or concatenation efforts.
Our experiments on the graph-to-text generation task demonstrate that GraSAME outperforms baseline models and achieves results comparable to state-of-the-art (SOTA) models on WebNLG datasets.
arXiv Detail & Related papers (2024-04-10T11:03:57Z)
- When Graph Data Meets Multimodal: A New Paradigm for Graph Understanding and Reasoning [54.84870836443311]
The paper presents a new paradigm for understanding and reasoning about graph data by integrating image encoding and multimodal technologies.
This approach enables the comprehension of graph data through an instruction-response format, utilizing GPT-4V's advanced capabilities.
The study evaluates this paradigm on various graph types, highlighting the model's strengths and weaknesses, particularly in Chinese OCR performance and complex reasoning tasks.
arXiv Detail & Related papers (2023-12-16T08:14:11Z)
- Deep Prompt Tuning for Graph Transformers [55.2480439325792]
Fine-tuning is resource-intensive and requires storing multiple copies of large models.
We propose a novel approach called deep graph prompt tuning as an alternative to fine-tuning.
By freezing the pre-trained parameters and only updating the added tokens, our approach reduces the number of free parameters and eliminates the need for multiple model copies.
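For intuition, here is a minimal PyTorch sketch of the general prompt-tuning recipe the summary describes, assuming prompts are learnable vectors prepended to the input token sequence; the class and parameter names are hypothetical, not the paper's API.

```python
import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    def __init__(self, pretrained_encoder, dim, num_prompts=10):
        super().__init__()
        self.encoder = pretrained_encoder
        for p in self.encoder.parameters():
            p.requires_grad = False            # backbone stays frozen
        # The only trainable parameters: a handful of prompt vectors.
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)

    def forward(self, tokens):                 # tokens: (batch, seq, dim)
        b = tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        return self.encoder(torch.cat([prompts, tokens], dim=1))

# Toy usage with a generic Transformer layer as the frozen backbone.
backbone = nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True)
model = PromptedEncoder(backbone, dim=32)
out = model(torch.randn(2, 5, 32))             # only the prompts train
```

Because only `num_prompts * dim` values are updated, one frozen backbone can serve many tasks, which is what removes the need for multiple model copies.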
arXiv Detail & Related papers (2023-09-18T20:12:17Z)
- MMGA: Multimodal Learning with Graph Alignment [8.349066399479938]
We propose MMGA, a novel multimodal pre-training framework that incorporates information from the graph (social network), image, and text modalities on social media.
In MMGA, a multi-step graph alignment mechanism is proposed to add self-supervision from the graph modality and optimize the image and text encoders.
We release our dataset, the first multimodal social media dataset with graph structure, comprising 60,000 users labeled with specific topics based on 2 million posts, to facilitate future research.
arXiv Detail & Related papers (2022-10-18T15:50:31Z)
- Dynamic Graph Message Passing Networks for Visual Recognition [112.49513303433606]
Modelling long-range dependencies is critical for scene understanding tasks in computer vision.
A fully-connected graph is beneficial for such modelling, but its computational overhead is prohibitive.
We propose a dynamic graph message passing network that significantly reduces the computational complexity.
arXiv Detail & Related papers (2022-09-20T14:41:37Z)
- Transformer for Graphs: An Overview from Architecture Perspective [86.3545861392215]
It is imperative to sort out the existing Transformer models for graphs and systematically investigate their effectiveness on various graph tasks.
We first disassemble the existing models and identify three typical ways to incorporate graph information into the vanilla Transformer.
Our experiments confirm the benefits of current graph-specific modules on Transformer and reveal their advantages on different kinds of graph tasks.
arXiv Detail & Related papers (2022-02-17T06:02:06Z)
- GraphFormers: GNN-nested Transformers for Representation Learning on Textual Graph [53.70520466556453]
We propose GraphFormers, where layerwise GNN components are nested alongside the transformer blocks of language models.
With the proposed architecture, the text encoding and the graph aggregation are fused into an iterative workflow.
In addition, a progressive learning strategy is introduced, where the model is successively trained on manipulated and original data to reinforce its capability of integrating information from the graph.
arXiv Detail & Related papers (2021-05-06T12:20:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.