Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge
Graph Completion
- URL: http://arxiv.org/abs/2205.02357v5
- Date: Mon, 18 Sep 2023 16:37:55 GMT
- Title: Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge
Graph Completion
- Authors: Xiang Chen, Ningyu Zhang, Lei Li, Shumin Deng, Chuanqi Tan, Changliang
Xu, Fei Huang, Luo Si, Huajun Chen
- Abstract summary: Multimodal Knowledge Graphs (MKGs) organize visual-text factual knowledge.
MKGformer achieves SOTA performance on four datasets covering multimodal link prediction, multimodal RE, and multimodal NER.
- Score: 112.27103169303184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Knowledge Graphs (MKGs), which organize visual-text factual
knowledge, have recently been successfully applied to tasks such as information
retrieval, question answering, and recommender systems. Since most MKGs are far
from complete, extensive knowledge graph completion studies have been proposed,
focusing on multimodal entity extraction, relation extraction, and link
prediction. However, different tasks and modalities require changes to the
model architecture, and not all images/objects are relevant to text input,
which hinders the applicability to diverse real-world scenarios. In this paper,
we propose a hybrid transformer with multi-level fusion to address those
issues. Specifically, we leverage a hybrid transformer architecture with
unified input-output for diverse multimodal knowledge graph completion tasks.
Moreover, we propose multi-level fusion, which integrates visual and textual
representations via coarse-grained prefix-guided interaction and fine-grained
correlation-aware fusion modules. We conduct extensive experiments to validate
that our MKGformer can obtain SOTA performance on four datasets of multimodal
link prediction, multimodal relation extraction (RE), and multimodal named
entity recognition (NER). Code is available at
https://github.com/zjunlp/MKGformer.
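
To make the two fusion levels concrete, here is a minimal, self-contained sketch of how coarse-grained prefix-guided interaction and fine-grained correlation-aware fusion could be wired up in PyTorch. It is not the authors' MKGformer implementation (see the linked repository for that); module names, shapes, and hyperparameters such as n_prefix and the number of attention heads are illustrative assumptions.

```python
# Illustrative sketch only -- not the official MKGformer code.
# Assumed shapes: text_tokens (B, L, d_model), visual_feat (B, d_visual),
# visual_tokens (B, P, d_model), e.g. ViT patch embeddings projected to d_model.
import torch
import torch.nn as nn


class PrefixGuidedInteraction(nn.Module):
    """Coarse-grained fusion: project the global visual feature into a few
    prefix tokens and prepend them to the text sequence before self-attention."""

    def __init__(self, d_visual: int, d_model: int, n_prefix: int = 4):
        super().__init__()
        self.to_prefix = nn.Linear(d_visual, n_prefix * d_model)
        self.n_prefix, self.d_model = n_prefix, d_model

    def forward(self, text_tokens: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
        prefix = self.to_prefix(visual_feat).view(-1, self.n_prefix, self.d_model)
        return torch.cat([prefix, text_tokens], dim=1)  # (B, n_prefix + L, d_model)


class CorrelationAwareFusion(nn.Module):
    """Fine-grained fusion: cross-attend from text tokens to visual tokens and
    gate the attended signal so weakly correlated patches contribute little."""

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.gate = nn.Linear(2 * d_model, 1)

    def forward(self, text_tokens: torch.Tensor, visual_tokens: torch.Tensor) -> torch.Tensor:
        attended, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        gate = torch.sigmoid(self.gate(torch.cat([text_tokens, attended], dim=-1)))
        return text_tokens + gate * attended  # weakly relevant images yield a small gate
```

The gating step mirrors the paper's motivation that not all images or objects are relevant to the text input: when the learned correlation is weak, the fused representation falls back toward the text-only one.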
Related papers
- StitchFusion: Weaving Any Visual Modalities to Enhance Multimodal Semantic Segmentation [63.31007867379312]
We propose StitchFusion, a framework that integrates large-scale pre-trained models directly as encoders and feature fusers.
We introduce a multi-directional adapter module (MultiAdapter) to enable cross-modal information transfer during encoding.
Our model achieves state-of-the-art performance on four multi-modal segmentation datasets with minimal additional parameters.
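As a rough illustration of the kind of multi-directional adapter described above, the following sketch injects information from one frozen encoder stream into another through a small bottleneck; the class and parameter names (CrossModalAdapter, bottleneck) are assumptions, not the StitchFusion code.

```python
# Loose sketch of a cross-modal adapter; assumes both streams share d_model and
# have aligned token counts for the residual injection. Not the StitchFusion code.
import torch
import torch.nn as nn


class CrossModalAdapter(nn.Module):
    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)

    def forward(self, own_tokens: torch.Tensor, other_tokens: torch.Tensor) -> torch.Tensor:
        # own_tokens, other_tokens: (B, N, d_model) hidden states of two encoder streams
        transfer = self.up(self.act(self.down(other_tokens)))
        return own_tokens + transfer  # residual keeps the frozen encoder's features intact
```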
arXiv Detail & Related papers (2024-08-02T15:41:16Z)
- Multimodal Information Interaction for Medical Image Segmentation [24.024848382458767]
We introduce an innovative Multimodal Information Cross Transformer (MicFormer).
It queries features from one modality and retrieves corresponding responses from another, facilitating effective communication between bimodal features.
Compared to other multimodal segmentation techniques, our method outperforms them by margins of 2.83 and 4.23, respectively.
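A minimal sketch of this query-and-retrieve pattern, assuming both modalities are already tokenized to a shared width d_model; the class name and head count are illustrative, not MicFormer's actual modules.

```python
# Bidirectional cross-modal querying: each modality attends to the other.
# Hypothetical sketch; d_model must be divisible by n_heads.
import torch
import torch.nn as nn


class BidirectionalCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.a_queries_b = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.b_queries_a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a: (B, Na, d_model), feat_b: (B, Nb, d_model)
        a_out, _ = self.a_queries_b(feat_a, feat_b, feat_b)  # A queries, B responds
        b_out, _ = self.b_queries_a(feat_b, feat_a, feat_a)  # B queries, A responds
        return feat_a + a_out, feat_b + b_out
```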
arXiv Detail & Related papers (2024-04-25T07:21:14Z)
- NativE: Multi-modal Knowledge Graph Completion in the Wild [51.80447197290866]
We propose NativE, a comprehensive framework for multi-modal knowledge graph completion (MMKGC) in the wild.
NativE introduces a relation-guided dual adaptive fusion module that enables adaptive fusion for arbitrary modalities.
We construct a new benchmark called WildKGC with five datasets to evaluate our method.
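The relation-guided fusion idea can be pictured as weighting each modality's entity embedding by how useful it is for the relation at hand. The sketch below is a hedged approximation under that reading; RelationGuidedFusion and its shapes are assumptions, not the NativE implementation.

```python
# Hypothetical relation-guided adaptive fusion: the relation embedding produces
# per-modality weights, so different relations emphasize different modalities.
import torch
import torch.nn as nn


class RelationGuidedFusion(nn.Module):
    def __init__(self, d_rel: int, n_modalities: int):
        super().__init__()
        self.weigher = nn.Linear(d_rel, n_modalities)

    def forward(self, rel_emb: torch.Tensor, modal_embs: torch.Tensor) -> torch.Tensor:
        # rel_emb: (B, d_rel); modal_embs: (B, M, d_ent), one embedding per modality
        weights = torch.softmax(self.weigher(rel_emb), dim=-1)  # (B, M)
        return (weights.unsqueeze(-1) * modal_embs).sum(dim=1)  # (B, d_ent)
```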
arXiv Detail & Related papers (2024-03-28T03:04:00Z)
- Multi-Modal Knowledge Graph Transformer Framework for Multi-Modal Entity Alignment [17.592908862768425]
We propose a novel MMEA transformer, called MoAlign, that hierarchically introduces neighbor features, multi-modal attributes, and entity types.
Taking advantage of the transformer's ability to integrate multiple sources of information, we design a hierarchical modifiable self-attention block in a transformer encoder.
Our approach outperforms strong competitors and achieves excellent entity alignment performance.
arXiv Detail & Related papers (2023-10-10T07:06:06Z)
- Deep Equilibrium Multimodal Fusion [88.04713412107947]
Multimodal fusion integrates the complementary information present in multiple modalities and has gained much attention recently.
We propose a novel deep equilibrium (DEQ) method towards multimodal fusion via seeking a fixed point of the dynamic multimodal fusion process.
Experiments on BRCA, MM-IMDB, CMU-MOSI, SUN RGB-D, and VQA-v2 demonstrate the superiority of our DEQ fusion.
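To illustrate what "seeking a fixed point of the fusion process" means in code, here is a naive sketch that iterates a fusion cell until the fused state stops changing. A real DEQ solver would use a root-finding method and implicit differentiation for training; all names and the simple iteration scheme here are assumptions.

```python
# Naive fixed-point fusion sketch (illustrative only). x_img and x_txt are
# pooled unimodal features of shape (B, d); z converges to the fused state.
import torch
import torch.nn as nn


class FusionCell(nn.Module):
    def __init__(self, d: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 * d, d), nn.Tanh())

    def forward(self, z, x_img, x_txt):
        return self.net(torch.cat([z, x_img, x_txt], dim=-1))


def deq_fuse(cell: FusionCell, x_img, x_txt, max_iter: int = 50, tol: float = 1e-4):
    z = torch.zeros_like(x_img)
    for _ in range(max_iter):
        z_next = cell(z, x_img, x_txt)
        if (z_next - z).norm() < tol * (z_next.norm() + 1e-8):
            return z_next
        z = z_next
    return z  # return the best estimate if not converged
```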
arXiv Detail & Related papers (2023-06-29T03:02:20Z)
- Multimodal Token Fusion for Vision Transformers [54.81107795090239]
We propose a multimodal token fusion method (TokenFusion) for transformer-based vision tasks.
To effectively fuse multiple modalities, TokenFusion dynamically detects uninformative tokens and substitutes these tokens with projected and aggregated inter-modal features.
The design of TokenFusion allows the transformer to learn correlations among multimodal features, while the single-modal transformer architecture remains largely intact.
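The substitution step can be sketched as follows: a learned score marks uninformative tokens, and those positions are overwritten with projected tokens from the other modality. This is a simplified, hedged rendering; the hard threshold, module names, and the assumption of position-aligned token sequences are mine, and a practical version would need a differentiable or regularized scoring scheme.

```python
# Simplified token-substitution sketch; not the official TokenFusion code.
# Assumes tokens_a and tokens_b are position-aligned sequences of shape (B, N, d).
import torch
import torch.nn as nn


class TokenSubstitution(nn.Module):
    def __init__(self, d_model: int, threshold: float = 0.02):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d_model, 1), nn.Sigmoid())
        self.project = nn.Linear(d_model, d_model)  # map modality B into A's space
        self.threshold = threshold

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        informativeness = self.score(tokens_a)              # (B, N, 1)
        substitute = self.project(tokens_b)                 # (B, N, d)
        mask = (informativeness < self.threshold).float()   # 1 where a token is uninformative
        return (1.0 - mask) * tokens_a + mask * substitute
```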
arXiv Detail & Related papers (2022-04-19T07:47:50Z)
- Multimodal Image Synthesis and Editing: The Generative AI Era [131.9569600472503]
Multimodal image synthesis and editing has become a hot research topic in recent years.
We comprehensively contextualize recent advances in multimodal image synthesis and editing.
We describe benchmark datasets and evaluation metrics as well as corresponding experimental results.
arXiv Detail & Related papers (2021-12-27T10:00:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.