Multi-Grained Multimodal Interaction Network for Entity Linking
- URL: http://arxiv.org/abs/2307.09721v1
- Date: Wed, 19 Jul 2023 02:11:19 GMT
- Title: Multi-Grained Multimodal Interaction Network for Entity Linking
- Authors: Pengfei Luo, Tong Xu, Shiwei Wu, Chen Zhu, Linli Xu, Enhong Chen
- Abstract summary: The multimodal entity linking (MEL) task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network $\textbf{(MIMIC)}$ framework for solving the MEL task.
- Score: 65.30260033700338
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Multimodal entity linking (MEL) task, which aims at resolving ambiguous
mentions to a multimodal knowledge graph, has attracted wide attention in
recent years. Though considerable efforts have been made to explore the complementary
effect among multiple modalities, existing methods may fail to fully absorb the
comprehensive expression of abbreviated textual context and implicit visual
indication. Even worse, inevitably noisy data may cause inconsistency across
modalities during the learning process, which severely degrades performance.
To address these issues, in this paper, we propose a novel
Multi-GraIned Multimodal InteraCtion Network $\textbf{(MIMIC)}$ framework for
solving the MEL task. Specifically, the unified inputs of mentions and entities
are first encoded by textual/visual encoders separately, to extract global
descriptive features and local detailed features. Then, to derive the
similarity matching score for each mention-entity pair, we devise three
interaction units to comprehensively explore the intra-modal interaction and
inter-modal fusion among features of entities and mentions. In particular,
three modules, namely the Text-based Global-Local interaction Unit (TGLU),
Vision-based DuaL interaction Unit (VDLU) and Cross-Modal Fusion-based
interaction Unit (CMFU) are designed to capture and integrate the fine-grained
representation lying in abbreviated text and implicit visual cues. Afterwards,
we introduce a unit-consistency objective function via contrastive learning to
avoid inconsistency and model degradation. Experimental results on three public
benchmark datasets demonstrate that our solution outperforms various
state-of-the-art baselines, and ablation studies verify the effectiveness of
designed modules.
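The pipeline described in the abstract (separate textual/visual encoders yielding global and local features, three interaction units producing per-pair matching scores, and a contrastively trained unit-consistency objective) can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under our own assumptions: the shared cross-attention skeleton, layer sizes, cosine-similarity scoring, and InfoNCE-style consistency term are hypothetical stand-ins, not the authors' released implementation; only the unit names (TGLU, VDLU, CMFU) come from the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InteractionUnit(nn.Module):
    """Generic cross-attention interaction between mention and entity features.

    A shared skeleton standing in for TGLU/VDLU/CMFU; the real units differ
    internally, this is only an illustrative placeholder.
    """

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, m_feats: torch.Tensor, e_feats: torch.Tensor):
        # Mention tokens attend over entity tokens and vice versa,
        # then each side is mean-pooled to a single vector.
        m_fused, _ = self.attn(m_feats, e_feats, e_feats)
        e_fused, _ = self.attn(e_feats, m_feats, m_feats)
        return self.proj(m_fused.mean(dim=1)), self.proj(e_fused.mean(dim=1))


class MIMICSketch(nn.Module):
    """Three interaction units over textual/visual features of mentions and entities."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.tglu = InteractionUnit(dim)  # text-based global-local interaction
        self.vdlu = InteractionUnit(dim)  # vision-based dual interaction
        self.cmfu = InteractionUnit(dim)  # cross-modal fusion-based interaction

    def forward(self, m_text, e_text, m_img, e_img):
        # Inputs: (batch, seq_len, dim) local features from textual/visual
        # encoders (not shown), assumed projected to a common dimension.
        out_t = self.tglu(m_text, e_text)
        out_v = self.vdlu(m_img, e_img)
        out_c = self.cmfu(torch.cat([m_text, m_img], dim=1),
                          torch.cat([e_text, e_img], dim=1))
        return [out_t, out_v, out_c]  # one (mention_vec, entity_vec) pair per unit


def info_nce(m_vec, e_vec, temperature: float = 0.07):
    """In-batch contrastive loss: mention i's positive entity is entity i."""
    logits = F.normalize(m_vec, dim=-1) @ F.normalize(e_vec, dim=-1).T
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits / temperature, targets)


def training_step(model, m_text, e_text, m_img, e_img):
    unit_outputs = model(m_text, e_text, m_img, e_img)
    # Apply the same contrastive objective to every unit so that all three
    # rank the in-batch positives consistently (an assumed reading of the
    # unit-consistency objective).
    loss = sum(info_nce(m, e) for m, e in unit_outputs)
    # Mention-entity matching score: sum of the units' cosine similarities.
    scores = sum(F.cosine_similarity(m, e, dim=-1) for m, e in unit_outputs)
    return loss, scores
```

At inference time, candidate entities for a mention would be ranked by the summed per-unit scores.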
Related papers
- Triple Modality Fusion: Aligning Visual, Textual, and Graph Data with Large Language Models for Multi-Behavior Recommendations [12.154043062308201]
This paper introduces a novel framework for multi-behavior recommendations, leveraging the fusion of three modalities: visual, textual, and graph data.
Our proposed model, called Triple Modality Fusion (TMF), utilizes the power of large language models (LLMs) to align and integrate these three modalities.
Extensive experiments demonstrate the effectiveness of our approach in improving recommendation accuracy.
arXiv Detail & Related papers (2024-10-16T04:44:15Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- LoginMEA: Local-to-Global Interaction Network for Multi-modal Entity Alignment [18.365849722239865]
Multi-modal entity alignment (MMEA) aims to identify equivalent entities between two multi-modal knowledge graphs.
We propose a novel local-to-global interaction network for MMEA, termed LoginMEA.
arXiv Detail & Related papers (2024-07-29T01:06:45Z)
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates \textit{multi-view} encoding, \textit{multi-view} matching, and \textit{multi-view} fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
- Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-modal relation extraction (MMRE) aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.
arXiv Detail & Related papers (2023-06-19T15:31:34Z)
- Global-and-Local Collaborative Learning for Co-Salient Object Detection [162.62642867056385]
The goal of co-salient object detection (CoSOD) is to discover salient objects that commonly appear in a query group containing two or more relevant images.
We propose a global-and-local collaborative learning architecture, which includes a global correspondence modeling (GCM) module and a local correspondence modeling (LCM) module.
The proposed GLNet is evaluated on three prevailing CoSOD benchmark datasets, demonstrating that our model trained on a small dataset (about 3k images) still outperforms eleven state-of-the-art competitors trained on some large datasets (about 8k-200k images).
arXiv Detail & Related papers (2022-04-19T14:32:41Z)
- Bi-Bimodal Modality Fusion for Correlation-Controlled Multimodal Sentiment Analysis [96.46952672172021]
The Bi-Bimodal Fusion Network (BBFN) is a novel end-to-end network that performs fusion on pairwise modality representations.
The model takes two bimodal pairs as input due to the known information imbalance among modalities.
arXiv Detail & Related papers (2021-07-28T23:33:42Z)
- Referring Image Segmentation via Cross-Modal Progressive Comprehension [94.70482302324704]
Referring image segmentation aims at segmenting the foreground masks of the entities that match the description given in the natural language expression.
Previous approaches tackle this problem using implicit feature interaction and fusion between visual and linguistic modalities.
We propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task.
arXiv Detail & Related papers (2020-10-01T16:02:30Z)