Named Entity and Relation Extraction with Multi-Modal Retrieval
- URL: http://arxiv.org/abs/2212.01612v1
- Date: Sat, 3 Dec 2022 13:11:32 GMT
- Title: Named Entity and Relation Extraction with Multi-Modal Retrieval
- Authors: Xinyu Wang, Jiong Cai, Yong Jiang, Pengjun Xie, Kewei Tu, Wei Lu
- Abstract summary: Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
- Score: 51.660650522630526
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-modal named entity recognition (NER) and relation extraction (RE) aim
to leverage relevant image information to improve the performance of NER and
RE. Most existing efforts largely focused on directly extracting potentially
useful information from images (such as pixel-level features, identified
objects, and associated captions). However, such extraction processes may not
be knowledge aware, resulting in information that may not be highly relevant.
In this paper, we propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module,
which retrieve related knowledge of the input text and image in the knowledge
corpus respectively. Next, the retrieval results are sent to the textual and
visual models respectively for predictions. Finally, a Mixture of Experts (MoE)
module combines the predictions from the two models to make the final decision.
Our experiments show that both our textual model and visual model can achieve
state-of-the-art performance on four multi-modal NER datasets and one
multi-modal RE dataset. With MoE, the model performance can be further improved
and our analysis demonstrates the benefits of integrating both textual and
visual cues for such tasks.
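The abstract describes a concrete pipeline: a text retrieval module and an image-based retrieval module supply related knowledge to a textual and a visual model, and a Mixture of Experts (MoE) module combines the two predictions. The PyTorch sketch below illustrates one plausible shape for such a gated combination; the class, the gating over a pooled representation, and the placeholder retriever/expert interfaces are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a MoRe-style pipeline, assuming token-level label
# distributions from a textual and a visual expert and a learned gate.
import torch
import torch.nn as nn


class MixtureOfExperts(nn.Module):
    """Gate that mixes the label distributions of two experts."""

    def __init__(self, hidden_dim: int, num_experts: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts)  # scores each expert

    def forward(self, pooled, text_logits, image_logits):
        # pooled: (batch, hidden); *_logits: (batch, seq_len, num_labels)
        weights = torch.softmax(self.gate(pooled), dim=-1)           # (batch, 2)
        experts = torch.stack(
            [torch.softmax(text_logits, dim=-1),
             torch.softmax(image_logits, dim=-1)], dim=1)            # (batch, 2, seq, labels)
        # Weighted sum over the expert dimension gives the final distribution.
        return (weights[:, :, None, None] * experts).sum(dim=1)


def more_pipeline(sentence, image, text_retriever, image_retriever,
                  text_expert, visual_expert, moe):
    """Hypothetical end-to-end flow: retrieve, predict per expert, combine."""
    text_knowledge = text_retriever(sentence)   # passages related to the text
    image_knowledge = image_retriever(image)    # passages related to the image
    text_logits, pooled = text_expert(sentence, text_knowledge)
    image_logits, _ = visual_expert(sentence, image_knowledge)
    return moe(pooled, text_logits, image_logits)
```

Per the abstract, each expert makes its own prediction over the retrieval-augmented input, and the MoE module only decides, per instance, how to weight the two predictions for the final decision.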
Related papers
- RoRA-VLM: Robust Retrieval-Augmented Vision Language Models [41.09545760534495]
RORA-VLM is a novel and robust retrieval augmentation framework specifically tailored for vision-language models.
We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets.
arXiv Detail & Related papers (2024-10-11T14:51:00Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes; it extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-Modal Relation Extraction aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.
arXiv Detail & Related papers (2023-06-19T15:31:34Z)
- End-to-end Knowledge Retrieval with Multi-modal Queries [50.01264794081951]
ReMuQ requires a system to retrieve knowledge from a large corpus by integrating contents from both text and image queries.
We introduce a retriever model ReViz that can directly process input text and images to retrieve relevant knowledge in an end-to-end fashion.
We demonstrate superior performance in retrieval on two datasets under zero-shot settings; a generic sketch of this kind of joint text-image retrieval appears after this list.
arXiv Detail & Related papers (2023-06-01T08:04:12Z)
- Multimodal Relation Extraction with Cross-Modal Retrieval and Synthesis [89.04041100520881]
This research proposes to retrieve textual and visual evidence based on the object, sentence, and whole image.
We develop a novel approach to synthesize the object-level, image-level, and sentence-level information for better reasoning between the same and different modalities.
arXiv Detail & Related papers (2023-05-25T15:26:13Z)
- Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling [96.75821232222201]
Existing research on multimodal relation extraction (MRE) faces two co-existing challenges, internal-information over-utilization and external-information under-exploitation.
We propose a novel framework that simultaneously implements the idea of internal-information screening and external-information exploiting.
arXiv Detail & Related papers (2023-05-19T14:56:57Z)
- UniMS: A Unified Framework for Multimodal Summarization with Knowledge Distillation [43.15662489492694]
We propose a Unified framework for Multimodal Summarization grounding on BART, UniMS.
We adopt knowledge distillation from a vision-language pretrained model to improve image selection.
Our best model achieves a new state-of-the-art result on a large-scale benchmark dataset.
arXiv Detail & Related papers (2021-09-13T09:36:04Z)
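As referenced in the End-to-end Knowledge Retrieval entry above, a joint text-and-image query can be matched against a knowledge corpus with a simple embedding-space retriever. The sketch below is a generic illustration of that idea, assuming pre-computed, L2-normalized embeddings and additive score fusion; it is not the ReViz architecture from that paper.

```python
# Generic sketch of retrieval with a multi-modal (text + image) query.
# Assumes pre-computed, L2-normalized embeddings; the additive fusion of the
# text and image query vectors is an illustrative choice, not ReViz's design.
import numpy as np


def retrieve_multimodal(text_vec: np.ndarray, image_vec: np.ndarray,
                        corpus_vecs: np.ndarray, top_k: int = 5) -> list:
    """Return indices of the top_k corpus passages for a fused query.

    text_vec, image_vec: query embeddings of shape (d,).
    corpus_vecs: passage embeddings of shape (n, d).
    """
    query = text_vec + image_vec                 # fuse the two query modalities
    query = query / np.linalg.norm(query)        # renormalize the fused query
    scores = corpus_vecs @ query                 # cosine similarity per passage
    return np.argsort(-scores)[:top_k].tolist()  # best-scoring passage indices
```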
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.