Few-Shot Relation Extraction with Hybrid Visual Evidence
- URL: http://arxiv.org/abs/2403.00724v1
- Date: Fri, 1 Mar 2024 18:20:11 GMT
- Title: Few-Shot Relation Extraction with Hybrid Visual Evidence
- Authors: Jiaying Gong and Hoda Eldardiry
- Abstract summary: We propose a multi-modal few-shot relation extraction model (MFS-HVE)
MFS-HVE includes semantic feature extractors and multi-modal fusion components.
Experiments conducted on two public datasets demonstrate that semantic visual information significantly improves the performance of few-shot relation prediction.
- Score: 3.154631846975021
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of few-shot relation extraction is to predict relations
between named entities in a sentence when only a few labeled instances are
available for training. Existing few-shot relation extraction methods focus
on uni-modal information such as text only. This reduces performance when
there is no clear context between the named entities described in the text.
We propose a multi-modal
few-shot relation extraction model (MFS-HVE) that leverages both textual and
visual semantic information to learn a multi-modal representation jointly. The
MFS-HVE includes semantic feature extractors and multi-modal fusion components.
The MFS-HVE semantic feature extractors are developed to extract both textual
and visual features. The visual features include global image features and
local object features within the image. The MFS-HVE multi-modal fusion unit
integrates information from various modalities using image-guided attention,
object-guided attention, and hybrid feature attention to fully capture the
semantic interaction between visual regions of images and relevant texts.
Extensive experiments conducted on two public datasets demonstrate that
semantic visual information significantly improves the performance of few-shot
relation prediction.
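The abstract outlines two architectural pieces: semantic feature extractors (textual features, a global image feature, and local object features) and a fusion unit built from image-guided attention, object-guided attention, and hybrid feature attention. The PyTorch snippet below is a minimal sketch of how such a fusion unit could be wired together; the dimensions, the gating form chosen for the hybrid step, and all module names are illustrative assumptions, not the authors' released MFS-HVE implementation.

```python
# Minimal sketch of the fusion described in the abstract.
# All dimensions and module names are assumptions, not the released MFS-HVE code.
import torch
import torch.nn as nn

class HybridVisualFusion(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        # Image-guided attention: the global image feature attends over token features.
        self.img_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Object-guided attention: local object features attend over token features.
        self.obj_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # "Hybrid" step: a learned gate weighs the two visual views per example.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.out = nn.Linear(2 * dim, dim)

    def forward(self, text_tok, img_global, obj_feats):
        # text_tok:   (B, T, dim) token-level textual features
        # img_global: (B, 1, dim) global image feature
        # obj_feats:  (B, O, dim) local object features
        img_ctx, _ = self.img_attn(img_global, text_tok, text_tok)     # (B, 1, dim)
        obj_ctx, _ = self.obj_attn(obj_feats, text_tok, text_tok)      # (B, O, dim)
        obj_ctx = obj_ctx.mean(dim=1, keepdim=True)                    # pool objects -> (B, 1, dim)
        g = self.gate(torch.cat([img_ctx, obj_ctx], dim=-1))           # (B, 1, dim) mixing weights
        hybrid = g * img_ctx + (1 - g) * obj_ctx                       # blended visual context
        sent = text_tok.mean(dim=1, keepdim=True)                      # simple sentence representation
        return self.out(torch.cat([sent, hybrid], dim=-1)).squeeze(1)  # (B, dim) relation feature

fusion = HybridVisualFusion()
rel = fusion(torch.randn(2, 32, 768), torch.randn(2, 1, 768), torch.randn(2, 5, 768))
print(rel.shape)  # torch.Size([2, 768])
```

In this sketch the learned gate plays the role of the hybrid feature attention, weighing the image-guided and object-guided contexts before they are concatenated with a pooled sentence representation to form the relation feature.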
Related papers
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes.
ARMADA extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS)
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
- FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues [20.587249765287183]
Feature Swapping Multi-modal Reasoning (FSMR) model is designed to enhance multi-modal reasoning through feature swapping.
FSMR incorporates a multi-modal cross-attention mechanism, facilitating the joint modeling of textual and visual information.
Experiments on the PMR dataset demonstrate FSMR's superiority over state-of-the-art baseline models.
arXiv Detail & Related papers (2024-03-29T07:28:50Z)
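The FSMR entry above highlights a multi-modal cross-attention mechanism for jointly modeling textual and visual information. The snippet below is a generic bidirectional text-image cross-attention layer of that kind; the dimensions and the residual/normalization choices are assumptions, and the feature-swapping step itself is not reproduced here.

```python
# Generic text<->image cross-attention layer, illustrating the kind of joint
# modeling the FSMR summary describes. Not the paper's implementation.
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.txt2img = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.img2txt = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text_feats, image_feats):
        # text_feats: (B, T, dim) token features; image_feats: (B, R, dim) region features
        t, _ = self.txt2img(text_feats, image_feats, image_feats)  # text attends to image regions
        v, _ = self.img2txt(image_feats, text_feats, text_feats)   # image regions attend to text
        return self.norm_t(text_feats + t), self.norm_v(image_feats + v)

layer = CrossModalAttention()
t, v = layer(torch.randn(2, 20, 512), torch.randn(2, 36, 512))
print(t.shape, v.shape)  # torch.Size([2, 20, 512]) torch.Size([2, 36, 512])
```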
- Image Fusion via Vision-Language Model [91.36809431547128]
We introduce a novel fusion paradigm named image Fusion via vIsion-Language Model (FILM)
FILM generates semantic prompts from images and inputs them into ChatGPT for comprehensive textual descriptions.
These descriptions are fused within the textual domain and guide the visual information fusion.
FILM has shown promising results in four image fusion tasks: infrared-visible, medical, multi-exposure, and multi-focus image fusion.
arXiv Detail & Related papers (2024-02-03T18:36:39Z)
- A Dual-way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking [17.847936914174543]
Multimodal Entity Linking (MEL) aims at linking ambiguous mentions, together with their multimodal information, to entities in a Knowledge Graph (KG) such as Wikipedia.
We formulate multimodal entity linking as a neural text matching problem in which the multimodal information (text and image) of each mention is treated as a query.
This paper introduces a dual-way enhanced (DWE) framework for MEL.
arXiv Detail & Related papers (2023-12-19T03:15:50Z)
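The DWE entry above frames multimodal entity linking as neural text matching, with the mention and its visual context acting as a query against candidate KG entities. Below is a minimal sketch of that formulation using an off-the-shelf sentence encoder; the encoder choice, the way the visual context is verbalized, and the toy candidate set are illustrative assumptions rather than the paper's dual-way enhanced model.

```python
# Sketch of multimodal entity linking as text matching: the mention plus a
# verbalized image description form a query, and candidate entities are ranked
# by embedding similarity. Not the DWE model itself.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def link_entity(mention, visual_context, candidates):
    # candidates: {entity_name: entity_description}
    query_emb = encoder.encode(f"{mention}. {visual_context}", convert_to_tensor=True)
    names = list(candidates)
    cand_embs = encoder.encode([candidates[n] for n in names], convert_to_tensor=True)
    scores = util.cos_sim(query_emb, cand_embs)[0]   # (num_candidates,)
    return names[int(scores.argmax())]

print(link_entity(
    "Jordan", "a basketball player dunking",
    {"Michael Jordan": "American former professional basketball player.",
     "Jordan": "Country in Western Asia on the east bank of the Jordan River."},
))  # the basketball context should favor "Michael Jordan"
```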
- Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking task aims at resolving ambiguous mentions to a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z)
- Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-Modal Relation Extraction aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in the few-shot situation.
arXiv Detail & Related papers (2023-06-19T15:31:34Z)
- CISum: Learning Cross-modality Interaction to Enhance Multimodal Semantic Coverage for Multimodal Summarization [2.461695698601437]
This paper proposes a multi-task cross-modality learning framework (CISum) to improve multimodal semantic coverage.
To obtain the visual semantics, we translate images into visual descriptions based on the correlation with text content.
Then, the visual description and text content are fused to generate the textual summary to capture the semantics of the multimodal content.
arXiv Detail & Related papers (2023-02-20T11:57:23Z)
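The CISum entry above describes a two-stage idea: translate the image into a visual description, then fuse it with the text before generating the summary. The sketch below approximates that flow with generic Hugging Face pipelines; the specific models and the simple string-level fusion are assumptions standing in for the paper's learned cross-modality interaction.

```python
# Two-stage approximation of the CISum idea: caption the image, fuse the caption
# with the article text in the textual domain, then summarize. Models are
# illustrative choices, not the paper's components.
from transformers import pipeline

captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def multimodal_summary(image_path, text):
    caption = captioner(image_path)[0]["generated_text"]   # visual description of the image
    fused = f"{text} The image shows {caption}."            # naive textual-domain fusion
    return summarizer(fused, max_length=60, min_length=15)[0]["summary_text"]

# multimodal_summary("news_photo.jpg", article_text)  # hypothetical inputs
```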
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe)
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge for the input text and image from the knowledge corpus, respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
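The MoRe entry above is retrieval-based: related knowledge is fetched for the input and supplied to the NER/RE model. The sketch below illustrates only the text-retrieval half with a TF-IDF retriever over a toy corpus; the retriever, the corpus, and the [SEP]-style concatenation are illustrative assumptions, and the image-based retrieval module is omitted.

```python
# Retrieval-augmented input construction in the spirit of the MoRe summary.
# TF-IDF retriever and toy corpus are placeholders, not MoRe's retrieval modules.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_corpus = [
    "Barack Obama served as the 44th president of the United States.",
    "The Eiffel Tower is a wrought-iron lattice tower in Paris, France.",
    "Apple Inc. is a technology company headquartered in Cupertino, California.",
]

vectorizer = TfidfVectorizer().fit(knowledge_corpus)
corpus_vecs = vectorizer.transform(knowledge_corpus)

def retrieve_knowledge(text, top_k=1):
    # Rank corpus entries by TF-IDF cosine similarity to the input text.
    sims = cosine_similarity(vectorizer.transform([text]), corpus_vecs)[0]
    return [knowledge_corpus[i] for i in sims.argsort()[::-1][:top_k]]

query = "Obama met reporters in Paris."
augmented_input = " ".join(retrieve_knowledge(query)) + " [SEP] " + query
print(augmented_input)  # this augmented text would then be fed to the NER/RE model
```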
- Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph [96.95815946327079]
It is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities.
We propose a novel approach that constructs a multi-modal knowledge graph to associate the visual objects with named entities.
arXiv Detail & Related papers (2021-07-26T05:50:41Z)
- MHSAN: Multi-Head Self-Attention Network for Visual Semantic Embedding [6.4901484665257545]
We propose a novel multi-head self-attention network that captures various components of visual and textual data by attending to important parts of the data.
Our approach achieves new state-of-the-art results in image-text retrieval tasks on the MS-COCO and Flickr30K datasets.
arXiv Detail & Related papers (2020-01-11T05:50:19Z)
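The MHSAN entry above targets visual-semantic embedding for image-text retrieval. As a point of reference, the sketch below shows a generic joint-embedding setup trained with a hinge-based triplet ranking loss over in-batch negatives, a common objective for this task; the projection heads, feature dimensions, and margin are assumptions, and the multi-head self-attention pooling itself is not reproduced.

```python
# Generic visual-semantic embedding objective: project image and text features
# into a shared space and apply a triplet ranking loss over in-batch negatives.
# Illustrative setup only, not the MHSAN architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, emb_dim=512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb_dim)
        self.txt_proj = nn.Linear(txt_dim, emb_dim)

    def forward(self, img_feats, txt_feats):
        return (F.normalize(self.img_proj(img_feats), dim=-1),
                F.normalize(self.txt_proj(txt_feats), dim=-1))

def triplet_ranking_loss(img_emb, txt_emb, margin=0.2):
    sims = img_emb @ txt_emb.t()                        # (B, B) image-text similarities
    pos = sims.diag().unsqueeze(1)                      # matching pairs lie on the diagonal
    cost_img = (margin + sims - pos).clamp(min=0)       # image -> negative captions
    cost_txt = (margin + sims - pos.t()).clamp(min=0)   # caption -> negative images
    mask = torch.eye(sims.size(0), dtype=torch.bool)
    return cost_img[~mask].mean() + cost_txt[~mask].mean()

model = JointEmbedding()
img, txt = model(torch.randn(8, 2048), torch.randn(8, 768))
print(triplet_ranking_loss(img, txt))
```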