Learning Implicit Entity-object Relations by Bidirectional Generative
Alignment for Multimodal NER
- URL: http://arxiv.org/abs/2308.02570v1
- Date: Thu, 3 Aug 2023 10:37:20 GMT
- Title: Learning Implicit Entity-object Relations by Bidirectional Generative
Alignment for Multimodal NER
- Authors: Feng Chen, Jiajia Liu, Kaixiang Ji, Wang Ren, Jian Wang, Jingdong Wang
- Abstract summary: We propose a bidirectional generative alignment method named BGA-MNER to tackle these issues.
Our method achieves state-of-the-art performance without image input during inference.
- Score: 43.425998295991135
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The challenge posed by multimodal named entity recognition (MNER) is mainly
two-fold: (1) bridging the semantic gap between text and image and (2) matching
the entity with its associated object in the image. Existing methods fail to
capture the implicit entity-object relations, due to the lack of corresponding
annotation. In this paper, we propose a bidirectional generative alignment
method named BGA-MNER to tackle these issues. Our BGA-MNER consists of
image2text and text2image generation with respect to
entity-salient content in the two modalities. It jointly optimizes the
bidirectional reconstruction objectives, aligning the implicit
entity-object relations under these direct and powerful constraints.
Furthermore, image-text pairs usually contain unmatched components which are
noisy for generation. A stage-refined context sampler is proposed to extract
the matched cross-modal content for generation. Extensive experiments on two
benchmarks demonstrate that our method achieves state-of-the-art performance
without image input during inference.
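The joint objective described above can be sketched as the sum of two reconstruction losses, one per generation direction. This is a minimal illustrative sketch only: the generator functions, the use of mean-squared error, and all names below are assumptions, not the paper's actual architecture.

```python
import numpy as np

def mse(pred, target):
    """Mean-squared reconstruction error between two embedding matrices."""
    return float(np.mean((pred - target) ** 2))

def bidirectional_alignment_loss(text_emb, image_emb, g_img2text, g_text2img):
    """Joint bidirectional reconstruction objective (illustrative sketch).

    g_img2text / g_text2img are stand-in generators that map one
    modality's embeddings into the other modality's space.
    """
    loss_i2t = mse(g_img2text(image_emb), text_emb)   # image2text branch
    loss_t2i = mse(g_text2img(text_emb), image_emb)   # text2image branch
    return loss_i2t + loss_t2i                        # optimized jointly

# Toy usage: identity "generators" on already-matched embeddings give zero loss.
rng = np.random.default_rng(0)
t = rng.standard_normal((4, 8))
v = t.copy()  # pretend the two modalities are perfectly aligned
identity = lambda x: x
print(bidirectional_alignment_loss(t, v, identity, identity))  # 0.0
```

In the actual method the generators are trained, so minimizing this joint loss pushes the two modalities' entity-salient content toward a shared representation.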
Related papers
- Token Merging for Training-Free Semantic Binding in Text-to-Image Synthesis [98.21700880115938]
Text-to-image (T2I) models often fail to accurately bind semantically related objects or attributes in the input prompts.
We introduce a novel method called Token Merging (ToMe), which enhances semantic binding by aggregating relevant tokens into a single composite token.
arXiv Detail & Related papers (2024-11-11T17:05:15Z)
- A Dual-way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking [17.847936914174543]
Multimodal Entity Linking (MEL) aims at linking ambiguous mentions, with the help of multimodal information, to entities in a Knowledge Graph (KG) such as Wikipedia.
We formulate multimodal entity linking as a neural text matching problem where each multimodal information (text and image) is treated as a query.
This paper introduces a dual-way enhanced (DWE) framework for MEL.
arXiv Detail & Related papers (2023-12-19T03:15:50Z)
- DRIN: Dynamic Relation Interactive Network for Multimodal Entity Linking [31.15972952813689]
We propose a novel framework called Dynamic Relation Interactive Network (DRIN) for MEL tasks.
DRIN explicitly models four different types of alignment between a mention and entity and builds a dynamic Graph Convolutional Network (GCN) to dynamically select the corresponding alignment relations for different input samples.
Experiments on two datasets show that DRIN outperforms state-of-the-art methods by a large margin, demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2023-10-09T10:21:42Z)
- Dual-Gated Fusion with Prefix-Tuning for Multi-Modal Relation Extraction [13.454953507205278]
Multi-Modal Relation Extraction (MMRE) aims at identifying the relation between two entities in texts that contain visual clues.
We propose a novel MMRE framework to better capture the deeper correlations of text, entity pair, and image/objects.
Our approach achieves excellent performance compared to strong competitors, even in few-shot settings.
arXiv Detail & Related papers (2023-06-19T15:31:34Z)
- Collaborative Group: Composed Image Retrieval via Consensus Learning from Noisy Annotations [67.92679668612858]
We propose the Consensus Network (Css-Net), inspired by the psychological concept that groups outperform individuals.
Css-Net comprises two core components: (1) a consensus module with four diverse compositors, each generating distinct image-text embeddings; and (2) a Kullback-Leibler divergence loss that encourages learning of inter-compositor interactions.
On benchmark datasets, particularly FashionIQ, Css-Net demonstrates marked improvements, with a 2.77% increase in R@10 and a 6.67% boost in R@50, underscoring its effectiveness.
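A consensus loss of this kind can be sketched as an average pairwise symmetric KL divergence over the compositors' output distributions. This is a hedged illustration, not Css-Net's actual formulation: the pairwise-symmetric form, the softmax inputs, and all function names are assumptions.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl(p, q, eps=1e-12):
    """KL(p || q) summed over rows of probability distributions."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def consensus_kl_loss(logits_list):
    """Average pairwise symmetric KL across compositor outputs (illustrative).

    Encourages the distinct compositors to agree with each other while
    still producing diverse embeddings.
    """
    probs = [softmax(z) for z in logits_list]
    total, pairs = 0.0, 0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            total += kl(probs[i], probs[j]) + kl(probs[j], probs[i])
            pairs += 1
    return total / pairs

# Identical compositors are already in consensus: the loss is exactly 0.
z = np.zeros((2, 5))
print(consensus_kl_loss([z, z, z, z]))  # 0.0
```

Minimizing such a term pulls the four compositors' predictions toward one another, which is the "groups outperform individuals" intuition the paper cites.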
arXiv Detail & Related papers (2023-06-03T11:50:44Z)
- From Alignment to Entailment: A Unified Textual Entailment Framework for Entity Alignment [17.70562397382911]
Existing methods usually encode the triples of entities as embeddings and learn to align the embeddings.
We transform both triples into unified textual sequences, and model the entity alignment (EA) task as a bi-directional textual entailment task.
Our approach captures the unified correlation pattern of two kinds of information between entities, and explicitly models the fine-grained interaction between the original entity information.
arXiv Detail & Related papers (2023-05-19T08:06:50Z)
- Joint Multimodal Entity-Relation Extraction Based on Edge-enhanced Graph Alignment Network and Word-pair Relation Tagging [19.872199943795195]
This paper is the first to propose jointly performing MNER and MRE as a joint multimodal entity-relation extraction task.
The proposed method can leverage edge information to aid the alignment between objects and entities.
arXiv Detail & Related papers (2022-11-28T03:23:54Z)
- ReSTR: Convolution-free Referring Image Segmentation Using Transformers [80.9672131755143]
We present the first convolution-free model for referring image segmentation using transformers, dubbed ReSTR.
Since it extracts features of both modalities through transformer encoders, ReSTR can capture long-range dependencies between entities within each modality.
Also, ReSTR fuses features of the two modalities by a self-attention encoder, which enables flexible and adaptive interactions between the two modalities in the fusion process.
arXiv Detail & Related papers (2022-03-31T02:55:39Z)
- Context-Aware Layout to Image Generation with Enhanced Object Appearance [123.62597976732948]
A layout-to-image (L2I) generation model aims to generate a complex image containing multiple objects (things) against a natural background (stuff).
Existing L2I models have made great progress, but object-to-object and object-to-stuff relations are often broken.
We argue that these are caused by the lack of context-aware object and stuff feature encoding in their generators, and location-sensitive appearance representation in their discriminators.
arXiv Detail & Related papers (2021-03-22T14:43:25Z)
- Expressing Objects just like Words: Recurrent Visual Embedding for Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically by recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.