RpBERT: A Text-image Relation Propagation-based BERT Model for
Multimodal NER
- URL: http://arxiv.org/abs/2102.02967v1
- Date: Fri, 5 Feb 2021 02:45:30 GMT
- Title: RpBERT: A Text-image Relation Propagation-based BERT Model for
Multimodal NER
- Authors: Lin Sun, Jiquan Wang, Kai Zhang, Yindu Su, and Fangsheng Weng
- Abstract summary: Multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets.
We introduce a method of text-image relation propagation into the multimodal BERT model.
We propose a multitask algorithm to train on the MNER datasets.
- Score: 4.510210055307459
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recently, multimodal named entity recognition (MNER) has utilized images to
improve the accuracy of NER in tweets. However, most of the multimodal methods
use attention mechanisms to extract visual clues regardless of whether the text
and image are relevant. Practically, the irrelevant text-image pairs account
for a large proportion in tweets. The visual clues that are unrelated to the
texts will exert uncertain or even negative effects on multimodal model
learning. In this paper, we introduce a method of text-image relation
propagation into the multimodal BERT model. We integrate soft or hard gates to
select visual clues and propose a multitask algorithm to train on the MNER
datasets. In the experiments, we deeply analyze the changes in visual attention
before and after the use of text-image relation propagation. Our model achieves
state-of-the-art performance on the MNER datasets.
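The abstract describes selecting visual clues with soft or hard gates driven by a predicted text-image relation. As a minimal illustrative sketch (not the paper's implementation — function names, the list-of-floats feature representation, and the 0.5 threshold are all assumptions), the two gate styles can be contrasted like this: a soft gate scales visual features by the relation score, while a hard gate zeroes them out entirely when the pair is judged irrelevant.

```python
def soft_gate(visual_feats, relation_score):
    # Soft gating: scale every visual feature by the predicted
    # text-image relation score (a value in [0, 1]).
    return [relation_score * v for v in visual_feats]

def hard_gate(visual_feats, relation_score, threshold=0.5):
    # Hard gating: keep the visual features unchanged only when the
    # relation score clears the threshold; otherwise suppress them.
    keep = 1.0 if relation_score >= threshold else 0.0
    return [keep * v for v in visual_feats]
```

Under this sketch, an irrelevant image (low relation score) contributes attenuated or zero visual features to the downstream model, which is the intuition behind avoiding the "uncertain or even negative effects" the abstract mentions.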
Related papers
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
The framework extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z) - TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z) - Learning Comprehensive Representations with Richer Self for
Text-to-Image Person Re-Identification [34.289949134802086]
Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text.
Existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view.
We propose a framework, called LCR²S, for modeling many-to-many correspondences of the same identity by learning representations for both modalities from a novel perspective.
arXiv Detail & Related papers (2023-10-17T12:39:16Z) - Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
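The word-conditional visual attention mentioned here can be sketched in a minimal, illustrative form (names and the dot-product scoring are assumptions, not the paper's architecture): each image region is scored against a word embedding, and a softmax turns the scores into attention weights over regions.

```python
import math

def word_conditional_attention(word_vec, region_vecs):
    # Score each image region by its dot product with the word
    # embedding, then normalize with a softmax so the weights
    # over regions sum to 1.
    scores = [sum(w * r for w, r in zip(word_vec, vec)) for vec in region_vecs]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

A region whose features align with the word receives a higher weight, which is what lets the model localize the visual evidence for each token.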
arXiv Detail & Related papers (2023-04-03T05:07:49Z) - Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe)
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z) - ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition [38.08486689940946]
Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention.
It is difficult to model such interactions as image and text representations are trained separately on the data of their respective modality.
In this paper, we propose Image-text Alignments (ITA) to align image features into the textual space.
arXiv Detail & Related papers (2021-12-13T08:29:43Z) - FiLMing Multimodal Sarcasm Detection with Attention [0.7340017786387767]
Sarcasm detection identifies natural language expressions whose intended meaning differs from their surface meaning.
We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes.
Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal detection dataset.
arXiv Detail & Related papers (2021-08-09T06:33:29Z) - Can images help recognize entities? A study of the role of images for
Multimodal NER [20.574849371747685]
Multimodal named entity recognition (MNER) requires bridging the gap between language understanding and visual context.
While many multimodal neural techniques have been proposed to incorporate images into the MNER task, the model's ability to leverage multimodal interactions remains poorly understood.
arXiv Detail & Related papers (2020-10-23T23:41:51Z) - Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z) - A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine
Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT.
We first represent the input sentence and image using a unified multi-modal graph.
We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.