RpBERT: A Text-image Relation Propagation-based BERT Model for
Multimodal NER
- URL: http://arxiv.org/abs/2102.02967v1
- Date: Fri, 5 Feb 2021 02:45:30 GMT
- Authors: Lin Sun, Jiquan Wang, Kai Zhang, Yindu Su, and Fangsheng Weng
- Abstract summary: Multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets.
We introduce a method of text-image relation propagation into the multimodal BERT model.
We propose a multitask algorithm to train on the MNER datasets.
- Score: 4.510210055307459
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recently multimodal named entity recognition (MNER) has utilized images to
improve the accuracy of NER in tweets. However, most of the multimodal methods
use attention mechanisms to extract visual clues regardless of whether the text
and image are relevant. Practically, the irrelevant text-image pairs account
for a large proportion in tweets. The visual clues that are unrelated to the
texts will exert uncertain or even negative effects on multimodal model
learning. In this paper, we introduce a method of text-image relation
propagation into the multimodal BERT model. We integrate soft or hard gates to
select visual clues and propose a multitask algorithm to train on the MNER
datasets. In the experiments, we deeply analyze the changes in visual attention
before and after the use of text-image relation propagation. Our model achieves
state-of-the-art performance on the MNER datasets.
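
The abstract mentions soft or hard gates for selecting visual clues but gives no formulas. The following is a minimal sketch of the general idea only, not the authors' implementation. It assumes a single text-image relation logit, a sigmoid to turn it into a score, and a fixed 0.5 threshold for the hard gate. The names `soft_gate`, `hard_gate`, `relation_logit`, and `threshold` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Map a real-valued logit to a score in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def soft_gate(relation_logit: float, visual_features: list[list[float]]) -> list[list[float]]:
    """Scale every visual feature by the relation score (assumed formulation)."""
    g = sigmoid(relation_logit)
    return [[g * v for v in region] for region in visual_features]


def hard_gate(
    relation_logit: float,
    visual_features: list[list[float]],
    threshold: float = 0.5,  # assumed cut-off, not taken from the paper
) -> list[list[float]]:
    """Keep visual features unchanged or zero them out, depending on the score."""
    g = 1.0 if sigmoid(relation_logit) >= threshold else 0.0
    return [[g * v for v in region] for region in visual_features]


# Two image regions with two features each (toy values).
regions = [[1.0, 2.0], [3.0, 4.0]]
related = hard_gate(2.0, regions)     # score above threshold: features pass through
unrelated = hard_gate(-2.0, regions)  # score below threshold: features are zeroed
```

Under these assumptions, a soft gate attenuates visual input in proportion to how related the image appears to be, while a hard gate makes an all-or-nothing choice.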
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes.
ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset.
It contains 39,153 text-rich images, captions, and 102,437 questions.
We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
- MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching.
We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.
Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
- Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification [34.289949134802086]
Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text.
Existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view.
We propose a framework, called LCR$^2$S, for modeling many-to-many correspondences of the same identity by learning representations for both modalities from a novel perspective.
arXiv Detail & Related papers (2023-10-17T12:39:16Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image.
We identify the relevant regions to each word by computing the word-conditional visual attention using multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition [38.08486689940946]
Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention.
It is difficult to model such interactions as image and text representations are trained separately on the data of their respective modality.
In this paper, we propose Image-text Alignments (ITA) to align image features into the textual space.
arXiv Detail & Related papers (2021-12-13T08:29:43Z)
- FiLMing Multimodal Sarcasm Detection with Attention [0.7340017786387767]
Sarcasm detection identifies natural language expressions whose intended meaning is different from what is implied by its surface meaning.
We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes.
Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal detection dataset.
arXiv Detail & Related papers (2021-08-09T06:33:29Z)
- Can images help recognize entities? A study of the role of images for Multimodal NER [20.574849371747685]
Multimodal named entity recognition (MNER) requires bridging the gap between language understanding and visual context.
While many multimodal neural techniques have been proposed to incorporate images into the MNER task, the model's ability to leverage multimodal interactions remains poorly understood.
arXiv Detail & Related papers (2020-10-23T23:41:51Z)