Related papers: RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER

URL: http://arxiv.org/abs/2102.02967v1
Date: Fri, 5 Feb 2021 02:45:30 GMT
Title: RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER
Authors: Lin Sun, Jiquan Wang, Kai Zhang, Yindu Su, and Fangsheng Weng
Abstract summary: Multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets. We introduce a method of text-image relation propagation into the multimodal BERT model. We propose a multitask algorithm to train on the MNER datasets.
Score: 4.510210055307459
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Recently multimodal named entity recognition (MNER) has utilized images to improve the accuracy of NER in tweets. However, most of the multimodal methods use attention mechanisms to extract visual clues regardless of whether the text and image are relevant. Practically, the irrelevant text-image pairs account for a large proportion in tweets. The visual clues that are unrelated to the texts will exert uncertain or even negative effects on multimodal model learning. In this paper, we introduce a method of text-image relation propagation into the multimodal BERT model. We integrate soft or hard gates to select visual clues and propose a multitask algorithm to train on the MNER datasets. In the experiments, we deeply analyze the changes in visual attention before and after the use of text-image relation propagation. Our model achieves state-of-the-art performance on the MNER datasets.
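Illustration: the soft and hard gates mentioned in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation. It assumes a hypothetical scalar `relevance` score in [0, 1] for a text-image pair, a flat list of visual feature values, and a hypothetical 0.5 threshold for the hard gate; the function names are invented for illustration.

```python
from typing import List


def soft_gate(visual: List[float], relevance: float) -> List[float]:
    """Scale every visual feature by the relevance score (assumed in [0, 1])."""
    return [relevance * v for v in visual]


def hard_gate(visual: List[float], relevance: float, threshold: float = 0.5) -> List[float]:
    """Pass visual features unchanged if relevance exceeds the threshold, else zero them.

    The 0.5 default threshold is an assumption made for this sketch.
    """
    keep = relevance > threshold
    return [v if keep else 0.0 for v in visual]


if __name__ == "__main__":
    features = [2.0, -1.0, 0.5]          # toy visual features (made up)
    print(soft_gate(features, 0.5))      # attenuated visual clues
    print(hard_gate(features, 0.25))     # pair judged irrelevant: clues dropped
    print(hard_gate(features, 0.9))      # pair judged relevant: clues kept
```

Under these assumptions, the soft gate attenuates visual clues in proportion to how related the image is to the text, while the hard gate makes an all-or-nothing choice per pair.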

Related papers

Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences. We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries. We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes. It is a multimodal data generation framework that extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation. The work also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
TRINS: Towards Multimodal Language Models that Can Read [61.17806538631744]
TRINS is a Text-Rich image INStruction dataset. It contains 39,153 text-rich images, captions, and 102,437 questions. We introduce a Language-vision Reading Assistant (LaRA) which is good at understanding textual content within images.
arXiv Detail & Related papers (2024-06-10T18:52:37Z)
MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching. We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data. Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
arXiv Detail & Related papers (2024-02-27T06:11:54Z)
Learning Comprehensive Representations with Richer Self for Text-to-Image Person Re-Identification [34.289949134802086]
Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text. Existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view. We propose a framework, called LCR²S, for modeling many-to-many correspondences of the same identity by learning representations for both modalities from a novel perspective.
arXiv Detail & Related papers (2023-10-17T12:39:16Z)
Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework. We generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image. We identify the relevant regions to each word by computing the word-conditional visual attention using multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE. We propose a novel Multi-modal Retrieval based framework (MoRe). MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition [38.08486689940946]
Multi-modal Named Entity Recognition (MNER) has attracted a lot of attention. It is difficult to model such interactions as image and text representations are trained separately on the data of their respective modality. In this paper, we propose Image-text Alignments (ITA) to align image features into the textual space.
arXiv Detail & Related papers (2021-12-13T08:29:43Z)
FiLMing Multimodal Sarcasm Detection with Attention [0.7340017786387767]
Sarcasm detection identifies natural language expressions whose intended meaning is different from what is implied by their surface meaning. We propose a novel architecture that uses the RoBERTa model with a co-attention layer on top to incorporate context incongruity between input text and image attributes. Our results demonstrate that our proposed model outperforms the existing state-of-the-art method by 6.14% F1 score on the public Twitter multimodal detection dataset.
arXiv Detail & Related papers (2021-08-09T06:33:29Z)
Can images help recognize entities? A study of the role of images for Multimodal NER [20.574849371747685]
Multimodal named entity recognition (MNER) requires bridging the gap between language understanding and visual context. While many multimodal neural techniques have been proposed to incorporate images into the MNER task, the model's ability to leverage multimodal interactions remains poorly understood.
arXiv Detail & Related papers (2020-10-23T23:41:51Z)
Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects. The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image. We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z)
A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation [131.33610549540043]
We propose a novel graph-based multi-modal fusion encoder for NMT. We first represent the input sentence and image using a unified multi-modal graph. We then stack multiple graph-based multi-modal fusion layers that iteratively perform semantic interactions to learn node representations.
arXiv Detail & Related papers (2020-07-17T04:06:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.