Multi-Granularity Cross-Modality Representation Learning for Named
Entity Recognition on Social Media
- URL: http://arxiv.org/abs/2210.14163v1
- Date: Wed, 19 Oct 2022 15:14:55 GMT
- Title: Multi-Granularity Cross-Modality Representation Learning for Named
Entity Recognition on Social Media
- Authors: Peipei Liu, Gaosheng Wang, Hong Li, Jie Liu, Yimo Ren, Hongsong Zhu,
Limin Sun
- Abstract summary: Named Entity Recognition (NER) on social media refers to discovering and classifying entities from unstructured free-form content.
This work introduces multi-granularity cross-modality representation learning.
Experiments show that the proposed approach achieves SOTA or near-SOTA performance on two benchmark tweet datasets.
- Score: 11.235498285650142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Named Entity Recognition (NER) on social media refers to discovering and
classifying entities from unstructured free-form content, and it plays an
important role for various applications such as intention understanding and
user recommendation. With social media posts tending to be multimodal,
Multimodal Named Entity Recognition (MNER) for the text with its accompanying
image is attracting more and more attention since some textual components can
only be understood in combination with visual information. However, there are
two drawbacks in existing approaches: 1) Meanings of the text and its
accompanying image do not match always, so the text information still plays a
major role. However, social media posts are usually shorter and more informal
compared with other normal contents, which easily causes incomplete semantic
description and the data sparsity problem. 2) Although the visual
representations of whole images or objects are already used, existing methods
ignore either fine-grained semantic correspondence between objects in images
and words in text or the objective fact that there are misleading objects or no
objects in some images. In this work, we solve the above two problems by
introducing the multi-granularity cross-modality representation learning. To
resolve the first problem, we enhance the representation by semantic
augmentation for each word in text. As for the second issue, we perform the
cross-modality semantic interaction between text and vision at the different
vision granularity to get the most effective multimodal guidance representation
for every word. Experiments show that our proposed approach can achieve the
SOTA or approximate SOTA performance on two benchmark datasets of tweets. The
code, data and the best performing models are available at
https://github.com/LiuPeiP-CS/IIE4MNER
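The abstract describes per-word multimodal guidance obtained by cross-modality interaction at multiple visual granularities. Below is a minimal PyTorch sketch of that idea, assuming standard multi-head cross-attention over coarse whole-image features and fine-grained object features, plus a gate that can damp visual guidance when objects are misleading or absent; all module names, dimensions, and the gating choice are illustrative assumptions, not the released implementation.

    import torch
    import torch.nn as nn

    class CrossModalGuidance(nn.Module):
        """Illustrative per-word guidance from multi-granularity visual features."""
        def __init__(self, d_model=768, n_heads=8):
            super().__init__()
            # Words attend over visual features at two granularities.
            self.img_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.obj_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            # Gate controls how much visual guidance each word receives,
            # useful when images contain misleading objects or no objects.
            self.gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())

        def forward(self, words, image_feats, object_feats):
            # words:        (B, T, d) contextual word representations
            # image_feats:  (B, P, d) coarse-grained whole-image features
            # object_feats: (B, O, d) fine-grained detected-object features
            img_ctx, _ = self.img_attn(words, image_feats, image_feats)
            obj_ctx, _ = self.obj_attn(words, object_feats, object_feats)
            visual = img_ctx + obj_ctx
            g = self.gate(torch.cat([words, visual], dim=-1))
            return words + g * visual  # per-word multimodal guidance representation

The gated output can then feed a standard sequence-labeling head (for example a CRF layer) for entity tagging.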
Related papers
- Knowing Where to Focus: Attention-Guided Alignment for Text-based Person Search [64.15205542003056]
We introduce the Attention-Guided Alignment (AGA) framework, featuring two innovative components: Attention-Guided Mask (AGM) Modeling and a Text Enrichment Module (TEM).
AGA achieves new state-of-the-art results with Rank-1 accuracy reaching 78.36%, 67.31%, and 67.4% on CUHK-PEDES, ICFG-PEDES, and RSTP, respectively.
arXiv Detail & Related papers (2024-12-19T17:51:49Z) - Embedding and Enriching Explicit Semantics for Visible-Infrared Person Re-Identification [31.011118085494942]
Visible-infrared person re-identification (VIReID) retrieves pedestrian images with the same identity across different modalities.
Existing methods learn visual content solely from images, lacking the capability to sense high-level semantics.
We propose an Embedding and Enriching Explicit Semantics framework to learn semantically rich cross-modality pedestrian representations.
arXiv Detail & Related papers (2024-12-11T14:27:30Z) - MVAM: Multi-View Attention Method for Fine-grained Image-Text Matching [65.87255122130188]
We propose a Multi-view Attention Method (MVAM) for image-text matching.
We also incorporate an objective to explicitly encourage attention heads to focus on distinct aspects of the input data.
Our method allows models to encode images and text from different perspectives and focus on more critical details, leading to better matching performance.
arXiv Detail & Related papers (2024-02-27T06:11:54Z) - A Dual-way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking [17.847936914174543]
Multimodal Entity Linking (MEL) aims to link ambiguous mentions with multimodal information to entities in a Knowledge Graph (KG) such as Wikipedia.
We formulate multimodal entity linking as a neural text matching problem in which each piece of multimodal information (text and image) is treated as a query.
This paper introduces a dual-way enhanced (DWE) framework for MEL.
arXiv Detail & Related papers (2023-12-19T03:15:50Z) - Learning Comprehensive Representations with Richer Self for
Text-to-Image Person Re-Identification [34.289949134802086]
Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text.
Existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view.
We propose a framework, called LCR$^2$S, for modeling many-to-many correspondences of the same identity by learning representations for both modalities from a novel perspective.
arXiv Detail & Related papers (2023-10-17T12:39:16Z) - Improving Multimodal Classification of Social Media Posts by Leveraging
Image-Text Auxiliary Tasks [38.943074586111564]
We present an extensive study of the effectiveness of using two auxiliary losses jointly with the main task when fine-tuning multimodal models; a sketch of both objectives appears after this list.
First, Image-Text Contrastive (ITC) is designed to minimize the distance between the image and text representations within a post.
Second, Image-Text Matching (ITM) enhances the model's ability to understand the semantic relationship between images and text.
arXiv Detail & Related papers (2023-09-14T15:30:59Z) - NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z) - Cross-Media Keyphrase Prediction: A Unified Framework with
Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z) - Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z) - Expressing Objects just like Words: Recurrent Visual Embedding for
Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) that processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z) - A multimodal deep learning approach for named entity recognition from
social media [1.9511777443446214]
We propose two novel deep learning approaches utilizing multimodal deep learning and Transformers.
Both of our approaches use image features from short social media posts to provide better results on the NER task.
The experimental results, measured by precision, recall, and F1 score, show the superiority of our work compared to other state-of-the-art NER solutions.
arXiv Detail & Related papers (2020-01-19T19:37:45Z)