Multi-Granularity Cross-Modality Representation Learning for Named
Entity Recognition on Social Media
- URL: http://arxiv.org/abs/2210.14163v1
- Date: Wed, 19 Oct 2022 15:14:55 GMT
- Title: Multi-Granularity Cross-Modality Representation Learning for Named
Entity Recognition on Social Media
- Authors: Peipei Liu, Gaosheng Wang, Hong Li, Jie Liu, Yimo Ren, Hongsong Zhu,
Limin Sun
- Abstract summary: Named Entity Recognition (NER) on social media refers to discovering and classifying entities from unstructured free-form content.
This work introduces multi-granularity cross-modality representation learning.
Experiments show that our proposed approach achieves SOTA or near-SOTA performance on two benchmark tweet datasets.
- Score: 11.235498285650142
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Named Entity Recognition (NER) on social media refers to discovering and
classifying entities in unstructured free-form content, and it plays an
important role in applications such as intention understanding and user
recommendation. As social media posts tend to be multimodal, Multimodal Named
Entity Recognition (MNER) for text with its accompanying image is attracting
more and more attention, since some textual components can only be understood
in combination with visual information. However, existing approaches have two
drawbacks: 1) The meanings of a text and its accompanying image do not always
match, so the text information still plays the major role. Yet social media
posts are usually shorter and more informal than ordinary text, which easily
leads to incomplete semantic descriptions and data sparsity. 2) Although visual
representations of whole images or objects are already used, existing methods
ignore either the fine-grained semantic correspondence between objects in
images and words in the text, or the fact that some images contain misleading
objects or no objects at all. In this work, we address both problems by
introducing multi-granularity cross-modality representation learning. For the
first problem, we enhance the representation of each word in the text through
semantic augmentation. For the second, we perform cross-modality semantic
interaction between text and vision at different visual granularities to
obtain the most effective multimodal guidance representation for every word.
Experiments show that our approach achieves SOTA or near-SOTA performance on
two benchmark tweet datasets. The code, data, and best-performing models are
available at https://github.com/LiuPeiP-CS/IIE4MNER
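
As a rough illustration of the idea described in the abstract (not the authors'
released implementation, which is in the repository linked above), the minimal
PyTorch sketch below fuses each word representation with visual features at two
granularities, object regions and the whole image, and gates the fused signal so
that words can down-weight misleading or absent objects. All class names,
dimensions, and the gating scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn


class MultiGranularityGuidance(nn.Module):
    """Illustrative sketch: per-word multimodal guidance from cross-modal
    attention over object-level (fine-grained) and whole-image (coarse-grained)
    visual features, with a learned gate over the visual contribution."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # word-to-object and word-to-image cross-modal attention
        self.obj_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.img_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # gate deciding how much visual guidance each word accepts
        self.gate = nn.Sequential(nn.Linear(dim * 3, dim), nn.Sigmoid())

    def forward(self, words, objects, image):
        # words:   (B, T, dim)  semantically augmented word representations
        # objects: (B, O, dim)  object-region visual features
        # image:   (B, P, dim)  whole-image (patch/grid) visual features
        obj_ctx, _ = self.obj_attn(words, objects, objects)  # fine-grained guidance
        img_ctx, _ = self.img_attn(words, image, image)      # coarse-grained guidance
        g = self.gate(torch.cat([words, obj_ctx, img_ctx], dim=-1))
        return words + g * (obj_ctx + img_ctx)               # guided word representations


if __name__ == "__main__":
    model = MultiGranularityGuidance()
    out = model(torch.randn(2, 16, 768),   # 16 words
                torch.randn(2, 5, 768),    # 5 detected objects
                torch.randn(2, 49, 768))   # 7x7 image grid
    print(out.shape)  # torch.Size([2, 16, 768])
```

In a full MNER pipeline, the guided word representations would feed a standard
tagging head (e.g. a CRF layer) to predict entity labels per word.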
Related papers
- A Dual-way Enhanced Framework from Text Matching Point of View for Multimodal Entity Linking [17.847936914174543]
Multimodal Entity Linking (MEL) aims at linking ambiguous mentions, together with their multimodal information, to entities in a Knowledge Graph (KG) such as Wikipedia.
We formulate multimodal entity linking as a neural text matching problem where each multimodal information (text and image) is treated as a query.
This paper introduces a dual-way enhanced (DWE) framework for MEL.
arXiv Detail & Related papers (2023-12-19T03:15:50Z) - Learning Comprehensive Representations with Richer Self for
Text-to-Image Person Re-Identification [34.289949134802086]
Text-to-image person re-identification (TIReID) retrieves pedestrian images of the same identity based on a query text.
Existing methods for TIReID typically treat it as a one-to-one image-text matching problem, only focusing on the relationship between image-text pairs within a view.
We propose a framework, called LCR$^2$S, for modeling many-to-many correspondences of the same identity by learning representations for both modalities from a novel perspective.
arXiv Detail & Related papers (2023-10-17T12:39:16Z) - Improving Multimodal Classification of Social Media Posts by Leveraging
Image-Text Auxiliary Tasks [38.943074586111564]
We present an extensive study on the effectiveness of using two auxiliary losses jointly with the main task during fine-tuning multimodal models.
First, Image-Text Contrastive (ITC) is designed to minimize the distance between image-text representations within a post.
Second, Image-Text Matching (ITM) enhances the model's ability to understand the semantic relationship between images and text.
arXiv Detail & Related papers (2023-09-14T15:30:59Z) - Image-text Retrieval via Preserving Main Semantics of Vision [5.376441473801597]
This paper presents a semantic optimization approach, implemented as a Visual Semantic Loss (VSL).
We leverage the annotated texts corresponding to an image to assist the model in capturing the main content of the image.
Experiments on two benchmark datasets demonstrate the superior performance of our method.
arXiv Detail & Related papers (2023-04-20T12:23:29Z) - Learning to Model Multimodal Semantic Alignment for Story Visualization [58.16484259508973]
Story visualization aims to generate a sequence of images to narrate each sentence in a multi-sentence story.
Current works face the problem of semantic misalignment because of their fixed architecture and diversity of input modalities.
We explore the semantic alignment between text and image representations by learning to match their semantic levels in the GAN-based generative model.
arXiv Detail & Related papers (2022-11-14T11:41:44Z) - Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z) - NewsStories: Illustrating articles with visual summaries [49.924916589209374]
We introduce a large-scale multimodal dataset containing over 31M articles, 22M images and 1M videos.
We show that state-of-the-art image-text alignment methods are not robust to longer narratives with multiple images.
We introduce an intuitive baseline that outperforms these methods on zero-shot image-set retrieval by 10% on the GoodNews dataset.
arXiv Detail & Related papers (2022-07-26T17:34:11Z) - Cross-Media Keyphrase Prediction: A Unified Framework with
Multi-Modality Multi-Head Attention and Image Wordings [63.79979145520512]
We explore the joint effects of texts and images in predicting the keyphrases for a multimedia post.
We propose a novel Multi-Modality Multi-Head Attention (M3H-Att) to capture the intricate cross-media interactions.
Our model significantly outperforms the previous state of the art based on traditional attention networks.
arXiv Detail & Related papers (2020-11-03T08:44:18Z) - Text as Neural Operator: Image Manipulation by Text Instruction [68.53181621741632]
In this paper, we study a setting that allows users to edit an image with multiple objects using complex text instructions to add, remove, or change the objects.
The inputs of the task are multimodal including (1) a reference image and (2) an instruction in natural language that describes desired modifications to the image.
We show that the proposed model performs favorably against recent strong baselines on three public datasets.
arXiv Detail & Related papers (2020-08-11T07:07:10Z) - Expressing Objects just like Words: Recurrent Visual Embedding for
Image-Text Matching [102.62343739435289]
Existing image-text matching approaches infer the similarity of an image-text pair by capturing and aggregating the affinities between the text and each independent object of the image.
We propose a Dual Path Recurrent Neural Network (DP-RNN) which processes images and sentences symmetrically with recurrent neural networks (RNNs).
Our model achieves the state-of-the-art performance on Flickr30K dataset and competitive performance on MS-COCO dataset.
arXiv Detail & Related papers (2020-02-20T00:51:01Z) - A multimodal deep learning approach for named entity recognition from
social media [1.9511777443446214]
We propose two novel deep learning approaches utilizing multimodal deep learning and Transformers.
Both of our approaches use image features from short social media posts to provide better results on the NER task.
The experimental results, namely precision, recall, and F1 score, show the superiority of our work compared with other state-of-the-art NER solutions.
arXiv Detail & Related papers (2020-01-19T19:37:45Z)