Visual Named Entity Linking: A New Dataset and A Baseline
- URL: http://arxiv.org/abs/2211.04872v1
- Date: Wed, 9 Nov 2022 13:27:50 GMT
- Title: Visual Named Entity Linking: A New Dataset and A Baseline
- Authors: Wenxiang Sun, Yixing Fan, Jiafeng Guo, Ruqing Zhang, Xueqi Cheng
- Abstract summary: We consider a purely Visual-based Named Entity Linking (VNEL) task, where the input only consists of an image.
We propose three different sub-tasks, i.e., visual to visual entity linking (V2VEL), visual to textual entity linking (V2TEL), and visual to visual-textual entity linking (V2VTEL).
We present a high-quality human-annotated visual person linking dataset, named WIKIPerson.
- Score: 61.38231023490981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Entity Linking (VEL) is the task of linking regions of
images to their corresponding entities in Knowledge Bases (KBs), which
benefits many computer vision tasks such as image retrieval, image
captioning, and visual question answering. Existing VEL tasks either rely
on textual data to complement multi-modal linking or only link objects to
general entities, and thus fail to perform named entity linking on large
amounts of image data. In
this paper, we consider a purely Visual-based Named Entity Linking (VNEL) task,
where the input only consists of an image. The task is to identify objects of
interest (i.e., visual entity mentions) in images and link them to
corresponding named entities in KBs. Since each entity often contains rich
visual and textual information in KBs, we thus propose three different
sub-tasks, i.e., visual to visual entity linking (V2VEL), visual to textual
entity linking (V2TEL), and visual to visual-textual entity linking (V2VTEL).
In addition, we present a high-quality human-annotated visual person linking
dataset, named WIKIPerson. Based on WIKIPerson, we establish a series of
baseline algorithms for the solution of each sub-task, and conduct experiments
to verify the quality of proposed datasets and the effectiveness of baseline
methods. We envision this work to be helpful for soliciting more works
regarding VNEL in the future. The codes and datasets are publicly available at
https://github.com/ict-bigdatalab/VNEL.
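To make the three sub-tasks concrete, here is a minimal retrieval-style sketch of how a visual mention could be scored against KB entities. Everything in it is an illustrative assumption rather than the paper's actual baseline: the shared embedding space (e.g., CLIP-style encoders), the `link_mention` helper, and the late-fusion weight `alpha` are all hypothetical.

```python
# Hypothetical sketch of the V2VEL / V2TEL / V2VTEL sub-tasks as
# nearest-neighbor retrieval in a shared embedding space. The encoders,
# helper names, and fusion weight are assumptions, not the paper's baseline.
import numpy as np

def cosine(query: np.ndarray, candidates: np.ndarray) -> np.ndarray:
    """Cosine similarity between one query vector and a matrix of candidates."""
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return c @ q

def link_mention(mention_vec, entity_img, entity_txt, mode="v2vtel", alpha=0.5):
    """Rank KB entities for one visual mention and return the best match."""
    if mode == "v2vel":    # visual-to-visual: match entity images only
        scores = cosine(mention_vec, entity_img)
    elif mode == "v2tel":  # visual-to-textual: match entity descriptions only
        scores = cosine(mention_vec, entity_txt)
    else:                  # v2vtel: late fusion of visual and textual scores
        scores = (alpha * cosine(mention_vec, entity_img)
                  + (1 - alpha) * cosine(mention_vec, entity_txt))
    return int(np.argmax(scores)), scores

# Toy usage: 3 KB entities with random stand-in embeddings; real embeddings
# would come from pre-trained encoders over entity images and descriptions.
rng = np.random.default_rng(0)
entity_img = rng.normal(size=(3, 64))
entity_txt = rng.normal(size=(3, 64))
mention = entity_img[1] + 0.1 * rng.normal(size=64)  # mention near entity 1
best, _ = link_mention(mention, entity_img, entity_txt, mode="v2vtel")
print("linked entity id:", best)  # expected: 1
```

In a full pipeline, a detector would first localize the visual entity mention (e.g., a person bounding box in WIKIPerson) before the linking step sketched above.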
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes.
ARMADA extracts knowledge-grounded attributes from symbolic KBs to generate semantically consistent yet distinctive image-text pairs.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- DWE+: Dual-Way Matching Enhanced Framework for Multimodal Entity Linking [16.728006492769666]
We propose DWE+ for multimodal entity linking.
DWE+ captures finer semantics and dynamically maintains semantic consistency with entities.
Experiments on Wikimel, Richpedia, and Wikidiverse datasets demonstrate the effectiveness of DWE+ in improving MEL performance.
arXiv Detail & Related papers (2024-04-07T05:56:42Z)
- Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models [31.865208971014336]
We propose a task to verify how knowledge about entities acquired from natural language is retained in Vision & Language (V&L) models.
This task consists of two parts: the first is to generate a table containing knowledge about an entity and its related image, and the second is to generate an image from an entity with a caption.
We created the Wikipedia Table and Image Generation (WikiTIG) dataset from about 200,000 infoboxes in English Wikipedia articles to perform the proposed tasks.
arXiv Detail & Related papers (2023-06-03T14:01:54Z)
- One-shot Scene Graph Generation [130.57405850346836]
We propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task.
Our method significantly outperforms existing state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-02-22T11:32:59Z)
- Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph [96.95815946327079]
It is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities.
We propose a novel approach that constructs a multi-modal knowledge graph to associate the visual objects with named entities.
arXiv Detail & Related papers (2021-07-26T05:50:41Z)
- Multimodal Entity Linking for Tweets [6.439761523935613]
Multimodal entity linking (MEL) is an emerging research field in which textual and visual information is used to map an ambiguous mention to an entity in a knowledge base (KB).
We propose a method for building a fully annotated Twitter dataset for MEL, where entities are defined in a Twitter KB.
Then, we propose a model for jointly learning a representation of both mentions and entities from their textual and visual contexts.
arXiv Detail & Related papers (2021-04-07T16:40:23Z)
- Visual Pivoting for (Unsupervised) Entity Alignment [93.82387952905756]
This work studies the use of visual semantic representations to align entities in heterogeneous knowledge graphs (KGs).
We show that the proposed new approach, EVA, creates a holistic entity representation that provides strong signals for cross-graph entity alignment.
arXiv Detail & Related papers (2020-09-28T20:09:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.