WikiDiverse: A Multimodal Entity Linking Dataset with Diversified
Contextual Topics and Entity Types
- URL: http://arxiv.org/abs/2204.06347v1
- Date: Wed, 13 Apr 2022 12:52:40 GMT
- Title: WikiDiverse: A Multimodal Entity Linking Dataset with Diversified
Contextual Topics and Entity Types
- Authors: Xuwu Wang, Junfeng Tian, Min Gui, Zhixu Li, Rui Wang, Ming Yan, Lihan
Chen, Yanghua Xiao
- Abstract summary: Multimodal Entity Linking (MEL) aims at linking mentions with multimodal contexts to referent entities from a knowledge base (e.g., Wikipedia).
We present WikiDiverse, a high-quality human-annotated MEL dataset with diversified contextual topics and entity types from Wikinews.
Based on WikiDiverse, a sequence of well-designed MEL models with intra-modality and inter-modality attention is implemented.
- Score: 25.569170440376165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Entity Linking (MEL), which aims at linking mentions with
multimodal contexts to the referent entities from a knowledge base (e.g.,
Wikipedia), is an essential task for many multimodal applications. Although
much attention has been paid to MEL, the shortcomings of existing MEL datasets,
including limited contextual topics and entity types, simplified mention
ambiguity, and restricted availability, have posed great obstacles to the
research and application of MEL. In this paper, we present WikiDiverse, a
high-quality human-annotated MEL dataset with diversified contextual topics and
entity types from Wikinews, which uses Wikipedia as the corresponding knowledge
base. A well-tailored annotation procedure is adopted to ensure the quality of
the dataset. Based on WikiDiverse, a sequence of well-designed MEL models with
intra-modality and inter-modality attention is implemented; these models
utilize the visual information of images more adequately than existing MEL
models do. Extensive experimental analyses are conducted to investigate the
contributions of different modalities to MEL, facilitating future research on
this task. The dataset and baseline models are available at
https://github.com/wangxw5/wikiDiverse.
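As a rough illustration of the "intra-modality and inter-modality attention" mentioned in the abstract, the sketch below scores knowledge-base candidates for a mention given its text context and image. It is a minimal sketch, not the released baseline: the module layout, feature dimensions, pooling, and the dot-product scorer are all assumptions.

```python
# Minimal sketch (not the authors' released code) of combining intra-modality
# and inter-modality attention to score candidate entities for multimodal
# entity linking. Dimensions, pooling, and the scorer are assumptions.
import torch
import torch.nn as nn


class MELAttentionScorer(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Intra-modality attention: tokens attend to tokens, patches to patches.
        self.text_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.image_self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Inter-modality attention: mention tokens attend to image patches.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_feats, image_feats, entity_embs):
        # text_feats:  (B, T, D) token features of the mention and its context
        # image_feats: (B, P, D) patch features of the accompanying image
        # entity_embs: (B, C, D) embeddings of C candidate Wikipedia entities
        t, _ = self.text_self_attn(text_feats, text_feats, text_feats)
        v, _ = self.image_self_attn(image_feats, image_feats, image_feats)
        fused, _ = self.cross_attn(t, v, v)      # text queries, visual keys/values
        mention = self.proj(fused.mean(dim=1))   # (B, D) pooled mention vector
        # Dot-product score of the mention against each candidate entity.
        return torch.einsum("bd,bcd->bc", mention, entity_embs)


if __name__ == "__main__":
    B, T, P, C, D = 2, 12, 49, 5, 256
    scorer = MELAttentionScorer(dim=D)
    scores = scorer(torch.randn(B, T, D), torch.randn(B, P, D), torch.randn(B, C, D))
    print(scores.shape)  # torch.Size([2, 5]); argmax gives the predicted entity
```

In a real pipeline the text and image features would come from pretrained encoders (e.g., a BERT-style text model and a CNN/ViT), and training would typically minimize a cross-entropy or ranking loss over the candidate scores.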
Related papers
- FedMLLM: Federated Fine-tuning MLLM on Multimodal Heterogeneity Data [64.50893177169996]
Fine-tuning Multimodal Large Language Models (MLLMs) with Federated Learning (FL) allows for expanding the training data scope by including private data sources.
We introduce a benchmark for evaluating various downstream tasks in the federated fine-tuning of MLLMs within multimodal heterogeneous scenarios.
We develop a general FedMLLM framework that integrates four representative FL methods alongside two modality-agnostic strategies.
arXiv Detail & Related papers (2024-11-22T04:09:23Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, showing their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z) - Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
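For the "dual multimodal encoders with shared weights" described above, a minimal way to realize weight sharing is to reuse a single encoder instance for both the text-image and the entity-image streams. The sketch below assumes a generic cross-attention block and illustrative dimensions rather than EGMS's actual BART-based architecture.

```python
# Sketch of "dual multimodal encoders with shared weights": the same encoder
# instance (hence shared parameters) processes both the text-image and the
# entity-image streams. Layer sizes and fusion are illustrative assumptions.
import torch
import torch.nn as nn


class SharedMultimodalEncoder(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, tokens, image_patches):
        # Tokens (text or entity names) attend over image patch features.
        fused, _ = self.cross_attn(tokens, image_patches, image_patches)
        return self.ffn(fused) + tokens


if __name__ == "__main__":
    enc = SharedMultimodalEncoder()
    text, entities, patches = (torch.randn(2, 20, 256),
                               torch.randn(2, 6, 256),
                               torch.randn(2, 49, 256))
    # Reusing `enc` for both calls is what weight sharing amounts to in practice.
    text_out, entity_out = enc(text, patches), enc(entities, patches)
    print(text_out.shape, entity_out.shape)
```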
arXiv Detail & Related papers (2024-08-06T12:45:56Z) - UniMEL: A Unified Framework for Multimodal Entity Linking with Large Language Models [0.42832989850721054]
Multimodal Entity Linking (MEL) is a crucial task that aims at linking ambiguous mentions within multimodal contexts to referent entities in a multimodal knowledge base, such as Wikipedia.
Existing methods overcomplicate the MEL task and overlook the visual semantic information, which makes them costly and hard to scale.
We propose UniMEL, a unified framework which establishes a new paradigm to process multimodal entity linking tasks using Large Language Models.
arXiv Detail & Related papers (2024-07-23T03:58:08Z) - Semantic-Aware Representation of Multi-Modal Data for Data Ingress: A Literature Review [1.8590097948961688]
Generative AI such as Large Language Models (LLMs) is seeing broad adoption for processing multi-modal data such as text, images, audio, and video.
Managing this data efficiently has become a significant practical challenge in industry: twice as much data is not twice as good.
This study focuses on the different semantic-aware techniques to extract embeddings from mono-modal, multi-modal, and cross-modal data.
arXiv Detail & Related papers (2024-07-17T09:49:11Z) - MESED: A Multi-modal Entity Set Expansion Dataset with Fine-grained
Semantic Classes and Hard Negative Entities [25.059177235004952]
We propose Multi-modal Entity Set Expansion (MESE), where models integrate information from multiple modalities to represent entities.
A powerful multi-modal model, MultiExpan, pre-trained on four multimodal pre-training tasks, is proposed.
The MESED dataset is the first multi-modal dataset for ESE with large-scale and elaborate manual calibration.
arXiv Detail & Related papers (2023-07-27T14:09:59Z) - Multi-Grained Multimodal Interaction Network for Entity Linking [65.30260033700338]
The multimodal entity linking task aims at resolving ambiguous mentions against a multimodal knowledge graph.
We propose a novel Multi-GraIned Multimodal InteraCtion Network (MIMIC) framework for solving the MEL task.
arXiv Detail & Related papers (2023-07-19T02:11:19Z) - Multi-modal Contrastive Representation Learning for Entity Alignment [57.92705405276161]
Multi-modal entity alignment aims to identify equivalent entities between two different multi-modal knowledge graphs.
We propose MCLEA, a Multi-modal Contrastive Learning based Entity Alignment model.
In particular, MCLEA firstly learns multiple individual representations from multiple modalities, and then performs contrastive learning to jointly model intra-modal and inter-modal interactions.
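The contrastive step described above is commonly realized with an InfoNCE-style objective over paired embeddings from two modalities. The snippet below is a generic sketch of that idea; the symmetric formulation and temperature value are assumptions, not MCLEA's exact loss.

```python
# Generic InfoNCE-style contrastive loss between two modality embeddings of
# the same entities (e.g., text vs. image views); not MCLEA's exact objective.
import torch
import torch.nn.functional as F


def inter_modal_contrastive_loss(a: torch.Tensor, b: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """a, b: (N, D) embeddings where row i of `a` and row i of `b` describe
    the same entity; all other rows act as in-batch negatives."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature          # (N, N) cosine-similarity logits
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetric loss: align a -> b and b -> a.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    text_emb, image_emb = torch.randn(8, 128), torch.randn(8, 128)
    print(inter_modal_contrastive_loss(text_emb, image_emb).item())
```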
arXiv Detail & Related papers (2022-09-02T08:59:57Z) - Multimodal Entity Tagging with Multimodal Knowledge Base [45.84732232595781]
We propose a new task called multimodal entity tagging (MET) with a multimodal knowledge base (MKB).
In MET, given a text-image pair, one uses the information in the MKB to automatically identify the related entity in the text-image pair.
We conduct extensive experiments and analyze the experimental results.
arXiv Detail & Related papers (2021-12-21T15:04:57Z)