Multimodal Entity Tagging with Multimodal Knowledge Base
- URL: http://arxiv.org/abs/2201.00693v1
- Date: Tue, 21 Dec 2021 15:04:57 GMT
- Title: Multimodal Entity Tagging with Multimodal Knowledge Base
- Authors: Hao Peng, Hang Li, Lei Hou, Juanzi Li, Chao Qiao
- Abstract summary: We propose a new task called multimodal entity tagging (MET) with a multimodal knowledge base (MKB).
In MET, given a text-image pair, one uses the information in the MKB to automatically identify the related entity in the text-image pair.
We conduct extensive experiments and analyze the results.
- Score: 45.84732232595781
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To enhance research on multimodal knowledge bases and multimodal information
processing, we propose a new task called multimodal entity tagging (MET) with a
multimodal knowledge base (MKB). We also develop a dataset for the problem
using an existing MKB. In an MKB, there are entities and their associated texts
and images. In MET, given a text-image pair, one uses the information in the
MKB to automatically identify the related entity in the text-image pair. We
solve the task by using the information retrieval paradigm and implement
several baselines using state-of-the-art methods in NLP and CV. We conduct
extensive experiments and analyze the experimental results. The
results show that the task is challenging, but current technologies can achieve
relatively high performance. We will release the dataset, code, and models for
future research.
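The abstract frames MET as retrieval over the MKB: embed the query text-image pair, embed each entity's associated texts and images, and rank entities by similarity. The sketch below illustrates that retrieval paradigm only; the paper does not specify its baselines here, so the CLIP encoders, the concatenation-based fusion, and the embed/tag_entity/mkb names are illustrative assumptions rather than the authors' implementation.
```python
# Minimal retrieval-style MET sketch (illustrative only): rank MKB entities
# against a query text-image pair by embedding similarity. CLIP is an
# assumed stand-in encoder, not necessarily what the paper's baselines use.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(text: str, image: Image.Image) -> torch.Tensor:
    """Encode a text-image pair into one L2-normalized vector (concatenation fusion)."""
    with torch.no_grad():
        t_in = processor(text=[text], return_tensors="pt", truncation=True)
        v_in = processor(images=[image], return_tensors="pt")
        t = model.get_text_features(**t_in)      # (1, d_text)
        v = model.get_image_features(**v_in)     # (1, d_image)
    z = torch.cat([t, v], dim=-1)                # simple fusion by concatenation
    return torch.nn.functional.normalize(z, dim=-1).squeeze(0)

def tag_entity(query_text: str, query_image: Image.Image, mkb: list):
    """mkb: list of (entity_id, entity_text, entity_image) triples from the knowledge base."""
    query_vec = embed(query_text, query_image)
    entity_vecs = torch.stack([embed(t, img) for _, t, img in mkb])
    scores = entity_vecs @ query_vec             # cosine similarity of normalized vectors
    best = int(scores.argmax())
    return mkb[best][0], float(scores[best])
```
In practice the entity embeddings would be computed once and cached as an index rather than re-encoded per query, which is what makes the retrieval formulation practical when the MKB is large.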
Related papers
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z) - MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
This dataset includes figures such as schematic diagrams, simulated images, macroscopic/microscopic photos, and experimental visualizations.
We developed benchmarks for scientific figure captioning and multiple-choice questions, evaluating six proprietary and over ten open-source models.
The dataset and benchmarks will be released to support further research.
arXiv Detail & Related papers (2024-07-06T00:40:53Z) - Exploring the Capabilities of Large Multimodal Models on Dense Text [58.82262549456294]
We propose the DT-VQA dataset, with 170k question-answer pairs.
In this paper, we conduct a comprehensive evaluation of GPT4V, Gemini, and various open-source LMMs.
We find that even with automatically labeled training datasets, significant improvements in model performance can be achieved.
arXiv Detail & Related papers (2024-05-09T07:47:25Z) - Generative Multi-Modal Knowledge Retrieval with Large Language Models [75.70313858231833]
We propose an innovative end-to-end generative framework for multi-modal knowledge retrieval.
Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases.
We demonstrate significant improvements ranging from 3.0% to 14.6% across all evaluation metrics when compared to strong baselines.
arXiv Detail & Related papers (2024-01-16T08:44:29Z) - WikiDiverse: A Multimodal Entity Linking Dataset with Diversified
Contextual Topics and Entity Types [25.569170440376165]
Multimodal Entity Linking (MEL) aims at linking mentions with multimodal contexts to referent entities from a knowledge base (e.g., Wikipedia).
We present WikiDiverse, a high-quality human-annotated MEL dataset with diversified contextual topics and entity types from Wikinews.
Based on WikiDiverse, a sequence of well-designed MEL models with intra-modality and inter-modality attentions is implemented.
arXiv Detail & Related papers (2022-04-13T12:52:40Z) - Single-Modal Entropy based Active Learning for Visual Question Answering [75.1682163844354]
We address Active Learning in the multi-modal setting of Visual Question Answering (VQA).
In light of the multi-modal inputs, image and question, we propose a novel method for effective sample acquisition.
Our novel idea is simple to implement, cost-efficient, and readily adaptable to other multi-modal tasks.
arXiv Detail & Related papers (2021-10-21T05:38:45Z) - MELINDA: A Multimodal Dataset for Biomedical Experiment Method
Classification [14.820951153262685]
We introduce a new dataset, MELINDA, for Multimodal biomEdicaL experImeNt methoD clAssification.
The dataset is collected in a fully automated distant supervision manner, where the labels are obtained from an existing curated database.
We benchmark various state-of-the-art NLP and computer vision models, including unimodal models which only take either caption texts or images as inputs.
arXiv Detail & Related papers (2020-12-16T19:11:36Z) - MLM: A Benchmark Dataset for Multitask Learning with Multiple Languages
and Modalities [14.605385352491904]
The dataset is designed for researchers and developers who build applications that perform multiple tasks on data encountered on the web and in digital archives.
A second version provides a geo-representative subset of the data with weighted samples for countries of the European Union.
arXiv Detail & Related papers (2020-08-14T14:00:05Z)