AMELI: Enhancing Multimodal Entity Linking with Fine-Grained Attributes
- URL: http://arxiv.org/abs/2305.14725v1
- Date: Wed, 24 May 2023 05:01:48 GMT
- Title: AMELI: Enhancing Multimodal Entity Linking with Fine-Grained Attributes
- Authors: Barry Menglong Yao, Yu Chen, Qifan Wang, Sijia Wang, Minqian Liu,
Zhiyang Xu, Licheng Yu, Lifu Huang
- Abstract summary: We propose attribute-aware multimodal entity linking, where the input is a mention described with a text and an image.
The goal is to predict the corresponding target entity from a multimodal knowledge base.
To support this research, we construct AMELI, a large-scale dataset consisting of 18,472 reviews and 35,598 products.
- Score: 22.158388220889865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose attribute-aware multimodal entity linking, where the input is a
mention described with a text and an image, and the goal is to predict the
corresponding target entity from a multimodal knowledge base (KB) where each
entity is also described with a text description, a visual image and a set of
attributes and values. To support this research, we construct AMELI, a
large-scale dataset consisting of 18,472 reviews and 35,598 products. To
establish baseline performance on AMELI, we experiment with the current
state-of-the-art multimodal entity linking approaches and our enhanced
attribute-aware model and demonstrate the importance of incorporating the
attribute information into the entity linking process. To the best of our
knowledge, we are the first to build a benchmark dataset and solutions for the
attribute-aware multimodal entity linking task. The datasets and code will be
made publicly available.
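To make the task setup concrete, the sketch below lays out the data structures the abstract describes (a mention with text and an image, and KB entities with a description, an image, and attribute-value pairs) together with a naive linking score. Everything here is an illustrative assumption rather than the AMELI model: the class and field names are invented, the encoders are toy stand-ins (bag-of-words overlap and precomputed image vectors), and the weighted-sum scoring is just one plausible way to combine the three signals.

```python
# Toy sketch of attribute-aware multimodal entity linking.
# Not the authors' implementation: names, encoders, and weights are assumptions.
from dataclasses import dataclass
from typing import Dict, List
import math


@dataclass
class Entity:
    """One KB entry: text description, image feature, and attribute-value pairs."""
    name: str
    description: str
    image_feat: List[float]     # assumed to be a precomputed visual embedding
    attributes: Dict[str, str]  # e.g. {"color": "black", "capacity": "64 GB"}


@dataclass
class Mention:
    """A review mention: free text plus an image feature."""
    text: str
    image_feat: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0


def text_sim(a: str, b: str) -> float:
    # crude bag-of-words Jaccard overlap as a stand-in for a learned text encoder
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)


def attribute_score(mention_text: str, attributes: Dict[str, str]) -> float:
    # fraction of the entity's attribute values literally mentioned in the review
    hits = sum(1 for v in attributes.values() if v.lower() in mention_text.lower())
    return hits / max(len(attributes), 1)


def link(mention: Mention, kb: List[Entity],
         w_text: float = 0.4, w_img: float = 0.3, w_attr: float = 0.3) -> Entity:
    # rank candidates by a weighted mix of textual, visual, and attribute evidence
    def score(e: Entity) -> float:
        return (w_text * text_sim(mention.text, e.description)
                + w_img * cosine(mention.image_feat, e.image_feat)
                + w_attr * attribute_score(mention.text, e.attributes))
    return max(kb, key=score)
```

Swapping the toy similarity functions for real text and vision encoders, and learning the weights rather than fixing them, is where the actual baselines and the attribute-aware model evaluated in the paper would differ from this minimal version.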
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method based on knowledge-guided manipulation of visual attributes.
The framework extracts knowledge-grounded attributes from symbolic KBs to generate image-text pairs that are semantically consistent yet distinctive.
This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
- Leveraging Entity Information for Cross-Modality Correlation Learning: The Entity-Guided Multimodal Summarization [49.08348604716746]
Multimodal Summarization with Multimodal Output (MSMO) aims to produce a multimodal summary that integrates both text and relevant images.
In this paper, we propose an Entity-Guided Multimodal Summarization model (EGMS).
Our model, building on BART, utilizes dual multimodal encoders with shared weights to process text-image and entity-image information concurrently.
arXiv Detail & Related papers (2024-08-06T12:45:56Z)
- MAC: A Benchmark for Multiple Attributes Compositional Zero-Shot Learning [33.12021227971062]
Compositional Zero-Shot Learning (CZSL) aims to learn semantic primitives (attributes and objects) from seen compositions and to recognize unseen attribute-object compositions.
We introduce the Multi-Attribute Composition dataset, encompassing 18,217 images and 11,067 compositions with comprehensive, representative, and diverse attribute annotations.
Our dataset supports deeper semantic understanding and higher-order attribute associations, providing a more realistic and challenging benchmark for the CZSL task.
arXiv Detail & Related papers (2024-06-18T16:24:48Z)
- EIVEN: Efficient Implicit Attribute Value Extraction using Multimodal LLM [52.016009472409166]
EIVEN is a data- and parameter-efficient generative framework for implicit attribute value extraction.
We introduce a novel Learning-by-Comparison technique to reduce model confusion.
Our experiments reveal that EIVEN significantly outperforms existing methods in extracting implicit attribute values.
arXiv Detail & Related papers (2024-04-13T03:15:56Z)
- Attribute-Consistent Knowledge Graph Representation Learning for Multi-Modal Entity Alignment [14.658282035561792]
We propose a novel attribute-consistent knowledge graph representation learning framework for multi-modal entity alignment (ACK-MMEA).
Our approach achieves excellent performance compared to its competitors.
arXiv Detail & Related papers (2023-04-04T06:39:36Z)
- Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two practical instance-level retrieval tasks.
We then train a more effective cross-modal model that can adaptively incorporate key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)
- AdaTag: Multi-Attribute Value Extraction from Product Profiles with Adaptive Decoding [55.89773725577615]
We present AdaTag, which uses adaptive decoding to handle attribute extraction.
Our experiments on a real-world e-Commerce dataset show marked improvements over previous methods.
arXiv Detail & Related papers (2021-06-04T07:54:11Z)
- Multimodal Entity Linking for Tweets [6.439761523935613]
Multimodal entity linking (MEL) is an emerging research field in which textual and visual information is used to map an ambiguous mention to an entity in a knowledge base (KB).
We propose a method for building a fully annotated Twitter dataset for MEL, where entities are defined in a Twitter KB.
Then, we propose a model for jointly learning a representation of both mentions and entities from their textual and visual contexts.
arXiv Detail & Related papers (2021-04-07T16:40:23Z)
- Multimodal Joint Attribute Prediction and Value Extraction for E-commerce Product [40.46223408546036]
Product attribute values are essential in many e-commerce scenarios, such as customer service robots, product recommendations, and product retrieval.
In the real world, however, the attribute values of a product are usually incomplete and vary over time, which greatly hinders practical applications.
We propose a multimodal method to jointly predict product attributes and extract values from textual product descriptions with the help of the product images.
arXiv Detail & Related papers (2020-09-15T15:10:51Z)