VP-MEL: Visual Prompts Guided Multimodal Entity Linking
- URL: http://arxiv.org/abs/2412.06720v4
- Date: Sat, 15 Feb 2025 13:30:14 GMT
- Title: VP-MEL: Visual Prompts Guided Multimodal Entity Linking
- Authors: Hongze Mi, Jinyuan Li, Xuying Zhang, Haoran Cheng, Jiahao Wang, Di Sun, Gang Pan
- Abstract summary: Multimodal entity linking (MEL) is a task aimed at linking mentions within multimodal contexts to their corresponding entities in a knowledge base (KB).
Existing MEL methods often rely on mention words as retrieval cues, which limits their ability to effectively utilize information from both images and text.
We propose a framework named IIER, which enhances visual feature extraction using visual prompts and leverages the pretrained Detective-VLM model to capture latent information.
- Score: 16.463229055333407
- Abstract: Multimodal entity linking (MEL), a task aimed at linking mentions within multimodal contexts to their corresponding entities in a knowledge base (KB), has attracted much attention in recent years due to its wide range of applications. However, existing MEL methods often rely on mention words as retrieval cues, which limits their ability to effectively utilize information from both images and text. This reliance causes MEL to struggle with accurately retrieving entities in certain scenarios, especially when the focus is on image objects or mention words are missing from the text. To solve these issues, we introduce a Visual Prompts guided Multimodal Entity Linking (VP-MEL) task. Given a text-image pair, VP-MEL aims to link a marked region (i.e., visual prompt) in an image to its corresponding entities in the knowledge base. To facilitate this task, we present a new dataset, VPWiki, specifically designed for VP-MEL. Furthermore, we propose a framework named IIER, which enhances visual feature extraction using visual prompts and leverages the pretrained Detective-VLM model to capture latent information. Experimental results on the VPWiki dataset demonstrate that IIER outperforms baseline methods across multiple benchmarks for the VP-MEL task.
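To make the task interface concrete, below is a minimal, hypothetical retrieval sketch: a query embedding built from the image, the marked region (visual prompt), and the accompanying text is scored against KB entity embeddings by cosine similarity. The encoder functions are placeholders and do not reproduce IIER's actual components (visual-prompt feature extraction, Detective-VLM); they only illustrate the task's inputs and outputs.

```python
import numpy as np

# Hypothetical encoders -- stand-ins for a visual-prompt-aware query encoder
# and a KB entity encoder. They return random vectors and are NOT the
# paper's IIER / Detective-VLM components.
def encode_query(image: np.ndarray, prompt_box: tuple, text: str) -> np.ndarray:
    """Return a joint embedding for (image, marked region, text). Placeholder."""
    return np.random.default_rng(0).standard_normal(512)

def encode_entity(name: str, description: str) -> np.ndarray:
    """Return an embedding for a KB entity. Placeholder."""
    return np.random.default_rng(abs(hash(name)) % (2**32)).standard_normal(512)

def link_visual_prompt(image, prompt_box, text, kb):
    """Rank KB entities by cosine similarity to the visual-prompt query."""
    q = encode_query(image, prompt_box, text)
    q = q / np.linalg.norm(q)
    scores = {}
    for name, desc in kb.items():
        e = encode_entity(name, desc)
        scores[name] = float(q @ (e / np.linalg.norm(e)))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy usage: the bounding box (x1, y1, x2, y2) acts as the visual prompt,
# so linking can proceed even when no mention word appears in the text.
kb = {"Eiffel Tower": "Wrought-iron tower in Paris",
      "Tokyo Tower": "Lattice tower in Tokyo"}
image = np.zeros((224, 224, 3), dtype=np.uint8)
print(link_visual_prompt(image, (40, 20, 180, 200), "A landmark lit up at night.", kb))
```

In this sketch the visual prompt is simply a box passed alongside the full image and text, mirroring the setting where the retrieval cue is the marked region rather than a mention word.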
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularity.
arXiv Detail & Related papers (2025-02-18T12:00:47Z)
- ImageRef-VL: Enabling Contextual Image Referencing in Vision-Language Models [15.907584884933414]
We introduce Contextual Image Reference -- the ability to appropriately reference relevant images from retrieval documents based on conversation context.
We propose ImageRef-VL, a method that significantly enhances open-source VLMs' image referencing capabilities.
Experimental results demonstrate that ImageRef-VL not only outperforms proprietary models but also achieves an 88% performance improvement over state-of-the-art open-source VLMs.
arXiv Detail & Related papers (2025-01-20T13:43:45Z)
- Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities.
We propose a novel framework based on multimodal retrieval-augmented generation (RAG), which introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z)
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large language model (MLLM).
It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z)
- NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.
Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.
We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z)
- Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z)
- Few-Shot Relation Extraction with Hybrid Visual Evidence [3.154631846975021]
We propose a multi-modal few-shot relation extraction model (MFS-HVE).
MFS-HVE includes semantic feature extractors and multi-modal fusion components.
Experiments conducted on two public datasets demonstrate that semantic visual information significantly improves the performance of few-shot relation prediction.
arXiv Detail & Related papers (2024-03-01T18:20:11Z)
- Multimodal Entity Linking for Tweets [6.439761523935613]
Multimodal entity linking (MEL) is an emerging research field in which textual and visual information is used to map an ambiguous mention to an entity in a knowledge base (KB).
We propose a method for building a fully annotated Twitter dataset for MEL, where entities are defined in a Twitter KB.
Then, we propose a model for jointly learning a representation of both mentions and entities from their textual and visual contexts; a generic sketch of such a joint objective follows this entry.
arXiv Detail & Related papers (2021-04-07T16:40:23Z)
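As a loose illustration of the joint mention-entity representation learning described in the entry above, the following hypothetical sketch scores a fused mention embedding against a gold and a negative entity embedding with a margin-based (triplet-style) objective. The encoders, margin value, and negative sampling are assumptions for illustration, not the authors' actual model.

```python
import numpy as np

def triplet_loss(mention_vec: np.ndarray,
                 pos_entity_vec: np.ndarray,
                 neg_entity_vec: np.ndarray,
                 margin: float = 0.2) -> float:
    """Hinge loss pushing the mention embedding closer to the gold entity
    than to a negative entity; a generic stand-in for a joint
    mention-entity training objective."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(0.0, margin - cos(mention_vec, pos_entity_vec) + cos(mention_vec, neg_entity_vec))

# Toy usage: random vectors stand in for fused text+image features of a
# tweet mention, its gold KB entity, and a sampled negative entity.
rng = np.random.default_rng(1)
mention, gold_entity, negative_entity = rng.standard_normal((3, 256))
print(triplet_loss(mention, gold_entity, negative_entity))
```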
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.