VP-MEL: Visual Prompts Guided Multimodal Entity Linking
- URL: http://arxiv.org/abs/2412.06720v4
- Date: Sat, 15 Feb 2025 13:30:14 GMT
- Title: VP-MEL: Visual Prompts Guided Multimodal Entity Linking
- Authors: Hongze Mi, Jinyuan Li, Xuying Zhang, Haoran Cheng, Jiahao Wang, Di Sun, Gang Pan
- Abstract summary: Multimodal entity linking (MEL) is a task aimed at linking mentions within multimodal contexts to their corresponding entities in a knowledge base (KB). Existing MEL methods often rely on mention words as retrieval cues, which limits their ability to effectively utilize information from both images and text. We propose a framework named IIER, which enhances visual feature extraction using visual prompts and leverages the pretrained Detective-VLM model to capture latent information.
- Score: 16.463229055333407
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal entity linking (MEL), a task aimed at linking mentions within multimodal contexts to their corresponding entities in a knowledge base (KB), has attracted much attention due to its wide applications in recent years. However, existing MEL methods often rely on mention words as retrieval cues, which limits their ability to effectively utilize information from both images and text. This reliance causes MEL to struggle with accurately retrieving entities in certain scenarios, especially when the focus is on image objects or mention words are missing from the text. To solve these issues, we introduce a Visual Prompts guided Multimodal Entity Linking (VP-MEL) task. Given a text-image pair, VP-MEL aims to link a marked region (i.e., visual prompt) in an image to its corresponding entities in the knowledge base. To facilitate this task, we present a new dataset, VPWiki, specifically designed for VP-MEL. Furthermore, we propose a framework named IIER, which enhances visual feature extraction using visual prompts and leverages the pretrained Detective-VLM model to capture latent information. Experimental results on the VPWiki dataset demonstrate that IIER outperforms baseline methods across multiple benchmarks for the VP-MEL task.
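To make the task format described in the abstract concrete, the following is a minimal, hypothetical sketch of a VP-MEL input/output pair and a placeholder linker. All names here (e.g., `VPMELExample`, `KBEntity`, `link_visual_prompt`) are illustrative assumptions and are not taken from the paper or the IIER codebase.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical data structures illustrating the VP-MEL setting: the query is a
# text-image pair plus a marked region (the visual prompt), and the target is
# an entity in a knowledge base. These names are illustrative only.

@dataclass
class KBEntity:
    entity_id: str     # e.g., a Wikipedia/Wikidata-style identifier
    name: str
    description: str

@dataclass
class VPMELExample:
    image_path: str                            # the input image
    text: str                                  # accompanying text (mention words may be absent)
    visual_prompt: Tuple[int, int, int, int]   # marked region as an (x1, y1, x2, y2) box
    gold_entity_id: str                        # ground-truth KB entity for the marked region

def link_visual_prompt(example: VPMELExample, kb: List[KBEntity]) -> KBEntity:
    """Toy stand-in for a VP-MEL model: score each KB entity against the
    textual context and return the best match. A real system such as IIER
    would instead use visual-prompt-guided features extracted from the
    marked image region, not this placeholder string-overlap score."""
    def score(entity: KBEntity) -> int:
        context = example.text.lower()
        return sum(1 for tok in entity.name.lower().split() if tok in context)
    return max(kb, key=score)

if __name__ == "__main__":
    kb = [
        KBEntity("Q1", "Golden Gate Bridge", "Suspension bridge in San Francisco"),
        KBEntity("Q2", "Tower Bridge", "Bascule bridge in London"),
    ]
    example = VPMELExample(
        image_path="bridge_photo.jpg",
        text="A foggy morning by the bay near the Golden Gate.",
        visual_prompt=(120, 80, 640, 400),
        gold_entity_id="Q1",
    )
    print(link_visual_prompt(example, kb).name)
```

The placeholder scoring step is where a real VP-MEL system would differ: per the abstract, IIER conditions feature extraction on the marked region itself, so entities can be retrieved even when no mention words appear in the text.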
Related papers
- Towards Text-Image Interleaved Retrieval [49.96332254241075]
We introduce the text-image interleaved retrieval (TIIR) task, where the query and document are interleaved text-image sequences.
We construct a TIIR benchmark based on naturally interleaved wikiHow tutorials, where a specific pipeline is designed to generate interleaved queries.
We propose a novel Matryoshka Multimodal Embedder (MME), which compresses the number of visual tokens at different granularities.
arXiv Detail & Related papers (2025-02-18T12:00:47Z) - Enhanced Multimodal RAG-LLM for Accurate Visual Question Answering [10.505845766495128]
Multimodal large language models (MLLMs) have made significant progress in integrating visual and textual modalities.
We propose a novel framework based on multimodal retrieval-augmented generation (RAG).
The framework introduces structured scene graphs to enhance object recognition, relationship identification, and spatial understanding within images.
arXiv Detail & Related papers (2024-12-30T13:16:08Z) - Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval [40.83470534691711]
Cross-lingual cross-modal retrieval (CCR) aims to retrieve visually relevant content based on non-English queries.
One popular approach involves utilizing machine translation (MT) to create pseudo-parallel data pairs.
We propose LECCR, a novel solution that incorporates a multi-modal large language model (MLLM) to improve the alignment between visual and non-English representations.
arXiv Detail & Related papers (2024-09-30T05:25:51Z) - EAGLE: Towards Efficient Arbitrary Referring Visual Prompts Comprehension for Multimodal Large Language Models [80.00303150568696]
We propose a novel Multimodal Large Language Model (MLLM) that empowers comprehension of arbitrary referring visual prompts with less training effort than existing approaches.
Our approach embeds referring visual prompts as spatial concepts conveying specific spatial areas comprehensible to the MLLM.
We also propose a Geometry-Agnostic Learning paradigm (GAL) to further disentangle the MLLM's region-level comprehension from the specific formats of referring visual prompts.
arXiv Detail & Related papers (2024-09-25T08:22:00Z) - Semantic Alignment for Multimodal Large Language Models [72.10272479476161]
We introduce Semantic Alignment for Multi-modal large language models (SAM).
By involving the bidirectional semantic guidance between different images in the visual-token extraction process, SAM aims to enhance the preservation of linking information for coherent analysis.
arXiv Detail & Related papers (2024-08-23T06:48:46Z) - VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large language model (MLLM).
It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z) - NoteLLM-2: Multimodal Large Representation Models for Recommendation [71.87790090964734]
Large Language Models (LLMs) have demonstrated exceptional proficiency in text understanding and embedding tasks.
Their potential in multimodal representation, particularly for item-to-item (I2I) recommendations, remains underexplored.
We propose an end-to-end fine-tuning method that customizes the integration of any existing LLMs and vision encoders for efficient multimodal representation.
arXiv Detail & Related papers (2024-05-27T03:24:01Z) - Bridging Vision and Language Spaces with Assignment Prediction [47.04855334955006]
VLAP is a novel approach that bridges pretrained vision models and large language models (LLMs).
We harness well-established word embeddings to bridge two modality embedding spaces.
VLAP achieves substantial improvements over the previous linear transformation-based approaches.
arXiv Detail & Related papers (2024-04-15T10:04:15Z) - Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want [58.091825321168514]
We introduce the Draw-and-Understand project: a new model, a multi-domain dataset, and a challenging benchmark for visual prompting.
Specifically, we propose a new end-to-end trained Multimodal Large Language Model (MLLM) that connects a vision encoder, a visual prompt encoder and an LLM.
To advance visual prompting research for MLLMs, we introduce MDVP-Data and MDVP-Bench.
arXiv Detail & Related papers (2024-03-29T16:26:20Z) - Few-Shot Relation Extraction with Hybrid Visual Evidence [3.154631846975021]
We propose a multi-modal few-shot relation extraction model (MFS-HVE).
MFS-HVE includes semantic feature extractors and multi-modal fusion components.
Experiments conducted on two public datasets demonstrate that semantic visual information significantly improves the performance of few-shot relation prediction.
arXiv Detail & Related papers (2024-03-01T18:20:11Z) - Resolving References in Visually-Grounded Dialogue via Text Generation [3.8673630752805446]
Vision-language models (VLMs) have been shown to be effective at image retrieval based on simple text queries, but text-image retrieval based on conversational input remains a challenge.
We propose fine-tuning a causal large language model (LLM) to generate definite descriptions that summarize coreferential information found in the linguistic context of references.
We then use a pretrained VLM to identify referents based on the generated descriptions, zero-shot.
arXiv Detail & Related papers (2023-09-23T17:07:54Z) - Visually-augmented pretrained language models for NLP tasks without
images [77.74849855049523]
Existing solutions often rely on explicit images for visual knowledge augmentation.
We propose a novel Visually-Augmented fine-tuning approach.
Our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales.
arXiv Detail & Related papers (2022-12-15T16:13:25Z) - Visual Named Entity Linking: A New Dataset and A Baseline [61.38231023490981]
We consider a purely Visual-based Named Entity Linking (VNEL) task, where the input only consists of an image.
We propose three different sub-tasks, i.e., visual to visual entity linking (V2VEL), visual to textual entity linking (V2TEL), and visual to visual-textual entity linking (V2VTEL).
We present a high-quality human-annotated visual person linking dataset, named WIKIPerson.
arXiv Detail & Related papers (2022-11-09T13:27:50Z) - See Finer, See More: Implicit Modality Alignment for Text-based Person
Retrieval [19.687373765453643]
We introduce an Implicit Visual-Textual (IVT) framework for text-based person retrieval.
IVT utilizes a single network to learn representation for both modalities, which contributes to the visual-textual interaction.
arXiv Detail & Related papers (2022-08-18T03:04:37Z) - Multimodal Entity Linking for Tweets [6.439761523935613]
Multimodal entity linking (MEL) is an emerging research field in which textual and visual information is used to map an ambiguous mention to an entity in a knowledge base (KB).
We propose a method for building a fully annotated Twitter dataset for MEL, where entities are defined in a Twitter KB.
Then, we propose a model for jointly learning a representation of both mentions and entities from their textual and visual contexts.
arXiv Detail & Related papers (2021-04-07T16:40:23Z)