PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking
- URL: http://arxiv.org/abs/2510.02726v1
- Date: Fri, 03 Oct 2025 05:09:47 GMT
- Title: PGMEL: Policy Gradient-based Generative Adversarial Network for Multimodal Entity Linking
- Authors: KM Pooja, Cheng Long, Aixin Sun
- Abstract summary: We propose a policy gradient-based generative adversarial network for multimodal entity linking (PGMEL). Experimental results on the Wiki-MEL, Richpedia-MEL and WikiDiverse datasets demonstrate that PGMEL learns meaningful representations by selecting challenging negative samples.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of entity linking, which involves associating mentions with their respective entities in a knowledge graph, has received significant attention due to its numerous potential applications. Recently, various multimodal entity linking (MEL) techniques have been proposed, which aim to learn comprehensive embeddings by leveraging both text and vision modalities. The selection of high-quality negative samples can play a crucial role in metric/representation learning. However, to the best of our knowledge, this possibility remains unexplored in the existing MEL literature. To fill this gap, we address the multimodal entity linking problem in a generative adversarial setting, where the generator is responsible for generating high-quality negative samples and the discriminator handles the metric-learning task. Since the generator produces samples through a discrete process, we optimize it using policy gradient techniques and propose a policy gradient-based generative adversarial network for multimodal entity linking (PGMEL). Experimental results on the Wiki-MEL, Richpedia-MEL and WikiDiverse datasets demonstrate that PGMEL learns meaningful representations by selecting challenging negative samples and outperforms state-of-the-art methods.
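The adversarial loop the abstract describes can be illustrated with a minimal REINFORCE sketch. Everything here is a hypothetical stand-in: four candidate negative entities, a fixed `difficulty` oracle playing the role of the discriminator's metric-learning feedback, and a softmax policy over candidates. The point is only that the discrete sampling step is optimized with a policy gradient, which is the mechanism PGMEL relies on; this is not the paper's actual model.

```python
import math
import random

random.seed(0)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical setup: one mention, four candidate negative entities.
# `difficulty` stands in for the discriminator's feedback (how hard a
# negative is); in PGMEL this reward would come from the metric-learning loss.
gen_scores = [0.0, 0.0, 0.0, 0.0]   # generator's logits over candidates
difficulty = [0.1, 0.9, 0.2, 0.8]
lr = 0.5

for _ in range(200):
    probs = softmax(gen_scores)
    a = random.choices(range(4), weights=probs)[0]   # discrete sampling step
    reward = difficulty[a]
    baseline = sum(p * d for p, d in zip(probs, difficulty))
    # REINFORCE: the gradient of log pi(a) w.r.t. score_i is (1[i == a] - probs[i])
    for i in range(4):
        grad = ((1.0 if i == a else 0.0) - probs[i]) * (reward - baseline)
        gen_scores[i] += lr * grad

probs = softmax(gen_scores)
hardest = max(range(4), key=lambda i: probs[i])
print(hardest)  # the policy concentrates on the hard negatives (indices 1 and 3)
```

After a few hundred updates the generator's distribution shifts toward the candidates the discriminator finds hardest, which is exactly the "challenging negative samples" behavior the abstract claims.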
Related papers
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z) - Insight-A: Attribution-aware for Multimodal Misinformation Detection [14.02125134424451]
We present Insight-A, exploring attribution with MLLM insights for detecting multimodal misinformation. We devise cross-attribution prompting (CAP) to model the sophisticated correlations between perception and reasoning. We also design image captioning (IC) to capture visual details for enhancing cross-modal consistency checking.
arXiv Detail & Related papers (2025-11-17T02:33:36Z) - Libra-MIL: Multimodal Prototypes Stereoscopic Infused with Task-specific Language Priors for Few-shot Whole Slide Image Classification [18.928408687991368]
Large Language Models (LLMs) are emerging as a promising direction in computational pathology. Existing vision-language Multi-Instance Learning (MIL) methods often employ unidirectional guidance. We introduce a novel approach, Multimodal Prototype-based Multi-Instance Learning, that promotes bidirectional interaction.
arXiv Detail & Related papers (2025-11-11T07:46:38Z) - Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization [80.09112808413133]
Mujica pairs a planner, which decomposes questions into an acyclic graph of subquestions, with a worker that resolves them via retrieval and reasoning. MyGO is a novel reinforcement learning method that replaces traditional policy-gradient updates with Maximum Likelihood Estimation. Empirical results across multiple datasets demonstrate the effectiveness of Mujica-MyGO in enhancing multi-hop QA performance.
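The idea of swapping policy-gradient updates for likelihood maximization, as the MyGO summary describes, can be sketched on a toy problem: sample trajectories from the current policy, keep only the successful ones, and fit those by maximum likelihood (a cross-entropy step). The three-action setup and the notion of "success" below are invented for illustration and are not the paper's actual training setup.

```python
import math
import random

random.seed(1)

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical toy task: three discrete "answers"; only answer 2 is correct.
scores = [0.0, 0.0, 0.0]   # policy logits
lr = 0.3

for _ in range(300):
    # Sample a small batch of trajectories from the current policy ...
    batch = random.choices(range(3), weights=softmax(scores), k=8)
    # ... keep only the successful ones, then fit them by maximum likelihood
    # (a cross-entropy step) instead of a policy-gradient update.
    for a in (ans for ans in batch if ans == 2):
        probs = softmax(scores)
        for i in range(3):
            scores[i] += lr * ((1.0 if i == a else 0.0) - probs[i])

final = softmax(scores)
print(final)  # probability mass concentrates on the correct answer
```

Filtering before fitting removes the need for a reward-weighted gradient: every retained sample is treated as a supervised target, which is the "minimalist" substitution the summary refers to.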
arXiv Detail & Related papers (2025-05-20T18:33:03Z) - Robust Multi-View Learning via Representation Fusion of Sample-Level Attention and Alignment of Simulated Perturbation [61.64052577026623]
Real-world multi-view datasets are often heterogeneous and imperfect. We propose a novel robust MVL method (namely RML) with simultaneous representation fusion and alignment. Our RML is self-supervised and can also be applied to downstream tasks as a regularization.
arXiv Detail & Related papers (2025-03-06T07:01:08Z) - Enhancing Multimodal Entity Linking with Jaccard Distance-based Conditional Contrastive Learning and Contextual Visual Augmentation [37.22528391940295]
We propose JD-CCL (Jaccard Distance-based Conditional Contrastive Learning), a novel approach that enhances the matching ability of multimodal entity linking models. To address the limitations caused by variations within the visual modality among mentions and entities, we introduce a novel method, CVaCPT (Contextual Visual-aid Controllable Patch Transform).
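A minimal sketch of the Jaccard-distance idea, assuming (hypothetically) that mention contexts and entity descriptions are compared as token sets: the wrong candidate with the smallest Jaccard distance to the mention context is the most similar confusable entity, and so is treated as the hard negative. The entity names and descriptions below are invented; the paper's actual conditioning scheme is more involved.

```python
# Hypothetical sketch: mention context and entity descriptions as token sets.

def jaccard_distance(a, b):
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

mention_ctx = "striker scored for barcelona in the league".split()

# Invented candidate entities with short invented descriptions.
candidates = {
    "FC_Barcelona": "spanish football club barcelona league side".split(),
    "Barcelona_city": "barcelona is a city in catalonia spain".split(),
    "Mount_Everest": "everest is the highest mountain on earth".split(),
}
gold = "FC_Barcelona"

# Distance from the mention context to every *wrong* candidate.
dists = {e: jaccard_distance(mention_ctx, toks)
         for e, toks in candidates.items() if e != gold}

# Smallest distance = most similar wrong entity = hardest negative.
hard_negative = min(dists, key=dists.get)
print(hard_negative, dists[hard_negative])
```

Here the city entity, sharing surface tokens with the mention, out-ranks the unrelated mountain as the hard negative, which is the kind of challenging sample a contrastive loss benefits from.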
arXiv Detail & Related papers (2025-01-24T01:35:10Z) - A Mutual Information Perspective on Multiple Latent Variable Generative Models for Positive View Generation [38.44755687012033]
In image generation, Multiple Latent Variable Generative Models (MLVGMs) employ multiple latent variables to gradually shape the final images. We propose a novel framework that quantifies the contribution of each latent variable using Mutual Information (MI) as a metric. By leveraging the hierarchical and disentangled variables of MLVGMs, our approach produces diverse and semantically meaningful views without the need for real image data.
arXiv Detail & Related papers (2025-01-23T14:46:38Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M datasets, and define two real practical instance-level retrieval tasks.
We then train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z) - WikiDiverse: A Multimodal Entity Linking Dataset with Diversified Contextual Topics and Entity Types [25.569170440376165]
Multimodal Entity Linking (MEL) aims at linking mentions with multimodal contexts to referent entities from a knowledge base (e.g., Wikipedia).
We present WikiDiverse, a high-quality human-annotated MEL dataset with diversified contextual topics and entity types from Wikinews.
Based on WikiDiverse, a sequence of well-designed MEL models with intra-modality and inter-modality attentions are implemented.
arXiv Detail & Related papers (2022-04-13T12:52:40Z) - A Multi-Semantic Metapath Model for Large Scale Heterogeneous Network Representation Learning [52.83948119677194]
We propose a multi-semantic metapath (MSM) model for large-scale heterogeneous network representation learning.
Specifically, we generate multi-semantic metapath-based random walks to construct the heterogeneous neighborhood to handle the unbalanced distributions.
We conduct systematic evaluations of the proposed framework on two challenging datasets: Amazon and Alibaba.
arXiv Detail & Related papers (2020-07-19T22:50:20Z)
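The metapath-based random walks in the MSM summary can be sketched on a toy heterogeneous graph. The graph, the node-type encoding, and the "ui" (user-item) metapath below are all hypothetical; the sketch only shows the core mechanism, namely constraining each step of the walk to the next node type in the metapath.

```python
import random

random.seed(2)

# Hypothetical tiny heterogeneous graph: users ("u*"), items ("i*"),
# brands ("b*"); the first character of a node id encodes its type.
edges = {
    "u1": ["i1", "i2"], "u2": ["i2"],
    "i1": ["u1", "b1"], "i2": ["u1", "u2", "b1"],
    "b1": ["i1", "i2"],
}

def node_type(n):
    return n[0]

def metapath_walk(start, metapath, length):
    """Random walk whose steps follow the node types in `metapath` cyclically."""
    walk = [start]
    for step in range(length - 1):
        want = metapath[(step + 1) % len(metapath)]
        nbrs = [n for n in edges[walk[-1]] if node_type(n) == want]
        if not nbrs:
            break   # no neighbor of the required type: stop the walk
        walk.append(random.choice(nbrs))
    return walk

# A user-item ("ui") metapath yields walks alternating users and items.
walk = metapath_walk("u1", metapath="ui", length=5)
print(walk)
```

Walks generated under different metapaths capture different semantics of the same graph (user-item-user vs. item-brand-item, say), which is what "multi-semantic" refers to in the summary.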