Enhancing Multimodal Entity Linking with Jaccard Distance-based Conditional Contrastive Learning and Contextual Visual Augmentation
- URL: http://arxiv.org/abs/2501.14166v1
- Date: Fri, 24 Jan 2025 01:35:10 GMT
- Title: Enhancing Multimodal Entity Linking with Jaccard Distance-based Conditional Contrastive Learning and Contextual Visual Augmentation
- Authors: Cong-Duy Nguyen, Xiaobao Wu, Thong Nguyen, Shuai Zhao, Khoi Le, Viet-Anh Nguyen, Feng Yichao, Anh Tuan Luu
- Abstract summary: We propose JD-CCL (Jaccard Distance-based Conditional Contrastive Learning), a novel approach to enhance the matching ability of multimodal entity linking models. To address the limitations caused by variations within the visual modality among mentions and entities, we introduce a novel method, CVaCPT (Contextual Visual-aid Controllable Patch Transform).
- Score: 37.22528391940295
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Previous research on multimodal entity linking (MEL) has primarily employed contrastive learning as the main objective. However, by using the rest of the batch as negative samples without careful consideration, these studies risk exploiting easy features and overlooking the essential details that make entities unique. In this work, we propose JD-CCL (Jaccard Distance-based Conditional Contrastive Learning), a novel approach designed to enhance the matching ability of multimodal entity linking models. JD-CCL leverages meta-information to select negative samples with similar attributes, making the linking task more challenging and robust. Additionally, to address the limitations caused by variations within the visual modality among mentions and entities, we introduce a novel method, CVaCPT (Contextual Visual-aid Controllable Patch Transform). It enhances visual representations by incorporating multi-view synthetic images and contextual textual representations to scale and shift patch representations. Experimental results on benchmark MEL datasets demonstrate the strong effectiveness of our approach.
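The core idea of JD-CCL, selecting negatives whose meta-attributes are close to the anchor's under Jaccard distance so the contrastive task cannot be solved with easy features, can be illustrated with a minimal sketch. The function names, attribute pool, and cut-off k below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of Jaccard distance-based hard-negative selection
# for a contrastive objective (names and values are illustrative).
from typing import Dict, List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; 0 means identical attribute sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def select_hard_negatives(
    anchor_attrs: Set[str],
    candidates: Dict[str, Set[str]],  # entity_id -> meta-attribute set
    positive_id: str,
    k: int = 8,
) -> List[str]:
    """Pick the k candidate entities whose meta-attributes are most similar
    to the anchor's (smallest Jaccard distance), excluding the gold entity.
    Such negatives are harder to separate than random in-batch negatives."""
    scored = [
        (jaccard_distance(anchor_attrs, attrs), eid)
        for eid, attrs in candidates.items()
        if eid != positive_id
    ]
    scored.sort(key=lambda t: t[0])
    return [eid for _, eid in scored[:k]]


# Example: the hardest negatives share "person" and "athlete" attributes
# with the anchor rather than being unrelated entities.
anchor = {"person", "athlete", "basketball"}
pool = {
    "e1": {"person", "athlete", "soccer"},
    "e2": {"building", "stadium"},
    "e3": {"person", "athlete", "basketball", "coach"},
}
print(select_hard_negatives(anchor, pool, positive_id="gold"))  # ['e3', 'e1', 'e2']
```

The selected entity IDs would then serve as the negatives in the contrastive loss, replacing the uncurated in-batch negatives the abstract criticizes.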
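CVaCPT's scaling and shifting of patch representations conditioned on contextual textual representations can be read as a FiLM-style modulation. The sketch below is a hedged reading in that spirit; the module name, dimensions, and pooled context vector are assumptions, not the paper's actual architecture.

```python
# Hypothetical FiLM-style modulation of visual patch tokens by a pooled
# textual context vector (an assumed reading of "scale and shift patch
# representations", not the authors' implementation).
import torch
import torch.nn as nn


class ContextualPatchModulation(nn.Module):
    def __init__(self, patch_dim: int = 768, text_dim: int = 768):
        super().__init__()
        # Predict per-channel scale (gamma) and shift (beta) from the context.
        self.to_scale_shift = nn.Linear(text_dim, 2 * patch_dim)

    def forward(self, patches: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # patches: (batch, num_patches, patch_dim) visual patch tokens
        # context: (batch, text_dim) pooled contextual textual representation
        gamma, beta = self.to_scale_shift(context).chunk(2, dim=-1)
        return patches * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)


# Usage: modulate ViT patch tokens (e.g., from the mention image and its
# multi-view synthetic variants) with the mention's textual context.
mod = ContextualPatchModulation()
patches = torch.randn(2, 196, 768)
context = torch.randn(2, 768)
out = mod(patches, context)  # (2, 196, 768)
```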
Related papers
- XR-VLM: Cross-Relationship Modeling with Multi-part Prompts and Visual Features for Fine-Grained Recognition [20.989787824067143]
XR-VLM is a novel mechanism to discover subtle differences by modeling cross-relationships.
We develop a multi-part prompt learning module to capture multi-perspective descriptions.
Our method achieves significant improvements compared to current state-of-the-art approaches.
arXiv Detail & Related papers (2025-03-10T08:58:05Z) - Embracing Diversity: A Multi-Perspective Approach with Soft Labels [3.529000007777341]
We propose a new framework for designing perspective-aware models on the stance detection task, in which multiple annotators assign stances on a controversial topic.
Results show that the multi-perspective approach yields better classification performance (higher F1-scores).
arXiv Detail & Related papers (2025-03-01T13:33:38Z) - Contextual Interaction via Primitive-based Adversarial Training For Compositional Zero-shot Learning [23.757252768668497]
Compositional Zero-shot Learning (CZSL) aims to identify novel compositions via known attribute-object pairs.
The primary challenge in CZSL tasks lies in the significant discrepancies introduced by the complex interaction between the visual primitives of attribute and object.
We propose a model-agnostic and Primitive-Based Adversarial training (PBadv) method to deal with this problem.
arXiv Detail & Related papers (2024-06-21T08:18:30Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - C-ICL: Contrastive In-context Learning for Information Extraction [54.39470114243744]
c-ICL is a novel few-shot technique that leverages both correct and incorrect sample constructions to create in-context learning demonstrations.
Our experiments on various datasets indicate that c-ICL outperforms previous few-shot in-context learning methods.
arXiv Detail & Related papers (2024-02-17T11:28:08Z) - Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning [19.28860833813788]
Existing models commonly train a visual encoder with weak cross-modal supervision signals.
We propose a novel Visually-Asymmetric coNsistenCy Learning (VANCL) approach to capture fine-grained visual and layout features.
arXiv Detail & Related papers (2023-10-23T10:37:22Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training [88.80694147730883]
We investigate a variety of Modality-Shared Contrastive Language-Image Pre-training (MS-CLIP) frameworks.
Under the studied conditions, we observe that a mostly unified encoder for vision and language signals outperforms all other variations that separate more parameters.
Our approach outperforms vanilla CLIP by 1.6 points in linear probing on a collection of 24 downstream vision tasks.
arXiv Detail & Related papers (2022-07-26T05:19:16Z) - Entity-Graph Enhanced Cross-Modal Pretraining for Instance-level Product Retrieval [152.3504607706575]
This research aims to conduct weakly-supervised multi-modal instance-level product retrieval for fine-grained product categories.
We first contribute the Product1M dataset and define two practical, real-world instance-level retrieval tasks.
We then train a more effective cross-modal model that adaptively incorporates key concept information from the multi-modal data.
arXiv Detail & Related papers (2022-06-17T15:40:45Z)