MAFA: Managing False Negatives for Vision-Language Pre-training
- URL: http://arxiv.org/abs/2312.06112v2
- Date: Thu, 13 Jun 2024 00:36:31 GMT
- Title: MAFA: Managing False Negatives for Vision-Language Pre-training
- Authors: Jaeseok Byun, Dohoon Kim, Taesup Moon
- Abstract summary: We consider a critical issue of false negatives in Vision-Language Pre-training.
The presence of false negatives can impede achieving optimal performance and even lead to a significant performance drop.
We propose MAFA (MAnaging FAlse negatives), which consists of two pivotal components building upon the recently developed GRouped mIni-baTch sampling (GRIT) strategy.
- Score: 17.836155361629718
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We consider a critical issue of false negatives in Vision-Language Pre-training (VLP), a challenge that arises from the inherent many-to-many correspondence of image-text pairs in large-scale web-crawled datasets. The presence of false negatives can impede achieving optimal performance and even lead to a significant performance drop. To address this challenge, we propose MAFA (MAnaging FAlse negatives), which consists of two pivotal components building upon the recently developed GRouped mIni-baTch sampling (GRIT) strategy: 1) an efficient connection mining process that identifies and converts false negatives into positives, and 2) label smoothing for the image-text contrastive (ITC) loss. Our comprehensive experiments verify the effectiveness of MAFA across multiple downstream tasks, emphasizing the crucial role of addressing false negatives in VLP, potentially even surpassing the importance of addressing false positives. In addition, the compatibility of MAFA with the recent BLIP-family model is also demonstrated. Code is available at https://github.com/jaeseokbyun/MAFA.
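The second component above, label smoothing for the image-text contrastive (ITC) loss, can be combined with the first (mined false negatives converted to positives) in a short sketch. This is an illustrative reconstruction under stated assumptions, not MAFA's actual implementation: the function name, the toy similarity matrix, and the smoothing weight `eps` are all made up for the example.

```python
import numpy as np

def itc_loss_label_smoothed(sim, positive_mask, eps=0.1):
    """Image-to-text contrastive loss with label smoothing.

    sim           : (B, B) similarity logits between images (rows) and texts (cols).
    positive_mask : (B, B) boolean; True where a text is a positive for the image,
                    i.e. the diagonal plus any mined false negatives.
    eps           : label-smoothing mass spread uniformly over the batch.
    """
    B = sim.shape[0]
    # Smoothed targets: (1 - eps) mass shared among the positives of each row,
    # eps mass spread uniformly over all B candidates.
    pos_counts = positive_mask.sum(axis=1, keepdims=True)
    targets = (1.0 - eps) * positive_mask / pos_counts + eps / B
    # Numerically stable log-softmax per row, then cross-entropy vs. targets.
    logits = sim - sim.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -(targets * log_probs).sum(axis=1).mean()

# Toy batch of 3 pairs; connection mining found that text 2 also matches image 0.
sim = np.array([[5.0, 1.0, 4.5],
                [0.5, 6.0, 1.0],
                [4.0, 0.0, 5.5]])
mask = np.eye(3, dtype=bool)
mask[0, 2] = True  # former "negative" converted into a positive
loss = itc_loss_label_smoothed(sim, mask)
```

Relabeling the mined pair lowers the penalty the model would otherwise pay for (correctly) scoring that image-text pair highly.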
Related papers
- From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model [29.879983760203256]
Multimodal Large Language Models (MLLMs) have emerged as a promising solution for universal embedding tasks, but adapting their generative nature to discriminative representation learning remains a significant challenge. We propose an efficient framework for universal multimodal embeddings, which bridges this gap by centering on two synergistic components.
arXiv Detail & Related papers (2025-08-01T07:31:24Z) - Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models [9.682523487279976]
Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks. Existing methods primarily fine-tune the model by generating text-based hard negative samples. AHNPL translates text-based hard negatives into the visual domain to generate semantically disturbed image-based negatives for training the model.
arXiv Detail & Related papers (2025-05-21T14:28:43Z) - FALCON: False-Negative Aware Learning of Contrastive Negatives in Vision-Language Pretraining [5.200545764106177]
We propose FALCON, a learning-based mini-batch construction strategy that balances the trade-off between hard and false negatives. FALCON employs a negative mining scheduler that dynamically selects negative samples of appropriate hardness for each anchor instance during mini-batch construction.
arXiv Detail & Related papers (2025-05-16T12:50:05Z) - Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation [65.92001420372007]
This paper systematically evaluates state-of-the-art MLLMs across diverse benchmarks.
We introduce the first benchmark GaslightingBench, specifically designed to evaluate the vulnerability of MLLMs to negation arguments.
arXiv Detail & Related papers (2025-01-31T10:37:48Z) - Dissecting Misalignment of Multimodal Large Language Models via Influence Function [12.832792175138241]
We introduce the Extended Influence Function for Contrastive Loss (ECIF), an influence function crafted for contrastive loss.
ECIF considers both positive and negative samples and provides a closed-form approximation of contrastive learning models.
Building upon ECIF, we develop a series of algorithms for data evaluation in MLLM, misalignment detection, and misprediction trace-back tasks.
arXiv Detail & Related papers (2024-11-18T15:45:41Z) - AMRFact: Enhancing Summarization Factuality Evaluation with AMR-Driven Negative Samples Generation [57.8363998797433]
We propose AMRFact, a framework that generates perturbed summaries using Abstract Meaning Representations (AMRs).
Our approach parses factually consistent summaries into AMR graphs and injects controlled factual inconsistencies to create negative examples, allowing for coherent factually inconsistent summaries to be generated with high error-type coverage.
arXiv Detail & Related papers (2023-11-16T02:56:29Z) - Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining [58.379339799777064]
Large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks.
We propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities.
Our code and dataset are released at https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html.
arXiv Detail & Related papers (2023-08-08T16:31:43Z) - Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination [62.18768931714238]
We propose a novel False Negative Elimination (FNE) strategy to select negatives via sampling.
The results demonstrate the superiority of our proposed false negative elimination strategy.
arXiv Detail & Related papers (2023-08-08T16:31:43Z) - Improved Probabilistic Image-Text Representations [20.00929281001257]
Image-Text Matching (ITM) task, a fundamental vision-language (VL) task, suffers from the inherent ambiguity arising from multiplicity and imperfect annotations.
This paper presents an improved Probabilistic Cross-Modal Embeddings (named PCME++) by introducing a new probabilistic distance with a closed-form solution.
The robustness of PCME++ is also evaluated under noisy image-text correspondences.
arXiv Detail & Related papers (2023-05-29T16:02:09Z) - Exploiting Pseudo Image Captions for Multimodal Summarization [26.033681302592207]
Cross-modal contrastive learning in vision language pretraining faces the challenge of (partial) false negatives.
We propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, to more accurately optimize mutual information (MI) between an image/text anchor and its negative texts/images.
arXiv Detail & Related papers (2023-05-09T14:47:25Z) - Language Model Pre-training on True Negatives [109.73819321246062]
Discriminative pre-trained language models (PLMs) learn to predict original texts from intentionally corrupted ones.
Existing PLMs simply treat all corrupted texts as equal negative samples without any examination.
We design enhanced pre-training methods to counteract false negative predictions and encourage pre-training language models on true negatives.
arXiv Detail & Related papers (2022-12-01T12:24:19Z) - GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training [47.95914618851596]
We show that two routinely applied steps during pre-training have crucial impact on the performance of the pre-trained model.
We propose a new vision and language pre-training method, which adaptively samples mini-batches for more effective mining of hard negative samples for ITM.
Our method achieves a new state-of-the-art performance on various downstream tasks with much less computational cost.
arXiv Detail & Related papers (2022-08-08T11:15:45Z) - Relation-aware Graph Attention Model With Adaptive Self-adversarial Training [29.240686573485718]
This paper describes an end-to-end solution for the relationship prediction task in heterogeneous, multi-relational graphs.
We particularly address two building blocks in the pipeline, namely heterogeneous graph representation learning and negative sampling.
We introduce a parameter-free negative sampling technique -- adaptive self-adversarial (ASA) negative sampling.
arXiv Detail & Related papers (2021-02-14T16:11:56Z) - Towards Overcoming False Positives in Visual Relationship Detection [95.15011997876606]
We investigate the cause of the high false positive rate in Visual Relationship Detection (VRD).
This paper presents Spatially-Aware Balanced negative pRoposal sAmpling (SABRA) as a robust VRD framework that alleviates the influence of false positives.
arXiv Detail & Related papers (2020-12-23T06:28:00Z) - Simplify and Robustify Negative Sampling for Implicit Collaborative Filtering [42.832851785261894]
In this paper, we first provide a novel understanding of negative instances by empirically observing that only a few instances are potentially important for model learning.
We then tackle the untouched false negative problem by favouring high-variance samples stored in memory.
Empirical results on two synthetic datasets and three real-world datasets demonstrate both the robustness and the superiority of our negative sampling method.
arXiv Detail & Related papers (2020-09-07T19:08:26Z)
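The variance-based idea in the last entry above can be sketched generically: among candidate negatives, favour items whose recent model scores are high but unstable, since false negatives tend to score high consistently while merely hard true negatives fluctuate. This is an illustrative sketch, not the paper's actual algorithm; the function name, the variance bonus `alpha`, and the toy score histories are assumptions.

```python
import statistics

def pick_negative(score_history, alpha=1.0):
    """Choose a negative item id from a memory of recent model scores.

    score_history : dict mapping item id -> list of scores from recent steps.
    Ranks items by mean score plus a variance bonus, so high-variance hard
    negatives outrank consistently high-scoring (likely false) negatives.
    """
    def key(item):
        scores = score_history[item]
        return statistics.fmean(scores) + alpha * statistics.pvariance(scores)
    return max(score_history, key=key)

history = {
    "likely_false_negative": [0.92, 0.93, 0.91],  # high but stable scores
    "hard_true_negative":    [0.55, 0.95, 0.40],  # high-ish and unstable
    "easy_negative":         [0.10, 0.12, 0.08],
}
choice = pick_negative(history, alpha=6.0)  # -> "hard_true_negative"
```

With a large enough variance bonus, the unstable hard negative outranks the stable high scorer, which is the behaviour the false-negative-avoidance heuristic is after.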
This list is automatically generated from the titles and abstracts of the papers in this site.