Enhancing Multimodal Compositional Reasoning of Visual Language Models
with Generative Negative Mining
- URL: http://arxiv.org/abs/2311.03964v1
- Date: Tue, 7 Nov 2023 13:05:47 GMT
- Authors: Ugur Sahin, Hang Li, Qadeer Khan, Daniel Cremers, Volker Tresp
- Abstract summary: Large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks.
We propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities.
Our code and dataset are released at https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html.
- Score: 58.379339799777064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary large-scale visual language models (VLMs) exhibit strong
representation capacities, making them ubiquitous for enhancing image and text
understanding tasks. They are often trained in a contrastive manner on a large
and diverse corpus of images and corresponding text captions scraped from the
internet. Despite this, VLMs often struggle with compositional reasoning tasks
which require a fine-grained understanding of the complex interactions of
objects and their attributes. This failure can be attributed to two main
factors: 1) Contrastive approaches have traditionally focused on mining
negative examples from existing datasets. However, the mined negative examples
might not be difficult for the model to discriminate from the positive. An
alternative to mining would be negative sample generation. 2) Existing
generative approaches, however, primarily focus on generating hard negative
texts associated with a given image. Generating in the other direction, i.e.,
negative image samples associated with a given text, has been ignored. To
overcome both these limitations, we propose a framework that not only mines in
both directions but also generates challenging negative samples in both
modalities, i.e., images and texts. Leveraging these generative hard negative
samples, we significantly enhance VLMs' performance in tasks involving
multimodal compositional reasoning. Our code and dataset are released at
https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html.
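To make this concrete, below is a minimal sketch (in PyTorch) of how generated hard negatives in both modalities could be folded into a CLIP-style contrastive objective. The encode_image/encode_text interface, the one-hard-negative-per-pair batch layout, and the symmetric loss weighting are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_loss_with_generated_negatives(model, images, texts,
                                               neg_images, neg_texts,
                                               temperature=0.07):
    """InfoNCE-style loss where each positive image-text pair is contrasted
    against in-batch negatives plus one generated hard negative per modality.
    Illustrative sketch only; `model` is assumed to be a CLIP-like encoder pair."""
    img   = F.normalize(model.encode_image(images), dim=-1)      # (B, D) positive images
    txt   = F.normalize(model.encode_text(texts), dim=-1)        # (B, D) positive captions
    n_img = F.normalize(model.encode_image(neg_images), dim=-1)  # (B, D) generated negative images
    n_txt = F.normalize(model.encode_text(neg_texts), dim=-1)    # (B, D) generated negative captions

    # Image-to-text: candidate captions are the in-batch captions plus each
    # image's own generated hard-negative caption (appended as an extra column).
    i2t = torch.cat([img @ txt.t(), (img * n_txt).sum(-1, keepdim=True)], dim=1) / temperature
    # Text-to-image: candidate images are the in-batch images plus each
    # caption's own generated hard-negative image.
    t2i = torch.cat([txt @ img.t(), (txt * n_img).sum(-1, keepdim=True)], dim=1) / temperature

    labels = torch.arange(images.size(0), device=images.device)  # matched pair sits on the diagonal
    return 0.5 * (F.cross_entropy(i2t, labels) + F.cross_entropy(t2i, labels))
```

The appended similarity column is what makes the generated negatives "hard": the model must rank the true caption (or image) above a counterpart that differs only in a compositional detail, e.g. a swapped attribute or object relation.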
Related papers
- Conan-embedding: General Text Embedding with More and Better Negative Samples [30.571206231457932]
We propose the Conan-embedding model, which maximizes the utilization of more and higher-quality negative examples.
Our approach effectively enhances the capabilities of embedding models, currently ranking first on the Chinese leaderboard of the Massive Text Embedding Benchmark (MTEB).
arXiv Detail & Related papers (2024-08-28T11:18:06Z)
- Generating Enhanced Negatives for Training Language-Based Object Detectors [86.1914216335631]
We propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data.
Specifically, we use large language models to generate negative text descriptions, and text-to-image diffusion models to generate the corresponding negative images.
Our experimental analysis confirms the relevance of the generated negative data, and its use in language-based detectors improves performance on two complex benchmarks.
arXiv Detail & Related papers (2023-12-29T23:04:00Z)
- Negative Sample is Negative in Its Own Way: Tailoring Negative Sentences for Image-Text Retrieval [19.161248757493386]
We propose our TAiloring neGative Sentences with Discrimination and Correction (TAGS-DC) to generate synthetic sentences automatically as negative samples.
To keep the difficulty during training, we mutually improve the retrieval and generation through parameter sharing.
In experiments, we verify the effectiveness of our model on MS-COCO and Flickr30K compared with current state-of-the-art models.
arXiv Detail & Related papers (2021-11-05T09:36:41Z)
- Robust Contrastive Learning Using Negative Samples with Diminished Semantics [23.38896719740166]
We show that by generating carefully designed negative samples, contrastive learning can learn more robust representations.
We develop two methods, texture-based and patch-based augmentations, to generate negative samples.
We also analyze our method and the generated texture-based samples, showing that texture features are indispensable in classifying particular ImageNet classes.
arXiv Detail & Related papers (2021-10-27T05:38:00Z)
- Instance-wise Hard Negative Example Generation for Contrastive Learning in Unpaired Image-to-Image Translation [102.99799162482283]
We present instance-wise hard Negative Example Generation for Contrastive learning in Unpaired image-to-image Translation (NEGCUT).
Specifically, we train a generator to produce negative examples online. The generator is novel from two perspectives: 1) it is instance-wise which means that the generated examples are based on the input image, and 2) it can generate hard negative examples since it is trained with an adversarial loss.
arXiv Detail & Related papers (2021-08-10T09:44:59Z)
- Contrastive Learning with Adversarial Perturbations for Conditional Text Generation [49.055659008469284]
We propose a principled method to generate positive and negative samples for contrastive learning of seq2seq models.
Specifically, we generate negative examples by adding small perturbations to the input sequence to minimize its conditional likelihood.
We empirically show that our proposed method significantly improves the generalization of the seq2seq on three text generation tasks.
arXiv Detail & Related papers (2020-12-14T06:20:27Z)
- Video Understanding as Machine Translation [53.59298393079866]
We tackle a wide variety of downstream video understanding tasks by means of a single unified framework.
We report performance gains over the state-of-the-art on several downstream tasks, including video classification (EPIC-Kitchens), question answering (TVQA), and captioning (TVC, YouCook2, and MSR-VTT).
arXiv Detail & Related papers (2020-06-12T14:07:04Z)
- Adaptive Offline Quintuplet Loss for Image-Text Matching [102.50814151323965]
Existing image-text matching approaches typically leverage triplet loss with online hard negatives to train the model.
We propose solutions by sampling negatives offline from the whole training set.
We evaluate the proposed training approach on three state-of-the-art image-text models on the MS-COCO and Flickr30K datasets.
arXiv Detail & Related papers (2020-03-07T22:09:11Z)
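For comparison with the generative approaches above, the last entry builds on the standard margin-based triplet loss with hard negatives used by most image-text matching models. A generic sketch of that baseline (online hardest-in-batch negatives) follows; the encoders, margin value, and mining strategy are assumptions, not the quintuplet loss of the cited paper, which instead samples its negatives offline from the whole training set.

```python
import torch
import torch.nn.functional as F

def triplet_loss_hardest_in_batch(img_emb, txt_emb, margin=0.2):
    """Margin-based ranking loss for image-text matching with the hardest
    in-batch negatives (the baseline the quintuplet-loss paper starts from).
    img_emb, txt_emb: (B, D) L2-normalized embeddings of matched pairs."""
    sim = img_emb @ txt_emb.t()               # (B, B) pairwise similarities
    pos = sim.diag()                          # (B,)  matched-pair scores

    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    hardest_txt = sim.masked_fill(mask, float('-inf')).max(dim=1).values  # hardest caption per image
    hardest_img = sim.masked_fill(mask, float('-inf')).max(dim=0).values  # hardest image per caption

    loss_i2t = F.relu(margin + hardest_txt - pos).mean()
    loss_t2i = F.relu(margin + hardest_img - pos).mean()
    return loss_i2t + loss_t2i
```

Roughly speaking, offline sampling replaces the masked in-batch maximum with a lookup into a precomputed pool of hard candidates drawn from the entire training set.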
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.