Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples
- URL: http://arxiv.org/abs/2403.02875v2
- Date: Mon, 5 Aug 2024 14:01:26 GMT
- Title: Enhancing Conceptual Understanding in Multimodal Contrastive Learning through Hard Negative Samples
- Authors: Philipp J. Rösch, Norbert Oswald, Michaela Geierhos, Jindřich Libovický
- Abstract summary: We introduce a novel pretraining method incorporating synthetic hard negative text examples.
The hard negatives permute terms corresponding to visual concepts, leading to a more fine-grained visual and textual concept alignment.
We also introduce InpaintCOCO, a new dataset for assessing the fine-grained alignment of colors, objects, and sizes in vision-language models.
- Score: 0.6249768559720122
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current multimodal models leveraging contrastive learning often face limitations in developing fine-grained conceptual understanding. This is due to random negative samples during pretraining, causing almost exclusively very dissimilar concepts to be compared in the loss function. Consequently, the models struggle with fine-grained semantic differences. To address this problem, we introduce a novel pretraining method incorporating synthetic hard negative text examples. The hard negatives permute terms corresponding to visual concepts, leading to a more fine-grained visual and textual concept alignment. Further, we introduce InpaintCOCO, a new challenging dataset for assessing the fine-grained alignment of colors, objects, and sizes in vision-language models. We created the dataset using generative inpainting from COCO images by changing the visual concepts so that the images no longer match their original captions. Our results show significant improvements in fine-grained concept understanding across a wide range of vision-language datasets, including our InpaintCOCO dataset.
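The method's two ingredients, synthetic hard-negative captions obtained by permuting visual-concept terms and a contrastive loss that scores them alongside in-batch negatives, can be illustrated with a short sketch. This is a minimal illustration under assumed names: the color vocabulary, `make_hard_negative`, and the CLIP-style `contrastive_loss_with_hard_negatives` are assumptions for the example, not the authors' code.

```python
import random

import torch
import torch.nn.functional as F

# Hypothetical concept vocabulary for the sketch; the paper permutes terms
# for visual concepts such as colors, objects, and sizes in the caption.
COLORS = ["red", "blue", "green", "yellow", "black", "white"]


def make_hard_negative(caption: str) -> str:
    """Swap one color word in the caption for a different color.

    A minimal stand-in for the paper's concept permutation: the result is
    textually close to the original caption but no longer matches the image.
    """
    tokens = caption.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in COLORS:
            tokens[i] = random.choice([c for c in COLORS if c != tok.lower()])
            return " ".join(tokens)
    return caption  # no swappable concept found; a caller may skip this sample


def contrastive_loss_with_hard_negatives(img_emb, txt_emb, hard_txt_emb,
                                          temperature=0.07):
    """Image-to-text InfoNCE-style loss with extra hard-negative captions.

    img_emb:      (B, D) image embeddings
    txt_emb:      (B, D) embeddings of the matching captions
    hard_txt_emb: (B, D) embeddings of the permuted (hard-negative) captions
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    hard_txt_emb = F.normalize(hard_txt_emb, dim=-1)

    # Similarity of each image to all in-batch captions (diagonal = positives).
    logits_pos = img_emb @ txt_emb.t()                                # (B, B)
    # Similarity of each image to its own permuted caption (hard negative).
    logits_hard = (img_emb * hard_txt_emb).sum(dim=-1, keepdim=True)  # (B, 1)
    logits = torch.cat([logits_pos, logits_hard], dim=1) / temperature

    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, targets)
```

For example, make_hard_negative("a red car parked on the street") might return "a blue car parked on the street"; during pretraining such a caption serves only as an additional negative for its paired image, forcing the model to attend to the permuted concept rather than to coarse topic differences.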
Related papers
- A Simple Graph Contrastive Learning Framework for Short Text Classification [23.36436403062214]
We propose a Simple graph contrastive learning framework for Short Text Classification (SimSTC).
Our method eliminates the need for data augmentation operations to generate contrastive views while still leveraging the benefits of multi-view contrastive learning.
Despite its simplicity, our model achieves outstanding performance, surpassing large language models on various datasets.
arXiv Detail & Related papers (2025-01-16T00:35:56Z)
- A New Method to Capturing Compositional Knowledge in Linguistic Space [0.0]
ZS-CU is a novel task that enhances compositional understanding without requiring hard negative training data.
We propose YUKINO, which uses textual inversion to map unlabeled images to pseudo-tokens in a pre-trained CLIP model.
YUKINO outperforms the existing multi-modal SOTA models by over 8% on the SugarCREPE benchmark.
arXiv Detail & Related papers (2024-12-20T07:48:09Z)
- Non-confusing Generation of Customized Concepts in Diffusion Models [135.4385383284657]
We tackle the common challenge of inter-concept visual confusion in compositional concept generation using text-guided diffusion models (TGDMs).
Existing customized generation methods only focus on fine-tuning the second stage while overlooking the first one.
We propose a simple yet effective solution called CLIF: contrastive image-language fine-tuning.
arXiv Detail & Related papers (2024-05-11T05:01:53Z)
- Continual Contrastive Spoken Language Understanding [33.09005399967931]
COCONUT is a class-incremental learning (CIL) method that relies on the combination of experience replay and contrastive learning.
We show that COCONUT can be combined with methods that operate on the decoder side of the model, resulting in further metrics improvements.
arXiv Detail & Related papers (2023-10-04T10:09:12Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Multi-Modal Representation Learning with Text-Driven Soft Masks [48.19806080407593]
We propose a visual-linguistic representation learning approach within a self-supervised learning framework.
We generate diverse features for the image-text matching (ITM) task by soft-masking regions in an image.
We identify the regions relevant to each word by computing word-conditional visual attention with a multi-modal encoder.
arXiv Detail & Related papers (2023-04-03T05:07:49Z)
- A Simple Long-Tailed Recognition Baseline via Vision-Language Model [92.2866546058082]
The visual world naturally exhibits a long-tailed distribution of open classes, which poses great challenges to modern visual systems.
Recent advances in contrastive visual-language pretraining shed light on a new pathway for visual recognition.
We propose BALLAD to leverage contrastive vision-language models for long-tailed recognition.
arXiv Detail & Related papers (2021-11-29T17:49:24Z)
- Contrastive Learning of Visual-Semantic Embeddings [4.7464518249313805]
We propose two loss functions based on normalized cross-entropy to perform the task of learning joint visual-semantic embedding.
We compare our results with existing visual-semantic embedding methods on cross-modal image-to-text and text-to-image retrieval tasks.
arXiv Detail & Related papers (2021-10-17T17:28:04Z)
- Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
- Co-Grounding Networks with Semantic Attention for Referring Expression Comprehension in Videos [96.85840365678649]
We tackle the problem of referring expression comprehension in videos with an elegant one-stage framework.
We enhance the single-frame grounding accuracy by semantic attention learning and improve the cross-frame grounding consistency.
Our model is also applicable to referring expression comprehension in images, illustrated by the improved performance on the RefCOCO dataset.
arXiv Detail & Related papers (2021-03-23T06:42:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.