Exploiting Pseudo Image Captions for Multimodal Summarization
- URL: http://arxiv.org/abs/2305.05496v2
- Date: Sat, 24 Feb 2024 04:23:25 GMT
- Title: Exploiting Pseudo Image Captions for Multimodal Summarization
- Authors: Chaoya Jiang, Rui Xie, Wei Ye, Jinan Sun, Shikun Zhang
- Abstract summary: Cross-modal contrastive learning in vision language pretraining faces the challenge of (partial) false negatives.
We propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, to more accurately optimize MI between an image/text anchor and its negative texts/images.
- Score: 26.033681302592207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Cross-modal contrastive learning in vision language pretraining (VLP) faces the challenge of (partial) false negatives. In this paper, we study this problem from the perspective of Mutual Information (MI) optimization. It is well known that the InfoNCE loss used in contrastive learning maximizes a lower bound on the MI between anchors and their positives, while we theoretically prove that MI involving negatives also matters when noise is commonly present. Guided by a more general lower-bound form for optimization, we propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, to more accurately optimize MI between an image/text anchor and its negative texts/images instead of improperly minimizing it. Our method performs competitively on four downstream cross-modal tasks and systematically balances the beneficial and harmful effects of (partial) false negative samples under theoretical guidance.
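To make the objective concrete, below is a minimal sketch (in PyTorch, not the authors' released code) of a symmetric InfoNCE loss with an optional weighting on negatives. For a batch of N pairs, the standard loss maximizes the bound I(X; Y) >= log N - L_InfoNCE; the neg_weights matrix is a hypothetical stand-in for the progressively refined cross-modal similarity described above.

    # Minimal sketch, assuming CLIP-style in-batch negatives; `neg_weights`
    # is a hypothetical stand-in for the paper's refined cross-modal
    # similarity, not the authors' implementation.
    import torch
    import torch.nn.functional as F

    def similarity_regulated_infonce(img_emb, txt_emb, temperature=0.07,
                                     neg_weights=None):
        """Symmetric InfoNCE over a batch of paired image/text embeddings.

        With neg_weights=None this is the standard loss, whose value
        lower-bounds MI via I(X; Y) >= log N - L_InfoNCE for batch size N.
        A weight in [0, 1] on an off-diagonal pair softens the repulsion of
        that (possibly false) negative instead of pushing it away at full
        strength.
        """
        img_emb = F.normalize(img_emb, dim=-1)
        txt_emb = F.normalize(txt_emb, dim=-1)
        logits = img_emb @ txt_emb.t() / temperature  # N x N similarities
        if neg_weights is not None:
            off_diag = ~torch.eye(logits.size(0), dtype=torch.bool,
                                  device=logits.device)
            # exp(logit + log w) = w * exp(logit): scales each negative's
            # contribution to the softmax denominator; diagonal is untouched.
            logits = logits + torch.where(off_diag,
                                          neg_weights.clamp(1e-6, 1.0).log(),
                                          torch.zeros_like(logits))
        targets = torch.arange(logits.size(0), device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

Setting all weights to 1 recovers the usual objective; how the weights are progressively refined during training is the part specific to the paper and is not reproduced here.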
Related papers
- Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning [53.766434746801366]
Multimodal contrastive learning (MCL) has shown remarkable advances in zero-shot classification by learning from millions of image-caption pairs crawled from the Internet.
Hackers may exploit image-text data for model training without authorization, and such data can include personal and privacy-sensitive information.
Recent works propose generating unlearnable examples that protect the data by adding imperceptible perturbations to training images, creating shortcuts during learning.
We propose Multi-step Error Minimization (MEM), a novel optimization process for generating multimodal unlearnable examples.
arXiv Detail & Related papers (2024-07-23T09:00:52Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation [44.851623239151124]
Cross-modal contrastive learning in vision language pretraining faces the challenge of (partial) false negatives.
We propose a contrastive learning strategy regulated by progressively refined cross-modal similarity, to more accurately optimize MI between an image/text anchor and its negative texts/images.
arXiv Detail & Related papers (2023-05-08T05:53:30Z)
- An Information Minimization Based Contrastive Learning Model for Unsupervised Sentence Embeddings Learning [19.270283247740664]
We present an information minimization based contrastive learning (InforMin-CL) model for unsupervised sentence representation learning.
We find that information minimization can be achieved by simple contrast and reconstruction objectives.
arXiv Detail & Related papers (2022-09-22T12:07:35Z)
- Robust Contrastive Learning against Noisy Views [79.71880076439297]
We propose a new contrastive loss function that is robust against noisy views.
We show that our approach provides consistent improvements over the state-of-the-art on image, video, and graph contrastive learning benchmarks.
arXiv Detail & Related papers (2022-01-12T05:24:29Z)
- Max-Margin Contrastive Learning [120.32963353348674]
We present max-margin contrastive learning (MMCL) for unsupervised representation learning.
Our approach selects negatives as the sparse support vectors obtained via a quadratic optimization problem.
We validate our approach on standard vision benchmark datasets, demonstrating better performance in unsupervised representation learning.
arXiv Detail & Related papers (2021-12-21T18:56:54Z)
- Contrastive Learning of Visual-Semantic Embeddings [4.7464518249313805]
We propose two loss functions based on normalized cross-entropy for learning joint visual-semantic embeddings.
We compare our results with existing visual-semantic embedding methods on cross-modal image-to-text and text-to-image retrieval tasks.
arXiv Detail & Related papers (2021-10-17T17:28:04Z)
- Investigating the Role of Negatives in Contrastive Representation Learning [59.30700308648194]
Noise contrastive learning is a popular technique for unsupervised representation learning.
We focus on disambiguating the role of one of its key hyperparameters: the number of negative examples.
We find that the results broadly agree with our theory, while our vision experiments are murkier, with performance sometimes even insensitive to the number of negatives (a minimal sketch of such a sweep follows this list).
arXiv Detail & Related papers (2021-06-18T06:44:16Z)
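As a companion to the last entry above, here is a small, self-contained sweep over the number of negatives k in an InfoNCE-style loss; the uniform sampling and random embeddings are illustrative assumptions, not that paper's experimental protocol.

    # Minimal sketch: observe how the InfoNCE value (and its log(k+1) MI
    # bound) moves as the number of negatives k grows; the data are random
    # and purely illustrative.
    import torch
    import torch.nn.functional as F

    def infonce_with_k_negatives(anchors, positives, pool, k, temperature=0.1):
        """Each anchor is scored against its positive plus k negatives
        sampled uniformly from a pool of candidate embeddings."""
        anchors = F.normalize(anchors, dim=-1)
        positives = F.normalize(positives, dim=-1)
        pool = F.normalize(pool, dim=-1)
        idx = torch.randint(0, pool.size(0), (anchors.size(0), k))
        pos = (anchors * positives).sum(-1, keepdim=True) / temperature     # N x 1
        neg = torch.einsum('nd,nkd->nk', anchors, pool[idx]) / temperature  # N x k
        logits = torch.cat([pos, neg], dim=1)  # positive sits at index 0
        targets = torch.zeros(anchors.size(0), dtype=torch.long)
        return F.cross_entropy(logits, targets)

    anchors, positives = torch.randn(32, 128), torch.randn(32, 128)
    pool = torch.randn(4096, 128)
    for k in (8, 64, 512):
        print(k, round(infonce_with_k_negatives(anchors, positives, pool, k).item(), 3))

Sweeping k makes the trade-off measurable: the loss and its MI bound change with k even in regimes where downstream representation quality may not.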
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.