Correlation Information Bottleneck: Towards Adapting Pretrained
Multimodal Models for Robust Visual Question Answering
- URL: http://arxiv.org/abs/2209.06954v3
- Date: Sat, 6 May 2023 17:35:48 GMT
- Title: Correlation Information Bottleneck: Towards Adapting Pretrained
Multimodal Models for Robust Visual Question Answering
- Authors: Jingjing Jiang, Ziyi Liu, Nanning Zheng
- Abstract summary: Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations.
We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
- Score: 63.87200781247364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Benefiting from large-scale pretrained vision language models (VLMs), the
performance of visual question answering (VQA) has approached human oracles.
However, finetuning such models on limited data often suffers from overfitting
and poor generalization issues, leading to a lack of model robustness. In this
paper, we aim to improve input robustness from an information bottleneck
perspective when adapting pretrained VLMs to the downstream VQA task. Input
robustness refers to the ability of models to defend against visual and
linguistic input variations, as well as shortcut learning involved in inputs.
Generally, the representations obtained by pretrained VLMs inevitably contain
irrelevant and redundant information for a specific downstream task, resulting
in statistically spurious correlations and insensitivity to input variations.
To encourage representations to converge to a minimal sufficient statistic in
multimodal learning, we propose Correlation Information Bottleneck (CIB), which
seeks a tradeoff between compression and redundancy in representations by
minimizing the mutual information (MI) between inputs and representations while
maximizing the MI between outputs and representations. Moreover, we derive a
tight theoretical upper bound for the mutual information between multimodal
inputs and representations, incorporating different internal correlations that
guide models to learn more robust representations and facilitate modality
alignment. Extensive experiments consistently demonstrate the effectiveness and
superiority of the proposed CIB in terms of input robustness and accuracy.
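For intuition, here is a minimal, generic sketch of the bottleneck tradeoff described above, written as a PyTorch-style finetuning loss: a stochastic bottleneck head on top of pretrained VLM features, with cross-entropy as the usual variational surrogate for maximizing MI between representations and answers, and a Gaussian KL term as the standard variational upper bound on MI between inputs and representations. The class and argument names (VIBHead, vib_loss, beta) are illustrative assumptions; this is not the paper's tighter correlation-aware bound.

```python
# Minimal, generic variational information-bottleneck sketch (not the exact CIB bound).
import torch
import torch.nn as nn
import torch.nn.functional as F


class VIBHead(nn.Module):
    """Stochastic bottleneck z ~ N(mu(x), sigma(x)^2) followed by an answer classifier."""

    def __init__(self, feat_dim: int, bottleneck_dim: int, num_answers: int):
        super().__init__()
        self.to_mu = nn.Linear(feat_dim, bottleneck_dim)
        self.to_logvar = nn.Linear(feat_dim, bottleneck_dim)
        self.classifier = nn.Linear(bottleneck_dim, num_answers)

    def forward(self, fused_features: torch.Tensor):
        mu = self.to_mu(fused_features)
        logvar = self.to_logvar(fused_features)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.classifier(z), mu, logvar


def vib_loss(logits, labels, mu, logvar, beta: float = 1e-3):
    # Maximize MI(answers; representations): cross-entropy is the standard lower-bound surrogate.
    task_loss = F.cross_entropy(logits, labels)
    # Minimize MI(inputs; representations): KL(q(z|x) || N(0, I)) is the usual upper bound.
    kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(dim=-1).mean()
    return task_loss + beta * kl


# Typical use during finetuning, assuming `fused` holds pooled VLM features:
#   logits, mu, logvar = head(fused)
#   loss = vib_loss(logits, answer_ids, mu, logvar)
```

CIB departs from this vanilla bottleneck in the compression term: instead of a plain KL to a fixed prior, it bounds the MI between the multimodal inputs and the representations using the internal correlations between the visual and linguistic parts, which is what drives the reported gains in robustness and modality alignment.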
Related papers
- Visual Contexts Clarify Ambiguous Expressions: A Benchmark Dataset [0.39462888523270856]
We propose VAGUE, a multimodal benchmark comprising 3.9K indirect human utterances paired with corresponding scenes.
Our work aims to delve deeper into the ability of models to understand indirect communication and seeks to contribute to the development of models capable of more refined and human-like interactions.
arXiv Detail & Related papers (2024-11-21T14:01:42Z)
- RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training [55.54020926284334]
Multimodal Large Language Models (MLLMs) have recently received substantial interest, reflecting their emerging potential as general-purpose models for various vision-language tasks.
Retrieval augmentation techniques have proven to be effective plugins for both LLMs and MLLMs.
In this study, we propose multimodal adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training (RA-BLIP), a novel retrieval-augmented framework for various MLLMs.
arXiv Detail & Related papers (2024-10-18T03:45:19Z)
- Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference [20.761803725098005]
Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities.
A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number of inference networks for all possible modality combinations.
We introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework.
arXiv Detail & Related papers (2024-10-15T08:49:38Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Enhancing Multiple Reliability Measures via Nuisance-extended Information Bottleneck [77.37409441129995]
In practical scenarios where training data is limited, many predictive signals in the data can instead stem from biases in data acquisition.
We consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training.
We propose an autoencoder-based training scheme to implement the objective, as well as practical encoder designs that facilitate the proposed hybrid discriminative-generative training.
arXiv Detail & Related papers (2023-03-24T16:03:21Z)
- Adaptive Contrastive Learning on Multimodal Transformer for Review Helpfulness Predictions [40.70793282367128]
We propose Multimodal Contrastive Learning for the Multimodal Review Helpfulness Prediction (MRHP) problem.
In addition, we introduce an Adaptive Weighting scheme for our contrastive learning approach.
Finally, we propose a Multimodal Interaction module to address the unaligned nature of multimodal data.
arXiv Detail & Related papers (2022-11-07T13:05:56Z)
- MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model [35.52349231889843]
We project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE).
Compared to the existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information.
We propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM).
arXiv Detail & Related papers (2022-10-11T10:54:54Z)
- Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization [76.68866368409216]
We propose learning to dynamically select discretization tightness conditioned on inputs.
We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks.
arXiv Detail & Related papers (2022-02-02T23:54:26Z)
- Multi-Task Variational Information Bottleneck [8.55293326934818]
Multi-task learning (MTL) is an important subject in machine learning and artificial intelligence.
This article proposes an MTL model based on the architecture of the variational information bottleneck (VIB).
Extensive observations on three public data sets under adversarial attacks show that the proposed model is competitive with state-of-the-art algorithms.
arXiv Detail & Related papers (2020-07-01T09:06:20Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
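To make the diversity idea in the last entry concrete, here is a rough, generic sketch (an assumption-level construction, not that paper's actual adversarial loss over latent variables): an ensemble objective that keeps every member accurate while rewarding pairwise disagreement between their predicted distributions. The function and argument names (ensemble_loss, diversity_weight) are illustrative.

```python
# Generic prediction-diversity ensemble loss (illustrative; not the cited paper's method).
import itertools
import torch
import torch.nn.functional as F


def ensemble_loss(member_logits: list[torch.Tensor], labels: torch.Tensor,
                  diversity_weight: float = 0.1) -> torch.Tensor:
    # Per-member task loss keeps each member accurate on the labels.
    task = sum(F.cross_entropy(logits, labels) for logits in member_logits)
    # Pairwise symmetric KL between member predictions; subtracting it rewards
    # disagreement, one simple way to encourage diverse outputs across the ensemble.
    diversity = 0.0
    for a, b in itertools.combinations(member_logits, 2):
        log_p, log_q = F.log_softmax(a, dim=-1), F.log_softmax(b, dim=-1)
        diversity = diversity + 0.5 * (
            F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
            + F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")
        )
    return task - diversity_weight * diversity
```

In practice the diversity term is usually kept small (or clipped) so that it regularizes rather than dominates the task loss.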