Related papers: MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model

URL: http://arxiv.org/abs/2210.05335v3
Date: Thu, 20 Jul 2023 16:24:14 GMT
Title: MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Authors: Yatai Ji, Junjie Wang, Yuan Gong, Lin Zhang, Yanru Zhu, Hongfa Wang, Jiaxing Zhang, Tetsuya Sakai, Yujiu Yang
Abstract summary: We project the representations of all modalities as probabilistic distributions via a Probability Distribution (PDE) Compared to the existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information. We propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM)
Score: 35.52349231889843
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our interpretation, including inter- and intra-modal uncertainty. Little effort has studied the modeling of this uncertainty, particularly in pre-training on unlabeled datasets and fine-tuning in task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) by utilizing sequence-level interactions. Compared to the existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results.

Related papers

Explaining multimodal LLMs via intra-modal token interactions [55.27436637894534]
Multimodal Large Language Models (MLLMs) have achieved remarkable success across diverse vision-language tasks, yet their internal decision-making mechanisms remain insufficiently understood.<n>We propose enhancing interpretability by leveraging intra-modal interaction.
arXiv Detail & Related papers (2025-09-26T14:39:13Z)
Diagnosing and Mitigating Modality Interference in Multimodal Large Language Models [28.20124264650572]
Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities across tasks.<n>They often exhibit difficulty in distinguishing task-relevant from irrelevant signals, particularly in tasks like Visual Question Answering (VQA)<n>This vulnerability becomes more evident in modality-specific tasks such as image classification or pure text question answering.<n>We propose a novel framework to fine-tune MLLMs, including perturbation-based data augmentation with both perturbations and adversarial perturbations.
arXiv Detail & Related papers (2025-05-26T07:31:32Z)
Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models [9.47743870776814]
Vision-Language Models (VLMs) learn joint representations by mapping images and text into a shared latent space.<n>GroVE builds on the Gaussian Process Latent Variable Model (GPLVM) to learn a shared low-dimensional latent space where image and text inputs are mapped to a unified representation.<n>GroVE achieves state-of-the-art uncertainty calibration across multiple downstream tasks, including cross-modal retrieval, visual question answering, and active learning.
arXiv Detail & Related papers (2025-05-08T11:57:35Z)
Latent Distribution Decoupling: A Probabilistic Framework for Uncertainty-Aware Multimodal Emotion Recognition [7.25361375272096]
Multimodal multi-label emotion recognition aims to identify the concurrent presence of multiple emotions in multimodal data. Existing studies overlook the impact of textbfaleatoric uncertainty, which is the inherent noise in the multimodal data. This paper proposes Latent emotional Distribution Decomposition with Uncertainty perception framework.
arXiv Detail & Related papers (2025-02-19T18:53:23Z)
Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks. Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval. This paper proposes a unified approach that integrates the strengths of both paradigms.
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization line of method, MMAR takes in continuous-valued image tokens to avoid information loss. We show that MMAR demonstrates much more superior performance than other joint multi-modal models.
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models [6.610033827647869]
In real-world scenarios, consistently acquiring complete multimodal data presents significant challenges. This often leads to the issue of missing modalities, where data for certain modalities are absent. We propose a novel framework integrating parameter-efficient fine-tuning of unimodal pretrained models with a self-supervised joint-embedding learning method.
arXiv Detail & Related papers (2024-07-17T14:44:25Z)
ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models. We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods. We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC) UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
Modeling Multimodal Aleatoric Uncertainty in Segmentation with Mixture of Stochastic Expert [24.216869988183092]
We focus on capturing the data-inherent uncertainty (aka aleatoric uncertainty) in segmentation, typically when ambiguities exist in input images. We propose a novel mixture of experts (MoSE) model, where each expert network estimates a distinct mode of aleatoric uncertainty. We develop a Wasserstein-like loss that directly minimizes the distribution distance between the MoSE and ground truth annotations.
arXiv Detail & Related papers (2022-12-14T16:48:21Z)
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations. We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
Discriminative Multimodal Learning via Conditional Priors in Generative Models [21.166519800652047]
This research studies the realistic scenario in which all modalities and class labels are available for model training. We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities.
arXiv Detail & Related papers (2021-10-09T17:22:24Z)
Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text. These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining. We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z)
Modal Uncertainty Estimation via Discrete Latent Representation [4.246061945756033]
We introduce a deep learning framework that learns the one-to-many mappings between the inputs and outputs, together with faithful uncertainty measures. Our framework demonstrates significantly more accurate uncertainty estimation than the current state-of-the-art methods.
arXiv Detail & Related papers (2020-07-25T05:29:34Z)
Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction. We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data. Compared to the most competitive baselines, we show significant improvements in classification accuracy, under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.