How Far Are We from Predicting Missing Modalities with Foundation Models?
- URL: http://arxiv.org/abs/2506.03530v1
- Date: Wed, 04 Jun 2025 03:22:44 GMT
- Title: How Far Are We from Predicting Missing Modalities with Foundation Models?
- Authors: Guanzhou Ke, Yi Xie, Xiaoli Wang, Guoqing Chao, Bo Wang, Shengfeng He,
- Abstract summary: Current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. The proposed agentic framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. Experimental results show that our method reduces FID for missing image prediction by at least 14% and MER for missing text prediction by at least 10% compared to baselines.
- Score: 31.853781353441242
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality prediction remains underexplored. To investigate this, we categorize existing approaches into three representative paradigms, encompassing a total of 42 model variants, and conduct a comprehensive evaluation in terms of prediction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned predictions. To address these challenges, we propose an agentic framework tailored for missing modality prediction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image prediction by at least 14% and MER for missing text prediction by at least 10% compared to baselines.
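The abstract describes the agentic pipeline only at a high level. The sketch below is a minimal illustration rather than the authors' implementation: it shows one way the described loop could be organized (mine modality-aware context from the available inputs, generate the missing modality, then iteratively verify and refine it via internal feedback). All component names (miner, generator, verifier), their methods, and the acceptance threshold are assumptions introduced here for illustration.

```python
# Hypothetical sketch of the agentic missing-modality loop described in the abstract.
# The miner/generator/verifier objects and their methods are illustrative assumptions,
# not the paper's actual interface.

from dataclasses import dataclass


@dataclass
class Candidate:
    content: object   # e.g. an image tensor or a text string
    score: float      # verifier's quality estimate in [0, 1]


def predict_missing_modality(available, missing_type,
                             miner, generator, verifier,
                             max_rounds=3, accept_threshold=0.8):
    """Generate the missing modality from the available ones with self-refinement."""
    # 1) Modality-aware mining: extract fine-grained semantics from what is present.
    context = miner.extract(available, target=missing_type)

    candidate = Candidate(content=generator.generate(context), score=0.0)
    for _ in range(max_rounds):
        # 2) Internal feedback: the verifier scores the candidate and notes flaws.
        feedback = verifier.critique(candidate.content, available)
        candidate.score = feedback["score"]
        if candidate.score >= accept_threshold:
            break
        # 3) Self-refinement: regenerate conditioned on the critique.
        candidate.content = generator.refine(candidate.content, context, feedback["notes"])
    return candidate
```

In this reading, the verifier supplies both a scalar quality score and textual notes, mirroring the abstract's claim that generated modalities are validated and improved through internal feedback.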
Related papers
- SurvUnc: A Meta-Model Based Uncertainty Quantification Framework for Survival Analysis [8.413107141283502]
Survival analysis is fundamental in numerous real-world applications, particularly in high-stakes domains such as healthcare and risk assessment. Despite advances in numerous survival models, quantifying the uncertainty of predictions remains underexplored and challenging. We introduce SurvUnc, a novel meta-model based framework for post-hoc uncertainty quantification for survival models.
arXiv Detail & Related papers (2025-05-20T18:12:20Z) - Are you SURE? Enhancing Multimodal Pretraining with Missing Modalities through Uncertainty Estimation [12.459901557580052]
We present SURE, a novel framework that extends the capabilities of pretrained multimodal models by introducing latent space reconstruction and uncertainty estimation. We show that SURE consistently achieves state-of-the-art performance, ensuring robust predictions even in the presence of incomplete data.
arXiv Detail & Related papers (2025-04-18T05:07:20Z) - Predictive Multiplicity in Survival Models: A Method for Quantifying Model Uncertainty in Predictive Maintenance Applications [0.0]
We frame predictive multiplicity as a critical concern in survival-based models. We introduce formal measures -- ambiguity, discrepancy, and obscurity -- to quantify it. This is particularly relevant for downstream tasks such as maintenance scheduling.
arXiv Detail & Related papers (2025-04-16T15:04:00Z) - Rigorous Probabilistic Guarantees for Robust Counterfactual Explanations [80.86128012438834]
We show for the first time that computing the robustness of counterfactuals with respect to plausible model shifts is NP-complete.
We propose a novel probabilistic approach which is able to provide tight estimates of robustness with strong guarantees.
arXiv Detail & Related papers (2024-07-10T09:13:11Z) - eXplainable Bayesian Multi-Perspective Generative Retrieval [6.823521786512908]
We introduce uncertainty calibration and interpretability into a retrieval pipeline.
We incorporate techniques such as LIME and SHAP to analyze the behavior of a black-box reranker model.
Our methods demonstrate substantial performance improvements across three KILT datasets.
arXiv Detail & Related papers (2024-02-04T09:34:13Z) - Structured Radial Basis Function Network: Modelling Diversity for Multiple Hypotheses Prediction [51.82628081279621]
Multi-modal regression is important for forecasting nonstationary processes or data with a complex mixture of distributions.
A Structured Radial Basis Function Network is presented as an ensemble of multiple hypotheses predictors for regression problems.
It is proved that this structured model can efficiently interpolate this tessellation and approximate the multiple hypotheses target distribution.
arXiv Detail & Related papers (2023-09-02T01:27:53Z) - Uncertainty Estimation of Transformers' Predictions via Topological Analysis of the Attention Matrices [3.1466086042810884]
Transformer-based language models have set new benchmarks across a wide range of NLP tasks.
However, reliably estimating the uncertainty of their predictions remains a significant challenge.
We tackle these limitations by harnessing the geometry of attention maps across multiple heads and layers to assess model confidence.
Our method significantly outperforms existing uncertainty estimation techniques on benchmarks for acceptability judgments and artificial text detection.
arXiv Detail & Related papers (2023-08-22T09:17:45Z) - Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inference heuristics based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z) - Attentional Prototype Inference for Few-Shot Segmentation [128.45753577331422]
We propose attentional prototype inference (API), a probabilistic latent variable framework for few-shot segmentation.
We define a global latent variable to represent the prototype of each object category, which we model as a probabilistic distribution.
We conduct extensive experiments on four benchmarks, where our proposal obtains at least competitive and often better performance than state-of-the-art prototype-based methods.
arXiv Detail & Related papers (2021-05-14T06:58:44Z) - Towards Trustworthy Predictions from Deep Neural Networks with Fast Adversarial Calibration [2.8935588665357077]
We propose an efficient yet general modelling approach for obtaining well-calibrated, trustworthy probabilities for samples obtained after a domain shift.
We introduce a new training strategy combining an entropy-encouraging loss term with an adversarial calibration loss term and demonstrate that this results in well-calibrated and technically trustworthy predictions.
arXiv Detail & Related papers (2020-12-20T13:39:29Z) - Trust but Verify: Assigning Prediction Credibility by Counterfactual Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning.
These measures should account for the wide variety of models used in practice.
The framework developed in this work expresses credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z) - Accurate and Robust Feature Importance Estimation under Distribution Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, assigning optimal weights to unlabeled queries; a hedged sketch of such a confidence-weighted update appears after this list.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
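As referenced in the Meta-Learned Confidence entry above, the following minimal sketch illustrates a confidence-weighted transductive prototype update. The softmax-over-distances confidence here is a simple placeholder for the meta-learned confidence described in that paper, and all tensor names, shapes, and the temperature parameter are assumptions made for illustration only.

```python
# Illustrative sketch (not the cited paper's implementation): transductive prototype
# refinement where each unlabeled query contributes to a class prototype in proportion
# to a per-query confidence. Confidence is approximated by a softmax over negative
# distances; the cited paper instead meta-learns this weighting.

import torch


def refine_prototypes(support_emb, support_labels, query_emb, num_classes, temperature=1.0):
    """support_emb: [Ns, D] floats, support_labels: [Ns] int64, query_emb: [Nq, D] floats.
    Assumes every class has at least one support example."""
    # Initial prototypes: per-class mean of the support embeddings.          -> [C, D]
    protos = torch.stack([support_emb[support_labels == c].mean(dim=0)
                          for c in range(num_classes)])

    # Placeholder confidence of each query for each class.                   -> [Nq, C]
    dists = torch.cdist(query_emb, protos)
    conf = torch.softmax(-dists / temperature, dim=1)

    # Confidence-weighted update: queries are folded into the prototypes with
    # weights given by conf; support points keep weight 1 for their own class.
    support_onehot = torch.nn.functional.one_hot(support_labels, num_classes).float()  # [Ns, C]
    weights = torch.cat([support_onehot, conf], dim=0)                                 # [Ns+Nq, C]
    emb = torch.cat([support_emb, query_emb], dim=0)                                   # [Ns+Nq, D]
    refined = (weights.t() @ emb) / weights.sum(dim=0, keepdim=True).t()               # [C, D]
    return refined
```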
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.