How Far Are We from Generating Missing Modalities with Foundation Models?
- URL: http://arxiv.org/abs/2506.03530v2
- Date: Mon, 11 Aug 2025 06:25:52 GMT
- Title: How Far Are We from Generating Missing Modalities with Foundation Models?
- Authors: Guanzhou Ke, Bo Wang, Guoqing Chao, Weiming Hu, Shengfeng He,
- Abstract summary: We propose an agentic framework tailored for missing modality reconstruction.<n>Our method reduces FID for missing image reconstruction by at least 14% and MER for missing text reconstruction by at least 10% compared to baselines.
- Score: 49.425856207329524
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14\% and MER for missing text reconstruction by at least 10\% compared to baselines. Code are released at: https://github.com/Guanzhou-Ke/AFM2.
Related papers
- OMG-Agent: Toward Robust Missing Modality Generation with Decoupled Coarse-to-Fine Agentic Workflows [9.617220633655716]
We present textbfunderlineOmni-textbfunderlineModality textbfunderlineGeneration Agent (textbfOMG-Agent)
arXiv Detail & Related papers (2026-02-04T02:25:40Z) - Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration [40.720288165545476]
We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features.<n>Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features to steer the generation of semantically consistent features; (II) Cross-Modal Mutual Learning mechanism, which bridges the semantic spaces of dual encoders to achieve bidirectional alignment.
arXiv Detail & Related papers (2026-02-03T06:06:35Z) - SurvUnc: A Meta-Model Based Uncertainty Quantification Framework for Survival Analysis [8.413107141283502]
Survival analysis is fundamental in numerous real-world applications, particularly in high-stakes domains such as healthcare and risk assessment.<n>Despite advances in numerous survival models, quantifying the uncertainty of predictions remains underexplored and challenging.<n>We introduce SurvUnc, a novel meta-model based framework for post-hoc uncertainty quantification for survival models.
arXiv Detail & Related papers (2025-05-20T18:12:20Z) - Are you SURE? Enhancing Multimodal Pretraining with Missing Modalities through Uncertainty Estimation [12.459901557580052]
We present SURE, a novel framework that extends the capabilities of pretrained multimodal models by introducing latent space reconstruction and uncertainty estimation.<n>We show that SURE consistently achieves state-of-the-art performance, ensuring robust predictions even in the presence of incomplete data.
arXiv Detail & Related papers (2025-04-18T05:07:20Z) - Predictive Multiplicity in Survival Models: A Method for Quantifying Model Uncertainty in Predictive Maintenance Applications [0.0]
We frame predictive multiplicity as a critical concern in survival-based models.<n>We introduce formal measures -- ambiguity, discrepancy, and obscurity -- to quantify it.<n>This is particularly relevant for downstream tasks such as maintenance scheduling.
arXiv Detail & Related papers (2025-04-16T15:04:00Z) - Constrained Auto-Regressive Decoding Constrains Generative Retrieval [71.71161220261655]
Generative retrieval seeks to replace traditional search index data structures with a single large-scale neural network.<n>In this paper, we examine the inherent limitations of constrained auto-regressive generation from two essential perspectives: constraints and beam search.
arXiv Detail & Related papers (2025-04-14T06:54:49Z) - Rigorous Probabilistic Guarantees for Robust Counterfactual Explanations [80.86128012438834]
We show for the first time that computing the robustness of counterfactuals with respect to plausible model shifts is NP-complete.
We propose a novel probabilistic approach which is able to provide tight estimates of robustness with strong guarantees.
arXiv Detail & Related papers (2024-07-10T09:13:11Z) - Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization [14.606035444283984]
Current approaches focus on developing models that handle modality-incomplete inputs during inference.
We propose a robust universal model with modality reconstruction and model personalization.
Our method has been extensively validated on two brain tumor segmentation benchmarks.
arXiv Detail & Related papers (2024-06-04T06:07:24Z) - Solving Inverse Problems with Model Mismatch using Untrained Neural Networks within Model-based Architectures [14.551812310439004]
We introduce an untrained forward model residual block within the model-based architecture to match the data consistency in the measurement domain for each instance.
Our approach offers a unified solution that is less parameter-sensitive, requires no additional data, and enables simultaneous fitting of the forward model and reconstruction in a single pass.
arXiv Detail & Related papers (2024-03-07T19:02:13Z) - eXplainable Bayesian Multi-Perspective Generative Retrieval [6.823521786512908]
We introduce uncertainty calibration and interpretability into a retrieval pipeline.
We incorporate techniques such as LIME and SHAP to analyze the behavior of a black-box reranker model.
Our methods demonstrate substantial performance improvements across three KILT datasets.
arXiv Detail & Related papers (2024-02-04T09:34:13Z) - Structured Radial Basis Function Network: Modelling Diversity for
Multiple Hypotheses Prediction [51.82628081279621]
Multi-modal regression is important in forecasting nonstationary processes or with a complex mixture of distributions.
A Structured Radial Basis Function Network is presented as an ensemble of multiple hypotheses predictors for regression problems.
It is proved that this structured model can efficiently interpolate this tessellation and approximate the multiple hypotheses target distribution.
arXiv Detail & Related papers (2023-09-02T01:27:53Z) - Steerable Conditional Diffusion for Out-of-Distribution Adaptation in Medical Image Reconstruction [75.91471250967703]
We introduce a novel sampling framework called Steerable Conditional Diffusion.<n>This framework adapts the diffusion model, concurrently with image reconstruction, based solely on the information provided by the available measurement.<n>We achieve substantial enhancements in out-of-distribution performance across diverse imaging modalities.
arXiv Detail & Related papers (2023-08-28T08:47:06Z) - Uncertainty Estimation of Transformers' Predictions via Topological Analysis of the Attention Matrices [3.1466086042810884]
Transformer-based language models have set new benchmarks across a wide range of NLP tasks.
reliably estimating the uncertainty of their predictions remains a significant challenge.
We tackle these limitations by harnessing the geometry of attention maps across multiple heads and layers to assess model confidence.
Our method significantly outperforms existing uncertainty estimation techniques on benchmarks for acceptability judgments and artificial text detection.
arXiv Detail & Related papers (2023-08-22T09:17:45Z) - Understanding and Constructing Latent Modality Structures in Multi-modal
Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z) - Exploiting modality-invariant feature for robust multimodal emotion
recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN)
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z) - HandFlow: Quantifying View-Dependent 3D Ambiguity in Two-Hand
Reconstruction with Normalizing Flow [73.7895717883622]
We explicitly model the distribution of plausible reconstructions in a conditional normalizing flow framework.
We show that explicit ambiguity modeling is better-suited for this challenging problem.
arXiv Detail & Related papers (2022-10-04T15:42:22Z) - Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inferences based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z) - Attentional Prototype Inference for Few-Shot Segmentation [128.45753577331422]
We propose attentional prototype inference (API), a probabilistic latent variable framework for few-shot segmentation.
We define a global latent variable to represent the prototype of each object category, which we model as a probabilistic distribution.
We conduct extensive experiments on four benchmarks, where our proposal obtains at least competitive and often better performance than state-of-the-art prototype-based methods.
arXiv Detail & Related papers (2021-05-14T06:58:44Z) - Towards Trustworthy Predictions from Deep Neural Networks with Fast
Adversarial Calibration [2.8935588665357077]
We propose an efficient yet general modelling approach for obtaining well-calibrated, trustworthy probabilities for samples obtained after a domain shift.
We introduce a new training strategy combining an entropy-encouraging loss term with an adversarial calibration loss term and demonstrate that this results in well-calibrated and technically trustworthy predictions.
arXiv Detail & Related papers (2020-12-20T13:39:29Z) - Trust but Verify: Assigning Prediction Credibility by Counterfactual
Constrained Learning [123.3472310767721]
Prediction credibility measures are fundamental in statistics and machine learning.
These measures should account for the wide variety of models used in practice.
The framework developed in this work expresses the credibility as a risk-fit trade-off.
arXiv Detail & Related papers (2020-11-24T19:52:38Z) - Accurate and Robust Feature Importance Estimation under Distribution
Shifts [49.58991359544005]
PRoFILE is a novel feature importance estimation method.
We show significant improvements over state-of-the-art approaches, both in terms of fidelity and robustness.
arXiv Detail & Related papers (2020-09-30T05:29:01Z) - Quantifying Model Uncertainty in Inverse Problems via Bayesian Deep
Gradient Descent [4.029853654012035]
Recent advances in inverse problems leverage powerful data-driven models, e.g., deep neural networks.
We develop a scalable, data-driven, knowledge-aided computational framework to quantify the model uncertainty via Bayesian neural networks.
arXiv Detail & Related papers (2020-07-20T09:43:31Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches, is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.