ViLU: Learning Vision-Language Uncertainties for Failure Prediction
- URL: http://arxiv.org/abs/2507.07620v3
- Date: Thu, 24 Jul 2025 09:03:07 GMT
- Title: ViLU: Learning Vision-Language Uncertainties for Failure Prediction
- Authors: Marc Lafon, Yannis Karmim, Julio Silva-Rodríguez, Paul Couairon, Clément Rambour, Raphaël Fournier-Sniehotta, Ismail Ben Ayed, Jose Dolz, Nicolas Thome
- Abstract summary: We introduce ViLU, a new Vision-Language Uncertainty quantification framework. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself.
- Score: 28.439422629957424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reliable Uncertainty Quantification (UQ) and failure prediction remain open challenges for Vision-Language Models (VLMs). We introduce ViLU, a new Vision-Language Uncertainty quantification framework that contextualizes uncertainty estimates by leveraging all task-relevant textual representations. ViLU constructs an uncertainty-aware multi-modal representation by integrating the visual embedding, the predicted textual embedding, and an image-conditioned textual representation via cross-attention. Unlike traditional UQ methods based on loss prediction, ViLU trains an uncertainty predictor as a binary classifier to distinguish correct from incorrect predictions using a weighted binary cross-entropy loss, making it loss-agnostic. In particular, our proposed approach is well-suited for post-hoc settings, where only vision and text embeddings are available without direct access to the model itself. Extensive experiments on diverse datasets show the significant gains of our method compared to state-of-the-art failure prediction methods. We apply our method to standard classification datasets, such as ImageNet-1k, as well as large-scale image-caption datasets like CC12M and LAION-400M. Ablation studies highlight the critical role of our architecture and training in achieving effective uncertainty quantification. Our code is publicly available and can be found here: https://github.com/ykrmm/ViLU.
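The abstract gives enough detail to illustrate the idea. Below is a minimal, hypothetical PyTorch sketch of a ViLU-style post-hoc failure predictor: cross-attention lets the visual embedding attend over all task text embeddings, the result is fused with the predicted class's text embedding, and a small binary head is trained with a weighted binary cross-entropy loss on correct-vs-incorrect labels. Module names, dimensions, and the exact fusion layout are illustrative assumptions, not the released implementation (see the repository linked above for that).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FailurePredictor(nn.Module):
    """Hypothetical ViLU-style uncertainty head over frozen CLIP-like embeddings."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # Cross-attention: the image embedding queries all class/caption text embeddings,
        # producing an image-conditioned textual representation.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Binary classifier over [visual, predicted-text, image-conditioned-text].
        self.head = nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def forward(self, img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
        # img_emb: (B, D) image embeddings; txt_emb: (C, D), one embedding per class/caption.
        sims = img_emb @ txt_emb.t()                      # zero-shot similarity scores
        pred_txt = txt_emb[sims.argmax(dim=-1)]           # predicted textual embedding, (B, D)
        txt_kv = txt_emb.unsqueeze(0).expand(img_emb.size(0), -1, -1)
        ctx, _ = self.cross_attn(img_emb.unsqueeze(1), txt_kv, txt_kv)
        feats = torch.cat([img_emb, pred_txt, ctx.squeeze(1)], dim=-1)
        return self.head(feats).squeeze(-1)               # failure logit per image

def weighted_bce(failure_logit, is_error, pos_weight):
    # Target is 1 when the VLM's zero-shot prediction was wrong, 0 otherwise;
    # pos_weight compensates for errors being much rarer than correct predictions.
    return F.binary_cross_entropy_with_logits(
        failure_logit, is_error.float(), pos_weight=pos_weight
    )
```

Because such a head only consumes frozen embeddings and correctness labels, it needs no access to the VLM's internal loss or gradients, which is what makes this kind of predictor loss-agnostic and usable post hoc.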
Related papers
- Audio-Driven Talking Face Video Generation with Joint Uncertainty Learning [11.551314848756107]
We propose a Joint Uncertainty Learning Network (JULNet) for high-quality talking face video generation. We first design an uncertainty module to individually predict the error map and uncertainty map after obtaining the generated image. By jointly optimizing error and uncertainty, the performance and robustness of our model can be enhanced.
arXiv Detail & Related papers (2025-04-26T05:45:38Z)
- Words or Vision: Do Vision-Language Models Have Blind Faith in Text? [34.88114876390461]
Vision-Language Models (VLMs) excel in integrating visual and textual information for vision-centric tasks. We investigate VLMs' modality preferences when faced with visual data and varied textual inputs in vision-centered settings. We discover a "blind faith in text" phenomenon: VLMs disproportionately trust textual data over visual data when inconsistencies arise.
arXiv Detail & Related papers (2025-03-04T02:21:07Z)
- Adaptive Prompt Tuning: Vision Guided Prompt Tuning with Cross-Attention for Fine-Grained Few-Shot Learning [5.242869847419834]
Few-shot, fine-grained classification in computer vision poses significant challenges due to the need to differentiate subtle class distinctions with limited data. This paper presents a novel method that enhances the Contrastive Language-Image Pre-Training model through adaptive prompt tuning.
arXiv Detail & Related papers (2024-12-19T08:51:01Z)
- Sample-agnostic Adversarial Perturbation for Vision-Language Pre-training Models [7.350203999073509]
Recent studies on AI security have highlighted the vulnerability of Vision-Language Pre-training models to subtle yet intentionally designed perturbations in images and texts.
To the best of our knowledge, this is the first work to explore, through multimodal decision boundaries, the creation of a universal, sample-agnostic perturbation that applies to any image.
arXiv Detail & Related papers (2024-08-06T06:25:39Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the prior of the underlying Large Language Model (LLM) rather than by the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models [69.50316788263433]
We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained vision-language models.
We quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods.
We present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
arXiv Detail & Related papers (2023-07-01T18:16:06Z)
- MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model [35.52349231889843]
We project the representations of all modalities as probability distributions via a Probability Distribution Encoder (PDE).
Compared to the existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information.
We propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM).
arXiv Detail & Related papers (2022-10-11T10:54:54Z)
- Loss re-scaling VQA: Revisiting the Language Prior Problem from a Class-imbalance View [129.392671317356]
We propose to interpret the language prior problem in VQA from a class-imbalance view.
It explicitly reveals why the VQA model tends to produce a frequent yet obviously wrong answer.
We also justify the validity of the class imbalance interpretation scheme on other computer vision tasks, such as face recognition and image classification.
arXiv Detail & Related papers (2020-10-30T00:57:17Z)
- An Uncertainty-based Human-in-the-loop System for Industrial Tool Wear Analysis [68.8204255655161]
We show that uncertainty measures based on Monte-Carlo dropout in the context of a human-in-the-loop system increase the system's transparency and performance.
A simulation study demonstrates that the uncertainty-based human-in-the-loop system increases performance for different levels of human involvement.
arXiv Detail & Related papers (2020-07-14T15:47:37Z)
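The last entry above builds its human-in-the-loop system on Monte-Carlo dropout, a standard way of obtaining predictive uncertainty from a single network. For reference, here is a minimal, generic sketch of that measure; the model, sample count, and deferral rule are placeholders, and only the MC-dropout procedure itself is standard:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, n_samples: int = 20):
    """Run several stochastic forward passes with dropout kept active."""
    model.eval()
    # Re-enable only the dropout layers so each pass samples a different sub-network.
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()
    probs = torch.stack([model(x).softmax(dim=-1) for _ in range(n_samples)])
    mean = probs.mean(dim=0)                                   # averaged class probabilities
    entropy = -(mean * mean.clamp_min(1e-12).log()).sum(-1)    # predictive entropy per input
    return mean, entropy
```

In a human-in-the-loop setting, inputs whose predictive entropy exceeds a chosen threshold would be deferred to a human annotator.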