Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation
- URL: http://arxiv.org/abs/2410.14975v1
- Date: Sat, 19 Oct 2024 04:46:51 GMT
- Title: Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation
- Authors: Seulbi Lee, Jihyo Kim, Sangheum Hwang,
- Abstract summary: We evaluate and analyze the OoDD capabilities of various proprietary and open-source LVLMs.
We propose a self-guided prompting approach, termed emphReflexive Guidance (ReGuide), aimed at enhancing the OoDD capability of LVLMs.
Experimental results demonstrate that our ReGuide enhances the performance of current LVLMs in both image classification and OoDD tasks.
- Score: 4.506099292980221
- License:
- Abstract: With the recent emergence of foundation models trained on internet-scale data and demonstrating remarkable generalization capabilities, such foundation models have become more widely adopted, leading to an expanding range of application domains. Despite this rapid proliferation, the trustworthiness of foundation models remains underexplored. Specifically, the out-of-distribution detection (OoDD) capabilities of large vision-language models (LVLMs), such as GPT-4o, which are trained on massive multi-modal data, have not been sufficiently addressed. The disparity between their demonstrated potential and practical reliability raises concerns regarding the safe and trustworthy deployment of foundation models. To address this gap, we evaluate and analyze the OoDD capabilities of various proprietary and open-source LVLMs. Our investigation contributes to a better understanding of how these foundation models represent confidence scores through their generated natural language responses. Based on our observations, we propose a self-guided prompting approach, termed \emph{Reflexive Guidance (ReGuide)}, aimed at enhancing the OoDD capability of LVLMs by leveraging self-generated image-adaptive concept suggestions. Experimental results demonstrate that our ReGuide enhances the performance of current LVLMs in both image classification and OoDD tasks.
Related papers
- Graph-based Unsupervised Disentangled Representation Learning via Multimodal Large Language Models [42.17166746027585]
We introduce a bidirectional weighted graph-based framework to learn factorized attributes and their interrelations within complex data.
Specifically, we propose a $beta$-VAE based module to extract factors as the initial nodes of the graph.
By integrating these complementary modules, our model successfully achieves fine-grained, practical and unsupervised disentanglement.
arXiv Detail & Related papers (2024-07-26T15:32:21Z) - Regularizing Hidden States Enables Learning Generalizable Reward Model for LLMs [25.011675414622392]
This study introduces a novel approach to enhance the reward model's generalization ability against distribution shifts.
We retain the base model's language model head and incorporate a suite of text-generation losses to preserve the hidden states' text-generation capabilities.
Our experimental results demonstrate that the introduced regularization technique markedly improves the accuracy of learned reward models.
arXiv Detail & Related papers (2024-06-14T17:49:59Z) - Debiasing Multimodal Large Language Models [61.6896704217147]
Large Vision-Language Models (LVLMs) have become indispensable tools in computer vision and natural language processing.
Our investigation reveals a noteworthy bias in the generated content, where the output is primarily influenced by the underlying Large Language Models (LLMs) prior to the input image.
To rectify these biases and redirect the model's focus toward vision information, we introduce two simple, training-free strategies.
arXiv Detail & Related papers (2024-03-08T12:35:07Z) - Causality-based Cross-Modal Representation Learning for
Vision-and-Language Navigation [15.058687283978077]
Vision-and-Language Navigation (VLN) has gained significant research interest in recent years due to its potential applications in real-world scenarios.
Existing VLN methods struggle with the issue of spurious associations, resulting in poor generalization with a significant performance gap between seen and unseen environments.
We propose a unified framework CausalVLN based on the causal learning paradigm to train a robust navigator capable of learning unbiased feature representations.
arXiv Detail & Related papers (2024-03-06T02:01:38Z) - Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit modality gap, showing discrepancy when given textual and visual inputs that correspond to the same concept.
We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z) - Bridging Generative and Discriminative Models for Unified Visual
Perception with Diffusion Priors [56.82596340418697]
We propose a simple yet effective framework comprising a pre-trained Stable Diffusion (SD) model containing rich generative priors, a unified head (U-head) capable of integrating hierarchical representations, and an adapted expert providing discriminative priors.
Comprehensive investigations unveil potential characteristics of Vermouth, such as varying granularity of perception concealed in latent variables at distinct time steps and various U-net stages.
The promising results demonstrate the potential of diffusion models as formidable learners, establishing their significance in furnishing informative and robust visual representations.
arXiv Detail & Related papers (2024-01-29T10:36:57Z) - Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z) - Consistency Regularization for Generalizable Source-free Domain
Adaptation [62.654883736925456]
Source-free domain adaptation (SFDA) aims to adapt a well-trained source model to an unlabelled target domain without accessing the source dataset.
Existing SFDA methods ONLY assess their adapted models on the target training set, neglecting the data from unseen but identically distributed testing sets.
We propose a consistency regularization framework to develop a more generalizable SFDA method.
arXiv Detail & Related papers (2023-08-03T07:45:53Z) - The Open-World Lottery Ticket Hypothesis for OOD Intent Classification [68.93357975024773]
We shed light on the fundamental cause of model overconfidence on OOD.
We also extend the Lottery Ticket Hypothesis to open-world scenarios.
arXiv Detail & Related papers (2022-10-13T14:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.