Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models
- URL: http://arxiv.org/abs/2403.20331v1
- Date: Fri, 29 Mar 2024 17:59:53 GMT
- Title: Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models
- Authors: Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Li, Ziwei Liu, Kiyoharu Aizawa
- Abstract summary: This paper introduces a novel and significant challenge for Vision Language Models (VLMs), termed Unsolvable Problem Detection (UPD).
UPD examines the VLM's ability to withhold answers when faced with unsolvable problems in the context of Visual Question Answering (VQA) tasks.
Extensive experiments indicate that most VLMs, including GPT-4V and LLaVA-Next-34B, struggle with the proposed benchmarks to varying extents.
- Score: 84.78457918843165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel and significant challenge for Vision Language Models (VLMs), termed Unsolvable Problem Detection (UPD). UPD examines the VLM's ability to withhold answers when faced with unsolvable problems in the context of Visual Question Answering (VQA) tasks. UPD encompasses three distinct settings: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD). To investigate the UPD problem in depth, we conduct extensive experiments; the results indicate that most VLMs, including GPT-4V and LLaVA-Next-34B, struggle with our benchmarks to varying extents, highlighting significant room for improvement. To address UPD, we explore both training-free and training-based solutions, offering new insights into their effectiveness and limitations. We hope our insights, together with future efforts within the proposed UPD settings, will enhance the broader understanding and development of more practical and reliable VLMs.
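As a concrete illustration of the training-free direction, one simple prompt-based probe is to give the model an explicit option to withhold an answer. The sketch below is not the paper's implementation; it assumes a generic `query_vlm(image, prompt)` callable, and the refusal phrases are hypothetical:

```python
# Minimal sketch of a prompt-based (training-free) UPD probe.
# `query_vlm` is a hypothetical stand-in for any VLM API call.

WITHHOLD_INSTRUCTION = (
    "If the question cannot be answered from the image, or none of the "
    "options is correct, answer exactly: None of the above."
)

def build_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice VQA prompt with an explicit escape hatch."""
    lettered = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return f"{question}\n{lettered}\n{WITHHOLD_INSTRUCTION}"

def is_withheld(answer: str) -> bool:
    """Count a response as 'withheld' if it refuses or picks the escape option."""
    refusals = ("none of the above", "cannot be answered", "not answerable")
    return any(phrase in answer.lower() for phrase in refusals)

def upd_accuracy(samples, query_vlm) -> float:
    """Fraction of unsolvable samples on which the VLM correctly withholds.

    Each sample is (image, question, options) where the ground-truth
    answer is absent or incompatible (the AAD/IASD/IVQD settings).
    """
    hits = sum(
        is_withheld(query_vlm(image, build_prompt(question, options)))
        for image, question, options in samples
    )
    return hits / len(samples)
```

A withholding metric like this is only meaningful alongside standard accuracy on solvable questions; otherwise a model that always refuses would score perfectly.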
Related papers
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [63.41469979867312]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
In this study, we introduce LOVA3, an innovative framework named "Learning tO Visual Question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- Effectiveness Assessment of Recent Large Vision-Language Models [78.69439393646554]
This paper endeavors to evaluate the competency of popular large vision-language models (LVLMs) in specialized and general tasks.
We employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial.
We examine the performance of three recent open-source LVLMs, namely MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks.
arXiv Detail & Related papers (2024-03-07T08:25:27Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- Towards a Unified Multimodal Reasoning Framework [0.5120567378386615]
This report investigates the potential impact of combining Chain-of-Thought (CoT) reasoning and Visual Question Answering (VQA) techniques.
By employing TextVQA and ScienceQA datasets, we assessed the effectiveness of three text embedding methods and three visual embedding approaches.
Results from our experiments demonstrated the potential of these approaches in enhancing LMs' reasoning and question-answering capabilities.
arXiv Detail & Related papers (2023-12-22T19:07:00Z)
- A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering [56.01977227584777]
The emergence of multimodal large models (MLMs) has significantly advanced the field of visual understanding.
Yet, the true challenge lies in the domain of knowledge-intensive visual question answering (VQA) tasks.
This study provides an in-depth evaluation of the newly introduced GPT-4V.
arXiv Detail & Related papers (2023-11-13T18:22:32Z)
- Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond [7.760124498553333]
We study whether vision-language models execute vision and language tasks consistently or independently.
We introduce a systematic framework that quantifies the capability disparities between different modalities in the multi-modal setting.
We introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
arXiv Detail & Related papers (2023-10-19T06:45:11Z)
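The "Vision Description Prompting" method named above suggests a two-stage pattern: first elicit a textual description of the image, then answer from that description alone. A minimal sketch under that reading, with hypothetical `query_vlm` and `query_llm` helpers:

```python
# Sketch of a two-stage "describe, then answer" prompting pattern,
# as suggested by the Vision Description Prompting entry above.
# `query_vlm` and `query_llm` are hypothetical API stand-ins.

def vision_description_prompting(image, question: str, query_vlm, query_llm) -> str:
    # Stage 1: elicit a detailed textual description of the image.
    description = query_vlm(
        image,
        "Describe this image in detail, including objects, text, and layout.",
    )
    # Stage 2: answer from the description alone, so the language side
    # of the model does the reasoning over explicit text.
    return query_llm(
        f"Image description: {description}\n"
        f"Question: {question}\n"
        "Answer the question using only the description above."
    )
```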
- Language Models as Knowledge Bases for Visual Word Sense Disambiguation [1.8591405259852054]
We propose knowledge-enhancement techniques to improve the retrieval performance of visiolinguistic (VL) transformers.
More specifically, knowledge stored in Large Language Models (LLMs) is retrieved with the help of appropriate prompts in a zero-shot manner.
Our approach is the first to analyze the merits of exploiting knowledge stored in LLMs in different ways to solve Visual Word Sense Disambiguation.
arXiv Detail & Related papers (2023-10-03T11:11:55Z)
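One plausible reading of the zero-shot knowledge step above: prompt an LLM for a gloss of the ambiguous word, append it to the text query, and let a VL transformer rank the candidate images. The sketch below assumes CLIP as the VL transformer (via the Hugging Face transformers API) and a hypothetical `query_llm` helper; it is not the paper's pipeline:

```python
# Sketch: enrich a Visual Word Sense Disambiguation query with LLM
# knowledge, then rank candidate images with CLIP. `query_llm` is a
# hypothetical helper; the CLIP usage follows the transformers API.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_candidates(word: str, context: str, images, query_llm):
    """Rank candidate images (a list of PIL images) against the
    knowledge-enriched query; returns indices, best match first."""
    # Zero-shot knowledge step: ask an LLM for the intended word sense.
    gloss = query_llm(
        f"In the phrase '{context}', briefly define the sense of '{word}'."
    )
    query = f"{context}. {gloss}"
    inputs = processor(
        text=[query], images=images,
        return_tensors="pt", padding=True, truncation=True,
    )
    with torch.no_grad():
        # logits_per_image has shape (num_images, 1): one score per image.
        scores = model(**inputs).logits_per_image.squeeze(-1)
    return scores.argsort(descending=True).tolist()
```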
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT achieves new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
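To make the "separated attention spaces" idea concrete, here is an illustrative PyTorch sketch in which each query attends to language tokens and visual tokens through two distinct attention modules before merging. It follows the idea named in the DiMBERT entry above, not the paper's exact architecture:

```python
# Illustrative sketch of disentangled multimodal attention: separate
# attention spaces for language and vision instead of one joint space.
import torch
import torch.nn as nn

class DisentangledMultimodalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.lang_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, queries, lang_tokens, vis_tokens):
        # Attend to each modality in its own attention space...
        lang_ctx, _ = self.lang_attn(queries, lang_tokens, lang_tokens)
        vis_ctx, _ = self.vis_attn(queries, vis_tokens, vis_tokens)
        # ...then merge the two context vectors back to model width.
        return self.merge(torch.cat([lang_ctx, vis_ctx], dim=-1))

# Usage: batch of 2, 16 text tokens, 49 visual regions, width 256.
x = torch.randn(2, 16, 256)
v = torch.randn(2, 49, 256)
out = DisentangledMultimodalAttention(256)(x, x, v)  # -> (2, 16, 256)
```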
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.