Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models
- URL: http://arxiv.org/abs/2403.20331v1
- Date: Fri, 29 Mar 2024 17:59:53 GMT
- Title: Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models
- Authors: Atsuyuki Miyai, Jingkang Yang, Jingyang Zhang, Yifei Ming, Qing Yu, Go Irie, Yixuan Li, Hai Li, Ziwei Liu, Kiyoharu Aizawa
- Abstract summary: This paper introduces a novel and significant challenge for Vision Language Models (VLMs), termed Unsolvable Problem Detection (UPD).
UPD examines the VLM's ability to withhold answers when faced with unsolvable problems in the context of Visual Question Answering (VQA) tasks.
Extensive experiments indicate that most VLMs, including GPT-4V and LLaVA-Next-34B, struggle with the proposed benchmarks to varying extents.
- Score: 84.78457918843165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel and significant challenge for Vision Language Models (VLMs), termed Unsolvable Problem Detection (UPD). UPD examines the VLM's ability to withhold answers when faced with unsolvable problems in the context of Visual Question Answering (VQA) tasks. UPD encompasses three distinct settings: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD). To investigate the UPD problem in depth, we conduct extensive experiments; the results indicate that most VLMs, including GPT-4V and LLaVA-Next-34B, struggle with our benchmarks to varying extents, highlighting significant room for improvement. To address UPD, we explore both training-free and training-based solutions, offering new insights into their effectiveness and limitations. We hope our insights, together with future efforts within the proposed UPD settings, will enhance the broader understanding and development of more practical and reliable VLMs.
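As a concrete illustration of the training-free direction, one simple prompt-based probe is to give the model an explicit option to withhold an answer. The sketch below is not the paper's implementation; it assumes a generic `query_vlm(image, prompt)` callable, and the refusal phrases are hypothetical:

```python
# Minimal sketch of a prompt-based (training-free) UPD probe.
# `query_vlm` is a hypothetical stand-in for any VLM API call.

WITHHOLD_INSTRUCTION = (
    "If the question cannot be answered from the image, or none of the "
    "options is correct, answer exactly: None of the above."
)

def build_prompt(question: str, options: list[str]) -> str:
    """Format a multiple-choice VQA prompt with an explicit escape hatch."""
    lettered = "\n".join(f"({chr(65 + i)}) {opt}" for i, opt in enumerate(options))
    return f"{question}\n{lettered}\n{WITHHOLD_INSTRUCTION}"

def is_withheld(answer: str) -> bool:
    """Count a response as 'withheld' if it refuses or picks the escape option."""
    refusals = ("none of the above", "cannot be answered", "not answerable")
    return any(phrase in answer.lower() for phrase in refusals)

def upd_accuracy(samples, query_vlm) -> float:
    """Fraction of unsolvable samples on which the VLM correctly withholds.

    Each sample is (image, question, options) where the ground-truth
    answer is absent or incompatible (the AAD/IASD/IVQD settings).
    """
    hits = sum(
        is_withheld(query_vlm(image, build_prompt(question, options)))
        for image, question, options in samples
    )
    return hits / len(samples)
```

A withholding metric like this is only meaningful alongside standard accuracy on solvable questions; otherwise a model that always refuses would score perfectly.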
Related papers
- LOVA3: Learning to Visual Question Answering, Asking and Assessment [63.41469979867312]
Question answering, asking, and assessment are three innate human traits crucial for understanding the world and acquiring knowledge.
Current Multimodal Large Language Models (MLLMs) primarily focus on question answering, often neglecting the full potential of questioning and assessment skills.
In this study, we introduce LOVA3, an innovative framework named "Learning tO Visual Question Answering, Asking and Assessment".
arXiv Detail & Related papers (2024-05-23T18:21:59Z)
- Effectiveness Assessment of Recent Large Vision-Language Models [78.69439393646554]
This paper endeavors to evaluate the competency of popular large vision-language models (LVLMs) in specialized and general tasks.
We employ six challenging tasks in three different application scenarios: natural, healthcare, and industrial.
We examine the performance of three recent open-source LVLMs, namely MiniGPT-v2, LLaVA-1.5, and Shikra, on both visual recognition and localization in these tasks.
arXiv Detail & Related papers (2024-03-07T08:25:27Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- Towards a Unified Multimodal Reasoning Framework [0.5120567378386615]
This report investigates the potential impact of combining Chain-of-Thought (CoT) reasoning and Visual Question Answering (VQA) techniques.
By employing TextVQA and ScienceQA datasets, we assessed the effectiveness of three text embedding methods and three visual embedding approaches.
Results from our experiments demonstrated the potential of these approaches in enhancing LMs' reasoning and question-answering capabilities.
arXiv Detail & Related papers (2023-12-22T19:07:00Z)
- A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering [56.01977227584777]
The emergence of multimodal large models (MLMs) has significantly advanced the field of visual understanding.
Yet, the true challenge lies in the domain of knowledge-intensive visual question answering (VQA) tasks.
This study provides an in-depth evaluation of the newly introduced GPT-4V.
arXiv Detail & Related papers (2023-11-13T18:22:32Z)
- Lost in Translation: When GPT-4V(ision) Can't See Eye to Eye with Text. A Vision-Language-Consistency Analysis of VLLMs and Beyond [7.760124498553333]
We study whether vision-language models execute vision and language tasks consistently or independently.
We introduce a systematic framework that quantifies the capability disparities between different modalities in the multi-modal setting.
We introduce "Vision Description Prompting," a method that effectively improves performance in challenging vision-related tasks.
arXiv Detail & Related papers (2023-10-19T06:45:11Z)
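The "Vision Description Prompting" method named above suggests a two-stage pattern: first elicit a textual description of the image, then answer from that description alone. A minimal sketch under that reading, with hypothetical `query_vlm` and `query_llm` helpers:

```python
# Sketch of a two-stage "describe, then answer" prompting pattern,
# as suggested by the Vision Description Prompting entry above.
# `query_vlm` and `query_llm` are hypothetical API stand-ins.

def vision_description_prompting(image, question: str, query_vlm, query_llm) -> str:
    # Stage 1: elicit a detailed textual description of the image.
    description = query_vlm(
        image,
        "Describe this image in detail, including objects, text, and layout.",
    )
    # Stage 2: answer from the description alone, so the language side
    # of the model does the reasoning over explicit text.
    return query_llm(
        f"Image description: {description}\n"
        f"Question: {question}\n"
        "Answer the question using only the description above."
    )
```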
- Language Models as Knowledge Bases for Visual Word Sense Disambiguation [1.8591405259852054]
We propose knowledge-enhancement techniques to improve the retrieval performance of visiolinguistic (VL) transformers.
More specifically, knowledge stored in Large Language Models (LLMs) is retrieved with the help of appropriate prompts in a zero-shot manner.
Our approach is the first to analyze the merits of exploiting knowledge stored in LLMs in different ways to solve Visual Word Sense Disambiguation.
arXiv Detail & Related papers (2023-10-03T11:11:55Z)
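One plausible reading of the zero-shot knowledge step above: prompt an LLM for a gloss of the ambiguous word, append it to the text query, and let a VL transformer rank the candidate images. The sketch below assumes CLIP as the VL transformer (via the Hugging Face transformers API) and a hypothetical `query_llm` helper; it is not the paper's pipeline:

```python
# Sketch: enrich a Visual Word Sense Disambiguation query with LLM
# knowledge, then rank candidate images with CLIP. `query_llm` is a
# hypothetical helper; the CLIP usage follows the transformers API.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_candidates(word: str, context: str, images, query_llm):
    """Rank candidate images (a list of PIL images) against the
    knowledge-enriched query; returns indices, best match first."""
    # Zero-shot knowledge step: ask an LLM for the intended word sense.
    gloss = query_llm(
        f"In the phrase '{context}', briefly define the sense of '{word}'."
    )
    query = f"{context}. {gloss}"
    inputs = processor(
        text=[query], images=images,
        return_tensors="pt", padding=True, truncation=True,
    )
    with torch.no_grad():
        # logits_per_image has shape (num_images, 1): one score per image.
        scores = model(**inputs).logits_per_image.squeeze(-1)
    return scores.argsort(descending=True).tolist()
```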
- DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT achieves new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
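To make the "separated attention spaces" idea concrete, here is an illustrative PyTorch sketch in which each query attends to language tokens and visual tokens through two distinct attention modules before merging. It follows the idea named in the DiMBERT entry above, not the paper's exact architecture:

```python
# Illustrative sketch of disentangled multimodal attention: separate
# attention spaces for language and vision instead of one joint space.
import torch
import torch.nn as nn

class DisentangledMultimodalAttention(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.lang_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, queries, lang_tokens, vis_tokens):
        # Attend to each modality in its own attention space...
        lang_ctx, _ = self.lang_attn(queries, lang_tokens, lang_tokens)
        vis_ctx, _ = self.vis_attn(queries, vis_tokens, vis_tokens)
        # ...then merge the two context vectors back to model width.
        return self.merge(torch.cat([lang_ctx, vis_ctx], dim=-1))

# Usage: batch of 2, 16 text tokens, 49 visual regions, width 256.
x = torch.randn(2, 16, 256)
v = torch.randn(2, 49, 256)
out = DisentangledMultimodalAttention(256)(x, x, v)  # -> (2, 16, 256)
```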
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.