A Novel Framework for Automated Explain Vision Model Using Vision-Language Models
- URL: http://arxiv.org/abs/2508.20227v1
- Date: Wed, 27 Aug 2025 19:16:40 GMT
- Title: A Novel Framework for Automated Explain Vision Model Using Vision-Language Models
- Authors: Phu-Vinh Nguyen, Tan-Hanh Pham, Chris Ngo, Truong Son Hy
- Abstract summary: This paper proposes a pipeline to explain vision models at both the sample and dataset levels. The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort.
- Score: 8.771508508443459
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The development of many vision models mainly focuses on improving their performance using metrics such as accuracy, IoU, and mAP, with less attention to explainability due to the complexity of applying xAI methods to provide a meaningful explanation of trained models. Although many existing xAI methods aim to explain vision models sample-by-sample, methods explaining the general behavior of vision models, which can only be captured after running on a large dataset, are still underexplored. Furthermore, understanding the behavior of vision models on general images can be very important to prevent biased judgments and help identify the model's trends and patterns. With the application of Vision-Language Models, this paper proposes a pipeline to explain vision models at both the sample and dataset levels. The proposed pipeline can be used to discover failure cases and gain insights into vision models with minimal effort, thereby integrating vision model development with xAI analysis to advance image analysis.
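The two-level explanation pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' actual implementation: the VLM call is stubbed out (in practice it would be an image captioning or visual question answering model), and all function and variable names are hypothetical.

```python
# Sketch of sample-level and dataset-level vision-model explanation,
# assuming a VLM that can describe an image given its identifier.
from collections import Counter

def explain_sample(image_id, prediction, label, vlm_describe):
    """Sample-level: ask a VLM to describe a misclassified image."""
    if prediction == label:
        return None  # only failure cases need an explanation
    return {"image": image_id,
            "error": f"{label} -> {prediction}",
            "description": vlm_describe(image_id)}

def explain_dataset(results, vlm_describe):
    """Dataset-level: aggregate per-sample explanations into trends."""
    explanations = [e for r in results
                    if (e := explain_sample(r["id"], r["pred"],
                                            r["label"], vlm_describe))]
    trend = Counter(e["error"] for e in explanations)  # frequent failure modes
    return explanations, trend

# Toy run with a stubbed VLM standing in for a real captioning model
stub_vlm = lambda img: f"caption for {img}"
results = [{"id": "a", "pred": "cat", "label": "cat"},
           {"id": "b", "pred": "dog", "label": "cat"},
           {"id": "c", "pred": "dog", "label": "cat"}]
expl, trend = explain_dataset(results, stub_vlm)
print(trend.most_common(1))  # most frequent failure mode
```

The dataset-level view is simply an aggregation over the sample-level explanations, which is the sense in which general model behavior "can only be captured after running on a large dataset".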
Related papers
- LVLM-Aided Alignment of Task-Specific Vision Models [49.96265491629163]
Small task-specific vision models are crucial in high-stakes domains. We introduce a novel and efficient method for aligning small task-specific vision models with human domain knowledge. Our method demonstrates substantial improvement in aligning model behavior with human specifications.
arXiv Detail & Related papers (2025-12-26T11:11:25Z)
- Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models [27.806966289284528]
We present a unified framework using sparse autoencoders (SAEs) to discover human-interpretable visual features. We show that SAEs can reliably identify and manipulate interpretable visual features without model re-training.
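The sparse-autoencoder idea in this entry can be sketched in miniature: encode a dense feature vector into an overcomplete code through a ReLU, penalize the code's L1 norm so that most units stay inactive, and decode back. The weights below are hand-picked toy values; a real SAE would be trained on activations collected from a vision model.

```python
# Minimal SAE forward pass in pure Python (toy weights, illustrative only).
def relu(x):
    return [max(0.0, v) for v in x]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sae_encode(x, W_enc, b_enc):
    """Sparse code: ReLU(W_enc @ x + b_enc)."""
    return relu([s + b for s, b in zip(matvec(W_enc, x), b_enc)])

def sae_decode(z, W_dec):
    """Reconstruct the feature from the sparse code."""
    return matvec(W_dec, z)

def sae_loss(x, x_hat, z, l1_coeff=0.1):
    """Reconstruction error plus an L1 sparsity penalty on the code."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1_coeff * sum(abs(v) for v in z)

# Toy 2-d feature mapped to a 4-d (overcomplete) code
W_enc = [[1, 0], [0, 1], [-1, 0], [0, -1]]
b_enc = [0.0] * 4
W_dec = [[1, 0, -1, 0], [0, 1, 0, -1]]
x = [0.5, -0.2]
z = sae_encode(x, W_enc, b_enc)
x_hat = sae_decode(z, W_dec)
print(z)  # most code units are inactive, i.e. the code is sparse
```

The overcomplete, mostly-zero code is what makes individual SAE units candidate "human-interpretable features": each active unit can be inspected in isolation.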
arXiv Detail & Related papers (2025-02-10T18:32:41Z)
- Toward universal steering and monitoring of AI models [16.303681959333883]
We develop a scalable approach for extracting linear representations of general concepts in large-scale AI models. We show how these representations enable model steering, through which we expose vulnerabilities and misaligned behaviors, and improve model capabilities.
arXiv Detail & Related papers (2025-02-06T01:41:48Z)
- A Survey on Vision Autoregressive Model [15.042485771127346]
Autoregressive models have demonstrated impressive performance in natural language processing (NLP).
Inspired by their notable success in the NLP field, autoregressive models have recently been intensively investigated for computer vision.
This paper provides a systematic review of vision autoregressive models, including the development of a taxonomy of existing methods.
arXiv Detail & Related papers (2024-11-13T14:59:41Z)
- Autoregressive Models in Vision: A Survey [119.23742136065307]
This survey comprehensively examines the literature on autoregressive models applied to vision. We divide visual autoregressive models into three general sub-categories, including pixel-based, token-based, and scale-based models. We present a multifaceted categorization of autoregressive models in computer vision, including image generation, video generation, 3D generation, and multimodal generation.
arXiv Detail & Related papers (2024-11-08T17:15:12Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
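The "visual tokens" idea in this entry, mapping visual features to probability distributions over an LMM's vocabulary, can be sketched as a linear projection followed by a softmax. The vocabulary and projection weights below are tiny, hypothetical stand-ins for a real multi-modal model's vocabulary and learned projection.

```python
# Sketch: project a visual feature onto a toy LMM vocabulary and
# normalize the logits into a probability distribution (illustrative only).
import math

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def visual_token_distribution(feature, W_proj, vocab):
    """Map a visual feature to P(word) over the vocabulary."""
    logits = [sum(w * f for w, f in zip(row, feature)) for row in W_proj]
    return dict(zip(vocab, softmax(logits)))

vocab = ["cat", "dog", "sky"]  # toy vocabulary
W_proj = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]
dist = visual_token_distribution([1.0, 0.2], W_proj, vocab)
print(max(dist, key=dist.get))  # the word the feature aligns with most
```

Representing visual information as a distribution over the text vocabulary is what lets the same auto-regressive machinery consume both modalities.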
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data.
We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
- GPT4Image: Large Pre-trained Models Help Vision Models Learn Better on Perception Task [47.1857510710807]
We present a new learning framework, dubbed GPT4Image, in which the knowledge of large pre-trained models is extracted to help CNNs and ViTs learn better representations. We conduct extensive experiments to verify the effectiveness of the proposed algorithm on various visual perception tasks.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Scaling Vision-Language Models with Sparse Mixture of Experts [128.0882767889029]
We show that mixture-of-experts (MoE) techniques can achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost.
Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-off between compute and performance when scaling vision-language models.
arXiv Detail & Related papers (2023-03-13T16:00:31Z)
This list is automatically generated from the titles and abstracts of the papers on this site.