Discovering Bugs in Vision Models using Off-the-shelf Image Generation
and Captioning
- URL: http://arxiv.org/abs/2208.08831v2
- Date: Thu, 11 May 2023 17:13:16 GMT
- Title: Discovering Bugs in Vision Models using Off-the-shelf Image Generation
and Captioning
- Authors: Olivia Wiles, Isabela Albuquerque, Sven Gowal
- Abstract summary: This work demonstrates how off-the-shelf, large-scale, image-to-text and text-to-image models can be leveraged to automatically find failures.
In essence, a conditional text-to-image generative model is used to generate large amounts of synthetic, yet realistic, inputs.
- Score: 25.88974494276895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically discovering failures in vision models under real-world settings
remains an open challenge. This work demonstrates how off-the-shelf,
large-scale, image-to-text and text-to-image models, trained on vast amounts of
data, can be leveraged to automatically find such failures. In essence, a
conditional text-to-image generative model is used to generate large amounts of
synthetic, yet realistic, inputs given a ground-truth label. Misclassified
inputs are clustered and a captioning model is used to describe each cluster.
Each cluster's description is used in turn to generate more inputs and assess
whether specific clusters induce more failures than expected. We use this
pipeline to demonstrate that we can effectively interrogate classifiers trained
on ImageNet to find specific failure cases and discover spurious correlations.
We also show that we can scale the approach to generate adversarial datasets
targeting specific classifier architectures. This work serves as a
proof-of-concept demonstrating the utility of large-scale generative models to
automatically discover bugs in vision models in an open-ended manner. We also
describe a number of limitations and pitfalls related to this approach.
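
The loop described in the abstract maps naturally onto a small amount of orchestration code. The sketch below is a minimal illustration of that pipeline, assuming hypothetical `generate_images`, `classifier`, `caption`, and `embed` callables that stand in for the off-the-shelf models; only the clustering step uses a real library (scikit-learn's KMeans).

```python
# Minimal sketch of the failure-discovery pipeline from the abstract.
# `generate_images`, `classifier`, `caption`, and `embed` are hypothetical
# stand-ins for a text-to-image model, the model under test, a captioning
# model, and an image feature extractor (e.g. CLIP features).
import numpy as np
from sklearn.cluster import KMeans

def find_failure_clusters(label, generate_images, classifier, caption,
                          embed, n_samples=200, n_clusters=5):
    """Generate inputs for `label`, cluster the misclassified ones,
    and caption each cluster to get a human-readable failure description."""
    images = generate_images(f"a photo of a {label}", n=n_samples)
    failures = [img for img in images if classifier(img) != label]
    if not failures:
        return []
    feats = np.stack([embed(img) for img in failures])
    km = KMeans(n_clusters=min(n_clusters, len(failures)), n_init=10).fit(feats)
    clusters = []
    for c in range(km.n_clusters):
        members = [f for f, a in zip(failures, km.labels_) if a == c]
        # Captioning one representative is a simplification; the paper
        # uses a captioning model to describe each cluster.
        clusters.append({"description": caption(members[0]), "images": members})
    return clusters

def confirm_failure_mode(description, label, generate_images, classifier,
                         n=100, baseline_error=0.05):
    """Re-generate from a cluster's caption and test whether the error
    rate is higher than expected."""
    images = generate_images(description, n=n)
    err = sum(classifier(img) != label for img in images) / n
    return err, err > baseline_error
```

The `confirm_failure_mode` step mirrors the abstract's verification stage: a cluster's caption is fed back to the generator, and the cluster counts as a bug only if the regenerated inputs fail more often than a baseline.
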
Related papers
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
arXiv Detail & Related papers (2024-10-08T17:59:03Z)
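
One way to read the attribute-matching step in the UMO summary above is as a nearest-phrase search in a joint image-text embedding space. The sketch below is a hypothetical illustration, not UMO's actual method: `embed_image` and `embed_text` stand for any joint encoder (e.g. CLIP), and the attribute vocabulary would be supplied by the user of the sketch.

```python
# Hypothetical sketch: rank text attributes by how well they align with
# the direction a counterfactual moved an image in embedding space.
import numpy as np

def describe_edit(original, counterfactual, attributes, embed_image, embed_text):
    """Score candidate attribute phrases against the semantic change
    between an image and its counterfactual."""
    direction = embed_image(counterfactual) - embed_image(original)
    direction = direction / np.linalg.norm(direction)
    scores = {}
    for phrase in attributes:
        t = embed_text(phrase)
        scores[phrase] = float(t @ direction / np.linalg.norm(t))
    return sorted(scores.items(), key=lambda kv: -kv[1])
```
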
- VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection [5.66050466694651]
We propose incorporating Vision-Language (VL) encoders into existing anomaly detectors to leverage the semantically broad VL pre-training for improved outlier awareness.
We also propose a new scoring function that enables data- and training-free outlier supervision via textual prompts.
The resulting VL4AD model achieves competitive performance on widely used benchmark datasets.
arXiv Detail & Related papers (2024-09-25T20:12:10Z)
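
The prompt-based scoring function summarized in the VL4AD entry above can be illustrated at the image level (the paper itself scores pixels). In this sketch, the softmax mass a CLIP model assigns to outlier prompts serves as the anomaly score; the checkpoint name and prompts are assumptions, not the paper's.

```python
# Image-level sketch of prompt-based outlier scoring; VL4AD works pixel-wise.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def anomaly_score(image, inlier_prompts, outlier_prompts):
    """Softmax mass assigned to the outlier prompts acts as the score."""
    texts = inlier_prompts + outlier_prompts
    inputs = processor(text=texts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image[0]  # (len(texts),)
    probs = logits.softmax(dim=-1)
    return probs[len(inlier_prompts):].sum().item()
```
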
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
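
The test-then-fine-tune loop in the summary above condenses to a few lines. This is a schematic sketch, not the paper's exact recipe: `generate_counterfactual` and `train_step` are hypothetical placeholders for the language-guided generator and the fine-tuning procedure.

```python
# Schematic sketch of reinforcing a classifier with counterfactual images.
def reinforce_with_counterfactuals(model, generate_counterfactual, train_step,
                                   dataset, max_rounds=3):
    """Find counterfactual images the model gets wrong, then fine-tune on them."""
    for _ in range(max_rounds):
        hard_examples = []
        for image, label, prompt in dataset:
            cf = generate_counterfactual(image, prompt)  # language-guided edit
            if model(cf) != label:                       # weakness found
                hard_examples.append((cf, label))
        if not hard_examples:
            break
        train_step(model, hard_examples)                 # augment + fine-tune
    return model
```
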
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks while significantly reducing its parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Identifying and Mitigating Model Failures through Few-shot CLIP-aided Diffusion Generation [65.268245109828]
We propose an end-to-end framework to generate text descriptions of failure modes associated with spurious correlations.
These descriptions can be used to generate synthetic data using generative models, such as diffusion models.
Our experiments have shown remarkable improvements in accuracy (~21%) on hard sub-populations.
arXiv Detail & Related papers (2023-12-09T04:43:49Z)
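
Once a failure mode has a text description, turning it into synthetic training data is essentially one call to an off-the-shelf diffusion model. A hedged sketch using the `diffusers` library follows; the checkpoint name and prompt template are examples only, and the paper's pipeline differs in how the descriptions are produced.

```python
# Sketch: turn a failure-mode description into synthetic training data.
from diffusers import StableDiffusionPipeline

# Example checkpoint; any text-to-image checkpoint would do.
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def synthesize_for_failure_mode(description, label, n=50):
    """Generate labeled images depicting a described failure mode."""
    prompt = f"a photo of a {label}, {description}"  # invented prompt template
    return [(pipe(prompt).images[0], label) for _ in range(n)]
```
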
- Self-Supervised Open-Ended Classification with Small Visual Language Models [60.23212389067007]
We present Self-Context Adaptation (SeCAt), a self-supervised approach that unlocks few-shot abilities for open-ended classification with small visual language models.
By using models with approximately 1B parameters we outperform the few-shot abilities of much larger models, such as Frozen and FROMAGe.
arXiv Detail & Related papers (2023-09-30T21:41:21Z)
- Diagnosing and Rectifying Vision Models using Language [31.588965563961573]
Recent contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers.
Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language.
Our proposed method can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors.
arXiv Detail & Related papers (2023-02-08T18:59:42Z)
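
Error-slice discovery through language, as in the summary above, can be approximated by scoring validation images against attribute phrases in a joint embedding space and comparing per-slice error rates. A minimal sketch, assuming hypothetical `embed_image`/`embed_text` encoders and an invented top-5% slice-size threshold:

```python
# Sketch: find attribute phrases whose matching images have high error rates.
import numpy as np

def error_by_attribute(images, labels, preds, phrases, embed_image, embed_text):
    """For each phrase, compute the classifier's error rate on the
    images most similar to that phrase."""
    feats = np.stack([embed_image(im) for im in images])
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    wrong = np.array([p != y for p, y in zip(preds, labels)])
    report = {}
    for phrase in phrases:
        t = embed_text(phrase)
        t /= np.linalg.norm(t)
        top = np.argsort(-feats @ t)[: max(1, len(images) // 20)]  # top-5% slice
        report[phrase] = float(wrong[top].mean())
    return report
```

Phrases whose slices show much higher error than the overall rate are candidate high-error data slices in the sense of the summary above.
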
- Adaptive Testing of Computer Vision Models [22.213542525825144]
We introduce AdaVision, an interactive process for testing vision models that helps users identify and fix coherent failure modes.
We demonstrate the usefulness and generality of AdaVision in user studies, where users find major bugs in state-of-the-art classification, object detection, and image captioning models.
arXiv Detail & Related papers (2022-12-06T05:52:31Z)
- ClipCrop: Conditioned Cropping Driven by Vision-Language Model [90.95403416150724]
We take advantage of vision-language models as a foundation for creating robust, intention-aware cropping algorithms.
We develop a method to perform cropping with a text or image query that reflects the user's intention as guidance.
Our pipeline design allows the model to learn text-conditioned aesthetic cropping with a small dataset.
arXiv Detail & Related papers (2022-11-21T14:27:07Z)
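
The text-query idea in the ClipCrop entry above can be caricatured as ranking candidate crops by similarity to the query in a joint embedding space. The sketch below is only that brute-force baseline, not ClipCrop's learned architecture; every function argument is a hypothetical stand-in.

```python
# Brute-force sketch: pick the crop whose embedding best matches a text query.
import numpy as np

def best_crop(image, query, candidate_boxes, crop_fn, embed_image, embed_text):
    """Return the candidate box whose crop best matches the text query."""
    q = embed_text(query)
    q /= np.linalg.norm(q)
    def score(box):
        v = embed_image(crop_fn(image, box))
        return float(v @ q / np.linalg.norm(v))
    return max(candidate_boxes, key=score)
```
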
- Text2Model: Text-based Model Induction for Zero-shot Image Classification [38.704831945753284]
We address the challenge of building task-agnostic classifiers using only text descriptions.
We generate zero-shot classifiers using a hypernetwork that receives class descriptions and outputs a multi-class model.
We evaluate this approach in a series of zero-shot classification tasks, for image, point-cloud, and action recognition, using a range of text descriptions.
arXiv Detail & Related papers (2022-10-27T05:19:55Z)
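
The hypernetwork at the core of the Text2Model summary maps class-description embeddings to the weights of a classifier head. Below is a toy PyTorch version with invented sizes; it shows the mechanism (descriptions in, per-class weights out), not the paper's architecture.

```python
# Toy hypernetwork: class-description embeddings -> linear classifier weights.
import torch
import torch.nn as nn

class HyperClassifier(nn.Module):
    def __init__(self, text_dim=512, feat_dim=512, hidden=1024):
        super().__init__()
        # Emits one weight vector (plus bias) per class description.
        self.hyper = nn.Sequential(
            nn.Linear(text_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim + 1),
        )

    def forward(self, image_feats, class_desc_embs):
        """image_feats: (B, feat_dim); class_desc_embs: (C, text_dim).
        Returns zero-shot logits of shape (B, C)."""
        wb = self.hyper(class_desc_embs)      # (C, feat_dim + 1)
        w, b = wb[:, :-1], wb[:, -1]
        return image_feats @ w.t() + b

# Usage: logits = HyperClassifier()(torch.randn(8, 512), torch.randn(10, 512))
```
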