Diagnosing and Rectifying Vision Models using Language
- URL: http://arxiv.org/abs/2302.04269v1
- Date: Wed, 8 Feb 2023 18:59:42 GMT
- Title: Diagnosing and Rectifying Vision Models using Language
- Authors: Yuhui Zhang, Jeff Z. HaoChen, Shih-Cheng Huang, Kuan-Chieh Wang, James Zou, Serena Yeung
- Abstract summary: Recent contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers.
Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language.
Our proposed method can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors.
- Score: 31.588965563961573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent multi-modal contrastive learning models have demonstrated the ability
to learn an embedding space suitable for building strong vision classifiers, by
leveraging the rich information in large-scale image-caption datasets. Our work
highlights a distinct advantage of this multi-modal embedding space: the
ability to diagnose vision classifiers through natural language. The
traditional process of diagnosing model behaviors in deployment settings
involves labor-intensive data acquisition and annotation. Our proposed method
can discover high-error data slices, identify influential attributes and
further rectify undesirable model behaviors, without requiring any visual data.
Through a combination of theoretical explanation and empirical verification, we
present conditions under which classifiers trained on embeddings from one
modality can be equivalently applied to embeddings from another modality. On a
range of image datasets with known error slices, we demonstrate that our method
can effectively identify the error slices and influential attributes, and can
further use language to rectify failure modes of the classifier.
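The idea in the abstract, a classifier trained on image embeddings being probed with text embeddings, can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the embeddings are synthetic NumPy vectors rather than outputs of a real contrastive model, the names `class_dirs`, `attr_dir`, `text_offset`, and `embed` are hypothetical, and per-modality mean-centering is used as one simple way to bring the two modalities into alignment. It is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_train = 16, 4000

# Hypothetical geometry of a shared embedding space (all values are made up).
class_dirs = np.zeros((2, dim))
class_dirs[0, 0], class_dirs[1, 0] = 0.5, -0.5   # weak class signal
attr_dir = np.zeros(dim)
attr_dir[1] = 2.0                                # strong attribute signal
text_offset = np.full(dim, 3.0)                  # constant shift standing in for a modality gap

def embed(labels, attrs, offset, noise):
    """Toy encoder: class direction + attribute direction + modality offset + noise."""
    clean = class_dirs[labels] + attrs[:, None] * attr_dir + offset
    return clean + noise * rng.standard_normal(clean.shape)

def fit(x, y):
    """Least-squares linear classifier on one-hot targets, with a bias column."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    w, *_ = np.linalg.lstsq(xb, np.eye(2)[y], rcond=None)
    return w

def predict(w, x):
    xb = np.hstack([x, np.ones((len(x), 1))])
    return (xb @ w).argmax(axis=1)

# "Image" training set in which the attribute agrees with the class 95% of the time,
# so the classifier can learn to rely on the attribute (a spurious correlation).
labels = rng.integers(0, 2, n_train)
attrs = np.where(rng.random(n_train) < 0.95, labels, 1 - labels)
images = embed(labels, attrs, offset=0.0, noise=0.5)
weights = fit(images - images.mean(axis=0), labels)

# "Text" probes: one noise-free embedding per (class, attribute) combination,
# standing in for prompts such as "a photo of <class> with <attribute>".
probe_labels = np.array([0, 0, 1, 1])
probe_attrs = np.array([0, 1, 0, 1])
texts = embed(probe_labels, probe_attrs, offset=text_offset, noise=0.0)

# Center the text embeddings before applying the image-trained classifier.
probe_preds = predict(weights, texts - texts.mean(axis=0))

# Combinations the classifier gets wrong are candidate high-error slices.
error_slices = [
    (int(c), int(a))
    for c, a, p in zip(probe_labels, probe_attrs, probe_preds)
    if p != c
]
print(error_slices)
```

In this toy setup the misclassified probes are the combinations in which the attribute disagrees with the class, which is the kind of slice that would be expensive to find by collecting and annotating images.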
Related papers
- Unsupervised Model Diagnosis [49.36194740479798]
This paper proposes Unsupervised Model Diagnosis (UMO) to produce semantic counterfactual explanations without any user guidance.
Our approach identifies and visualizes changes in semantics, and then matches these changes to attributes from wide-ranging text sources.
(2024-10-08T17:59:03Z)
- Machine Vision Therapy: Multimodal Large Language Models Can Enhance Visual Robustness via Denoising In-Context Learning [67.0609518552321]
We propose to conduct Machine Vision Therapy which aims to rectify the noisy predictions from vision models.
By fine-tuning with the denoised labels, the learning model performance can be boosted in an unsupervised manner.
(2023-12-05T07:29:14Z)
- Towards Better Modeling with Missing Data: A Contrastive Learning-based Visual Analytics Perspective [7.577040836988683]
Missing data can pose a challenge for machine learning (ML) modeling.
Current approaches are categorized into feature imputation and label prediction.
This study proposes a Contrastive Learning framework to model observed data with missing values.
(2023-09-18T13:16:24Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
(2023-06-01T15:39:38Z)
- Denoising Diffusion Probabilistic Models for Generation of Realistic Fully-Annotated Microscopy Image Data Sets [1.07539359851877]
In this study, we demonstrate that diffusion models can effectively generate fully-annotated microscopy image data sets.
The proposed pipeline helps to reduce the reliance on manual annotations when training deep learning-based segmentation approaches.
(2023-01-02T14:17:08Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
(2022-11-28T14:58:15Z)
- Schema-aware Reference as Prompt Improves Data-Efficient Knowledge Graph Construction [57.854498238624366]
We propose a retrieval-augmented approach, which retrieves schema-aware Reference As Prompt (RAP) for data-efficient knowledge graph construction.
RAP can dynamically leverage schema and knowledge inherited from human-annotated and weak-supervised data as a prompt for each sample.
(2022-10-19T16:40:28Z)
- Discovering Bugs in Vision Models using Off-the-shelf Image Generation and Captioning [25.88974494276895]
This work demonstrates how off-the-shelf, large-scale, image-to-text and text-to-image models can be leveraged to automatically find failures.
In essence, a conditional text-to-image generative model is used to generate large amounts of synthetic, yet realistic, inputs.
(2022-08-18T13:49:10Z)
- Discriminative Multimodal Learning via Conditional Priors in Generative Models [21.166519800652047]
This research studies the realistic scenario in which all modalities and class labels are available for model training.
We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities.
(2021-10-09T17:22:24Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
(2020-04-29T04:07:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.