Evolving Interpretable Visual Classifiers with Large Language Models
- URL: http://arxiv.org/abs/2404.09941v1
- Date: Mon, 15 Apr 2024 17:09:53 GMT
- Title: Evolving Interpretable Visual Classifiers with Large Language Models
- Authors: Mia Chiquier, Utkarsh Mall, Carl Vondrick
- Abstract summary: Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to their open-vocabulary flexibility and high performance.
However, vision-language models, which compute similarity scores between images and class labels, are largely black-box, with limited interpretability, a risk of bias, and an inability to discover new visual concepts that are not written down.
We present a novel method that discovers interpretable yet discriminative sets of attributes for visual recognition.
- Score: 34.4903887876357
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal pre-trained models, such as CLIP, are popular for zero-shot classification due to their open-vocabulary flexibility and high performance. However, vision-language models, which compute similarity scores between images and class labels, are largely black-box, with limited interpretability, a risk of bias, and an inability to discover new visual concepts that are not written down. Moreover, in practical settings, the vocabulary of class names and attributes for specialized concepts will not be known, preventing these methods from performing well on images that are uncommon in large-scale vision-language datasets. To address these limitations, we present a novel method that discovers interpretable yet discriminative sets of attributes for visual recognition. We introduce an evolutionary search algorithm that uses a large language model and its in-context learning abilities to iteratively mutate a concept bottleneck of attributes for classification. Our method produces state-of-the-art, interpretable fine-grained classifiers. We outperform the latest baselines by 18.4% on five fine-grained iNaturalist datasets and by 22.2% on two KikiBouba datasets, despite the baselines having access to privileged information about class names.
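A minimal sketch of the evolutionary loop the abstract describes, assuming attribute sets are scored by CLIP text/image similarity via Hugging Face transformers; the LLM mutation step is stubbed with a random edit (the paper's actual prompts are not reproduced here), image features are assumed precomputed and L2-normalized, and all names are illustrative:

```python
# Sketch of the evolutionary attribute search. Assumption: the LLM mutation
# step is replaced by a random edit so the loop runs end to end; the paper
# instead prompts an LLM with high-scoring sets and asks for improved ones.
import random

import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_text(phrases):
    """Return L2-normalized CLIP text embeddings for a list of phrases."""
    inputs = processor(text=phrases, return_tensors="pt", padding=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)

def fitness(attributes, image_feats, labels):
    """Accuracy when each image is assigned to the class whose attribute
    phrases are, on average, most similar to the image embedding."""
    class_names = list(attributes)  # dict: class name -> attribute phrases
    per_class = []
    for name in class_names:
        sims = image_feats @ embed_text(attributes[name]).T  # (N, num_attrs)
        per_class.append(sims.mean(dim=-1))
    preds = torch.stack(per_class, dim=-1).argmax(dim=-1)
    gold = torch.tensor([class_names.index(l) for l in labels])
    return (preds == gold).float().mean().item()

def mutate(attributes):
    """Stand-in for the LLM mutation: randomly drop one attribute."""
    new = {k: list(v) for k, v in attributes.items()}
    cls = random.choice(list(new))
    if len(new[cls]) > 1:
        new[cls].pop(random.randrange(len(new[cls])))
    return new

def evolve(population, image_feats, labels, generations=10, keep=4):
    """Keep the fittest attribute sets each generation, refill by mutation."""
    for _ in range(generations):
        population.sort(key=lambda a: fitness(a, image_feats, labels),
                        reverse=True)
        parents = population[:keep]
        population = parents + [mutate(random.choice(parents)) for _ in parents]
    return population[0]
```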
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
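A minimal sketch of the prompt-based (in-context) classification idea above; the generate callable stands in for whichever instruction-tuned LLM is used, and the labels and demonstrations are invented for illustration:

```python
# Sketch of few-shot, prompt-based classification. `generate` stands in for
# any LLM callable (prompt -> completion); demos below are hypothetical.
FEW_SHOT = [
    ("The battery died after an hour.", "negative"),
    ("Setup took thirty seconds and it just works.", "positive"),
]

def build_prompt(text, labels=("positive", "negative")):
    lines = [f"Classify each review as one of: {', '.join(labels)}."]
    for demo, label in FEW_SHOT:
        lines.append(f"Review: {demo}\nLabel: {label}")
    lines.append(f"Review: {text}\nLabel:")
    return "\n\n".join(lines)

def classify(text, generate):
    # Take the first token of the completion as the predicted label.
    return generate(build_prompt(text)).strip().split()[0].lower()
```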
- Learning to Name Classes for Vision and Language Models [57.0059455405424]
Large scale vision and language models can achieve impressive zero-shot recognition performance by mapping class specific text queries to image content.
We propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content.
By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names.
arXiv Detail & Related papers (2023-04-04T14:34:44Z)
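A hedged sketch of the per-class embedding learning above: optimize one embedding per class against frozen, L2-normalized image features. Shapes, the temperature (100.0), and hyperparameters are assumptions, not the paper's exact parameterization:

```python
import torch
import torch.nn.functional as F

def learn_class_embeddings(image_feats, labels, num_classes, dim,
                           steps=500, lr=0.01):
    # image_feats: (N, D) frozen features; labels: (N,) integer class ids
    class_emb = torch.nn.Parameter(0.02 * torch.randn(num_classes, dim))
    opt = torch.optim.Adam([class_emb], lr=lr)
    for _ in range(steps):
        emb = class_emb / class_emb.norm(dim=-1, keepdim=True)
        logits = 100.0 * image_feats @ emb.T  # CLIP-style scaled cosine
        loss = F.cross_entropy(logits, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return class_emb.detach()
```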
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
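A rough sketch of the clustering idea above, assuming precomputed local-region features: group patches into visual-property clusters with k-means and describe each class by its cluster-usage histogram (a simplification of the paper's method):

```python
import numpy as np
from sklearn.cluster import KMeans

def visual_property_embeddings(patch_feats, patch_class_ids, num_classes, k=64):
    # patch_feats: (P, D) local-region features from seen-class images
    # patch_class_ids: (P,) integer class id of each patch's source image
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(patch_feats)
    emb = np.zeros((num_classes, k))
    for cluster, cls in zip(km.labels_, patch_class_ids):
        emb[cls, cluster] += 1
    # Normalize each class's histogram of cluster usage
    return emb / np.maximum(emb.sum(axis=1, keepdims=True), 1)
```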
- On Guiding Visual Attention with Language Specification [76.08326100891571]
We use high-level language specification as advice for constraining the classification evidence to task-relevant features, rather than distractors.
We show that supervising spatial attention in this way improves performance on classification tasks with biased and noisy data.
arXiv Detail & Related papers (2022-02-17T22:40:19Z)
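A hedged sketch of what attention supervision of this kind can look like: the usual classification loss plus a penalty on attention mass falling outside a language-derived relevance mask. The mask construction and the weighting lam are assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def attention_guidance_loss(logits, labels, attn_map, relevance_mask, lam=1.0):
    # attn_map, relevance_mask: (B, H, W); mask is 1 on task-relevant pixels
    cls_loss = F.cross_entropy(logits, labels)
    total = attn_map.flatten(1).sum(dim=1).clamp_min(1e-8)
    attn = attn_map / total.view(-1, 1, 1)  # normalize attention per image
    off_target = (attn * (1 - relevance_mask)).flatten(1).sum(dim=1)
    return cls_loss + lam * off_target.mean()
```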
- Learning and Evaluating Representations for Deep One-class Classification [59.095144932794646]
We present a two-stage framework for deep one-class classification.
We first learn self-supervised representations from one-class data, and then build one-class classifiers on learned representations.
In experiments, we demonstrate state-of-the-art performance on visual domain one-class classification benchmarks.
arXiv Detail & Related papers (2020-11-04T23:33:41Z)
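A minimal sketch of the two-stage recipe above, with stage-one self-supervised features assumed precomputed and stage two fit with scikit-learn's one-class SVM (the paper evaluates several such detectors; this is just one concrete instance):

```python
from sklearn.svm import OneClassSVM

def fit_one_class(train_feats):
    # train_feats: (N, D) representations of the single "normal" class
    return OneClassSVM(kernel="rbf", nu=0.1, gamma="scale").fit(train_feats)

def anomaly_scores(clf, test_feats):
    # Higher score = more anomalous (negated distance to the boundary).
    return -clf.decision_function(test_feats)
```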
- Quantifying Learnability and Describability of Visual Concepts Emerging in Representation Learning [91.58529629419135]
We consider how to characterise visual groupings discovered automatically by deep neural networks.
We introduce two concepts, visual learnability and describability, that can be used to quantify the interpretability of arbitrary image groupings.
arXiv Detail & Related papers (2020-10-27T18:41:49Z)
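A hedged sketch of a learnability-style probe in the spirit of the above: train a simple classifier to predict the discovered group labels and use held-out accuracy as the score (the paper's full protocol, including describability from human descriptions, is not reproduced):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def learnability(feats, group_labels, seed=0):
    # Hold out half the data, probe the grouping, report held-out accuracy.
    tr_x, te_x, tr_y, te_y = train_test_split(
        feats, group_labels, test_size=0.5, random_state=seed)
    probe = LogisticRegression(max_iter=1000).fit(tr_x, tr_y)
    return probe.score(te_x, te_y)  # accuracy in [0, 1]
```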
- Discriminative Dictionary Design for Action Classification in Still Images and Videos [29.930239762446217]
We propose a novel discriminative method for identifying robust and category specific local features.
The framework is validated on the action recognition datasets based on still images and videos.
arXiv Detail & Related papers (2020-05-20T15:56:41Z)
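An illustrative sketch, not the paper's formulation: learn a k-means dictionary over local features, then keep the atoms whose assigned features have the most concentrated (lowest-entropy) class distribution, i.e. the most category-specific atoms:

```python
import numpy as np
from sklearn.cluster import KMeans

def discriminative_atoms(local_feats, labels, k=128, keep=32):
    # local_feats: (P, D) array; labels: (P,) integer class id per feature
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(local_feats)
    entropies = np.full(k, np.inf)
    for atom in range(k):
        atom_labels = labels[km.labels_ == atom]
        if atom_labels.size == 0:
            continue  # empty atom: entropy stays +inf so it is dropped
        p = np.bincount(atom_labels) / atom_labels.size
        p = p[p > 0]
        entropies[atom] = -(p * np.log(p)).sum()
    return km.cluster_centers_[np.argsort(entropies)[:keep]]
```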
- Classification of Chinese Handwritten Numbers with Labeled Projective Dictionary Pair Learning [1.8594711725515674]
We design class-specific dictionaries incorporating three factors: discriminability, sparsity and classification error.
We adopt a new feature space, i.e., histogram of oriented gradients (HOG), to generate the dictionary atoms.
Results demonstrated enhanced classification performance (~98%) compared to state-of-the-art deep learning techniques.
arXiv Detail & Related papers (2020-03-26T01:43:59Z)
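A small sketch of the HOG feature step mentioned above (the dictionary-pair learning model itself is omitted); cell and block sizes are typical choices for 28x28 digit images, not values taken from the paper:

```python
from skimage.feature import hog

def hog_vector(image_gray):
    # image_gray: 2-D grayscale array, e.g. a 28x28 handwritten digit.
    # Returns one histogram-of-oriented-gradients descriptor per image,
    # which is what the class-specific dictionary atoms are built from.
    return hog(image_gray, orientations=9, pixels_per_cell=(7, 7),
               cells_per_block=(2, 2), block_norm="L2-Hys")
```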
- Adapting Deep Learning for Sentiment Classification of Code-Switched Informal Short Text [1.6752182911522517]
We present a labeled dataset called MultiSenti for sentiment classification of code-switched informal short text.
We propose a deep learning-based model for sentiment classification of code-switched informal short text.
arXiv Detail & Related papers (2020-01-04T06:31:15Z)