Understanding Visual Concepts Across Models
- URL: http://arxiv.org/abs/2406.07506v1
- Date: Tue, 11 Jun 2024 17:40:31 GMT
- Title: Understanding Visual Concepts Across Models
- Authors: Brandon Trabucco, Max Gurinas, Kyle Doherty, Ruslan Salakhutdinov
- Abstract summary: We conduct a large-scale analysis on three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification.
We find perturbations within an $\epsilon$-ball to any prior embedding that generate, detect, and classify an arbitrary concept.
When these new embeddings are spliced into new models, fine-tuning that targets the original model is lost.
- Score: 45.18188726287581
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large multimodal models such as Stable Diffusion can generate, detect, and classify new visual concepts after fine-tuning just a single word embedding. Do models learn similar words for the same concepts (i.e. <orange-cat> = orange + cat)? We conduct a large-scale analysis on three state-of-the-art models in text-to-image generation, open-set object detection, and zero-shot classification, and find that new word embeddings are model-specific and non-transferable. Across 4,800 new embeddings trained for 40 diverse visual concepts on four standard datasets, we find perturbations within an $\epsilon$-ball to any prior embedding that generate, detect, and classify an arbitrary concept. When these new embeddings are spliced into new models, fine-tuning that targets the original model is lost. We show popular soft prompt-tuning approaches find these perturbative solutions when applied to visual concept learning tasks, and embeddings for visual concepts are not transferable. Code for reproducing our work is available at: https://visual-words.github.io.
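The core mechanism, fine-tuning a single word embedding while the rest of the model stays frozen, can be illustrated with a short sketch. Everything below is a hedged reconstruction from the abstract: `frozen_model.concept_loss` and the data iterator are hypothetical placeholders, not the paper's released code (see https://visual-words.github.io for that).

```python
# Minimal sketch of single-embedding concept learning (textual-inversion style).
# Model and loss names are illustrative placeholders, not the paper's code.
import torch

def learn_concept_embedding(frozen_model, train_batches, init_embedding,
                            lr=5e-3, steps=1000):
    """Optimize one word embedding while all model weights stay frozen."""
    embedding = init_embedding.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([embedding], lr=lr)
    for _, (images, _) in zip(range(steps), train_batches):
        # Task loss, e.g. a diffusion denoising MSE or a detection loss.
        loss = frozen_model.concept_loss(images, embedding)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # The paper's key observation: the solution typically stays inside a small
    # epsilon-ball around the initializer, i.e. a model-specific perturbation.
    print("l2 distance from init:", (embedding - init_embedding).norm().item())
    return embedding.detach()
```

The final distance check mirrors the paper's central finding: the learned embedding is a small, model-specific perturbation of its initializer, which is why splicing it into a different model discards the fine-tuning.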
Related papers
- Uncovering Conceptual Blindspots in Generative Image Models Using Sparse Autoencoders [28.04396148117613]
We introduce a systematic approach for identifying conceptual blindspots in generative image models. Our approach reveals both suppressed and exaggerated blindspots. Overall, we propose a theoretically grounded framework for systematically identifying conceptual blindspots in generative models (a minimal sparse-autoencoder sketch appears after this list).
arXiv Detail & Related papers (2025-06-24T15:15:15Z)
- Just Say the Word: Annotation-Free Fine-Grained Object Counting [22.31750687552324]
Fine-grained object counting remains a major challenge for class-agnostic counting models. We propose an alternative paradigm: given a category name, tune a compact concept embedding from the prompt using synthetic images and pseudo-labels. This embedding conditions a specialization module that refines raw overcounts from any frozen counter into accurate, category-specific estimates.
arXiv Detail & Related papers (2025-04-16T02:05:47Z)
- Achieving Data Efficient Neural Networks with Hybrid Concept-based Models [0.0]
We introduce two novel model architectures that train on both class labels and additional dataset information referred to as concepts.
We show that the hybrid concept-based models outperform standard computer vision models with respect to accuracy, especially in sparse data settings.
We also introduce an algorithm for performing adversarial concept attacks, where an image is perturbed so that a concept-based model's concept predictions are unchanged but its class prediction flips (see the attack sketch after this list).
arXiv Detail & Related papers (2024-08-14T10:15:34Z)
- Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use [14.2527771630478]
We propose a new framework that alleviates manual effort by replacing human labeling with natural language interactions.
Our framework eliminates the need for crowd-sourced annotations.
Our trained models outperform traditional Agile Modeling as well as state-of-the-art zero-shot classification models.
arXiv Detail & Related papers (2024-03-05T03:34:11Z)
- Context-Aware Meta-Learning [52.09326317432577]
We propose a meta-learning algorithm that emulates Large Language Models by learning new visual concepts during inference without fine-tuning.
Our approach exceeds or matches the state-of-the-art algorithm, P>M>F, on 8 out of 11 meta-learning benchmarks.
arXiv Detail & Related papers (2023-10-17T03:35:27Z)
- Multi-Concept Customization of Text-to-Image Diffusion [51.8642043743222]
We propose Custom Diffusion, an efficient method for augmenting existing text-to-image models.
We find that optimizing only a few parameters in the text-to-image conditioning mechanism is powerful enough to represent new concepts (a sketch of this selective fine-tuning appears after the list).
Our model generates variations of multiple new concepts and seamlessly composes them with existing concepts in novel settings.
arXiv Detail & Related papers (2022-12-08T18:57:02Z)
- Inter-model Interpretability: Self-supervised Models as a Case Study [0.2578242050187029]
We build on a recent interpretability technique called Dissect to introduce inter-model interpretability.
We project 13 top-performing self-supervised models into a Learned Concepts Embedding space that reveals proximities among models from the perspective of learned concepts.
The experiment allowed us to group the models into three categories and revealed, for the first time, the types of visual concepts that different tasks require.
arXiv Detail & Related papers (2022-07-24T22:50:18Z)
- Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z)
- Visual Prompting: Modifying Pixel Space to Adapt Pre-trained Models [29.413887954758053]
We introduce visual prompting, which learns a task-specific image perturbation such that a frozen pre-trained model prompted with this perturbation performs a new task.
We discover that changing only a few pixels is enough to adapt models to new tasks and datasets, performing on par with linear probing (see the visual-prompting sketch after this list).
arXiv Detail & Related papers (2022-03-31T17:59:30Z)
- VGSE: Visually-Grounded Semantic Embeddings for Zero-Shot Learning [113.50220968583353]
We propose to discover semantic embeddings containing discriminative visual properties for zero-shot learning.
Our model visually divides a set of images from seen classes into clusters of local image regions according to their visual similarity.
We demonstrate that our visually-grounded semantic embeddings further improve performance over word embeddings across various ZSL models by a large margin.
arXiv Detail & Related papers (2022-03-20T03:49:02Z)
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class semantics not only to generate the features but also to separate them discriminatively.
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
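Sketches for several of the entries above follow. First, the sparse-autoencoder approach to conceptual blindspots: the generic SAE below, trained on frozen image-model features, is an assumption about a typical setup, not the authors' code.

```python
# Minimal sparse autoencoder over frozen image-model features (illustrative only).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, feat_dim=768, dict_size=8192, l1_coef=1e-3):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, dict_size)
        self.decoder = nn.Linear(dict_size, feat_dim, bias=False)
        self.l1_coef = l1_coef

    def forward(self, feats):
        codes = torch.relu(self.encoder(feats))  # sparse concept activations
        recon = self.decoder(codes)
        loss = ((recon - feats) ** 2).mean() + self.l1_coef * codes.abs().mean()
        return codes, recon, loss

# Dictionary units that fire on real images but rarely on generated ones would
# mark candidate "suppressed" blindspots; the converse suggests "exaggerated" ones.
```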
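Next, the adversarial concept attack from the hybrid concept-based models entry, read as a constrained perturbation search: the two-term objective (flip the class, hold the concepts) is my interpretation of the summary, and the `model` interface is assumed.

```python
# Sketch of an adversarial concept attack: perturb the image so the class
# prediction changes while concept predictions stay (approximately) fixed.
import torch
import torch.nn.functional as F

def concept_attack(model, image, true_class, steps=40, step_size=1e-2, lam=10.0):
    """`model` is assumed to return (class_logits, concept_logits)."""
    with torch.no_grad():
        _, concepts_ref = model(image)
    delta = torch.zeros_like(image, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=step_size)
    for _ in range(steps):
        class_logits, concept_logits = model(image + delta)
        # Maximize the class loss while penalizing any drift in concepts.
        loss = (-F.cross_entropy(class_logits, true_class)
                + lam * F.mse_loss(concept_logits, concepts_ref))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (image + delta).detach()
```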
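Custom Diffusion's observation that a few conditioning parameters suffice can be sketched by unfreezing only the cross-attention key/value projections of a diffusion UNet. The `attn2.to_k`/`attn2.to_v` name filter assumes diffusers-style parameter naming and is illustrative, not the authors' exact code.

```python
# Sketch: fine-tune only cross-attention K/V projections (Custom-Diffusion style).
def select_trainable_params(unet):
    trainable = []
    for name, param in unet.named_parameters():
        # In diffusers UNets, attn2 is the text-conditioned cross-attention.
        if "attn2.to_k" in name or "attn2.to_v" in name:
            param.requires_grad_(True)
            trainable.append(param)
        else:
            param.requires_grad_(False)
    return trainable

# Usage sketch: optimizer = torch.optim.AdamW(select_trainable_params(unet), lr=1e-5)
```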
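Finally, visual prompting amounts to optimizing a single input-space perturbation shared across a dataset while the model stays frozen; the sketch below follows that common formulation, with all names illustrative.

```python
# Sketch of visual prompting: learn one input-space perturbation that adapts a
# frozen classifier to a new task (illustrative, not the authors' release).
import torch
import torch.nn.functional as F

def train_visual_prompt(frozen_model, loader, image_shape=(3, 224, 224),
                        lr=0.1, epochs=5):
    prompt = torch.zeros(1, *image_shape, requires_grad=True)
    opt = torch.optim.SGD([prompt], lr=lr)
    for _ in range(epochs):
        for images, labels in loader:
            logits = frozen_model(images + prompt)  # weights stay frozen
            loss = F.cross_entropy(logits, labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return prompt.detach()
```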