Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor
- URL: http://arxiv.org/abs/2507.03542v2
- Date: Wed, 09 Jul 2025 03:01:23 GMT
- Title: Beyond Accuracy: Metrics that Uncover What Makes a 'Good' Visual Descriptor
- Authors: Ethan Lin, Linxi Zhao, Atharva Sehgal, Jennifer J. Sun
- Abstract summary: Descriptors are used in visual concept discovery and image classification with vision-language models (VLMs). We systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics--Global Alignment and CLIP Similarity--that move beyond accuracy.
- Score: 4.76296755805531
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-based visual descriptors--ranging from simple class names to more descriptive phrases--are widely used in visual concept discovery and image classification with vision-language models (VLMs). Their effectiveness, however, depends on a complex interplay of factors, including semantic clarity, presence in the VLM's pre-training data, and how well the descriptors serve as a meaningful representation space. In this work, we systematically analyze descriptor quality along two key dimensions: (1) representational capacity, and (2) relationship with VLM pre-training data. We evaluate a spectrum of descriptor generation methods, from zero-shot LLM-generated prompts to iteratively refined descriptors. Motivated by ideas from representation alignment and language understanding, we introduce two alignment-based metrics--Global Alignment and CLIP Similarity--that move beyond accuracy. These metrics shed light on how different descriptor generation strategies interact with foundation model properties, offering new ways to study descriptor effectiveness beyond accuracy evaluations.
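The abstract names the two metrics but this listing carries no implementation details. As a minimal sketch, a CLIP Similarity-style score can be read as the average image-text cosine similarity between a class's descriptors and images of that class; the snippet below illustrates that reading, with the model choice, pooling, and function name being our assumptions rather than the authors' definitions.

```python
# Illustrative sketch only: scores a descriptor set by mean image-text
# cosine similarity under CLIP. The paper's exact Global Alignment /
# CLIP Similarity definitions may differ.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_similarity(descriptors, images):
    """descriptors: list of strings; images: list of PIL images of one class."""
    inputs = processor(text=descriptors, images=images,
                       return_tensors="pt", padding=True)
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    return (text_emb @ img_emb.T).mean().item()  # mean pairwise cosine similarity
```

A higher score would indicate descriptors that sit closer to the class's images in CLIP space, which is the kind of beyond-accuracy signal the abstract describes.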
Related papers
- SGC-Net: Stratified Granular Comparison Network for Open-Vocabulary HOI Detection [16.89965584177711]
Recent open-vocabulary human-object interaction (OV-HOI) detection methods rely on large language models (LLMs) to generate auxiliary descriptions and leverage knowledge distilled from CLIP to detect unseen interaction categories. Despite their effectiveness, these methods face two challenges: (1) feature granularity deficiency, due to reliance on last-layer visual features for text alignment, leading to the neglect of crucial object-level details from intermediate layers; (2) semantic similarity confusion, resulting from CLIP's inherent biases toward certain classes, while LLM-generated descriptions based solely on labels fail to adequately capture inter-class similarities.
arXiv Detail & Related papers (2025-03-01T09:26:05Z)
- Does VLM Classification Benefit from LLM Description Semantics? [26.743684911323857]
We propose a training-free method for selecting discriminative descriptions that work independently of classname-ensembling effects. Our approach identifies descriptions that effectively differentiate classes within a local CLIP label neighborhood, improving classification accuracy across seven datasets.
arXiv Detail & Related papers (2024-12-16T16:01:18Z)
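The entry above describes a selection criterion but no code; the sketch below shows one plausible reading, where a description is kept if it is closer to its own class name than to the class's nearest CLIP-text neighbors. All embeddings are assumed precomputed and L2-normalized, and the exact criterion in the paper may differ.

```python
# Hypothetical illustration of training-free description selection within
# a local CLIP label neighborhood (not the paper's exact algorithm).
import torch

def select_discriminative(desc_emb, class_emb, target_idx, k=5, keep=10):
    """desc_emb: (D, d) description embeddings for one class;
    class_emb: (C, d) class-name embeddings; all rows L2-normalized."""
    sims = class_emb[target_idx] @ class_emb.T        # (C,) class-to-class similarity
    sims[target_idx] = -1.0                           # exclude the class itself
    neighbors = sims.topk(k).indices                  # local label neighborhood
    margin = (desc_emb @ class_emb[target_idx]        # affinity to own class
              - (desc_emb @ class_emb[neighbors].T).max(dim=1).values)
    return margin.topk(min(keep, margin.numel())).indices  # most discriminative
```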
- Verbalized Representation Learning for Interpretable Few-Shot Generalization [130.8173035901391]
Verbalized Representation Learning (VRL) is a novel approach for automatically extracting human-interpretable features for object recognition. Our method captures inter-class differences and intra-class commonalities in the form of natural language. VRL achieves a 24% absolute improvement over prior state-of-the-art methods.
arXiv Detail & Related papers (2024-11-27T01:55:08Z)
- RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z)
- LLMs as Visual Explainers: Advancing Image Classification with Evolving Visual Descriptions [13.546494268784757]
We propose a framework that integrates large language models (LLMs) and vision-language models (VLMs) to find the optimal class descriptors.
Our training-free approach develops an LLM-based agent with an evolutionary optimization strategy to iteratively refine class descriptors.
arXiv Detail & Related papers (2023-11-20T16:37:45Z)
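The evolutionary refinement loop mentioned above can be pictured as a generate-score-select cycle; below is a minimal sketch under that assumption, where `llm_mutate` and `zero_shot_accuracy` are hypothetical stand-ins for the paper's LLM agent and VLM-based fitness evaluation.

```python
# Sketch of an evolutionary descriptor-refinement loop (illustrative only).
import random

def evolve_descriptors(init_population, llm_mutate, zero_shot_accuracy,
                       generations=10, survivors=4, seed=0):
    """init_population: list of candidate descriptor sets.
    llm_mutate: hypothetical callable asking an LLM to rewrite a set.
    zero_shot_accuracy: hypothetical callable scoring a set with a VLM."""
    rng = random.Random(seed)
    population = list(init_population)
    for _ in range(generations):
        ranked = sorted(population, key=zero_shot_accuracy, reverse=True)
        parents = ranked[:survivors]                  # keep the fittest descriptor sets
        children = [llm_mutate(rng.choice(parents))
                    for _ in range(len(population) - survivors)]
        population = parents + children               # next generation
    return max(population, key=zero_shot_accuracy)
```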
- Videoprompter: an ensemble of foundational models for zero-shot video understanding [113.92958148574228]
Vision-language models (VLMs) classify the query video by calculating a similarity score between the visual features and text-based class label representations.
We propose a framework which combines pre-trained discriminative VLMs with pre-trained generative video-to-text and text-to-text models.
arXiv Detail & Related papers (2023-10-23T19:45:46Z)
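The similarity-scoring step described in this entry is standard zero-shot classification; a minimal sketch follows, assuming a video is pooled into its mean frame embedding (Videoprompter's generative ensemble is not reproduced here).

```python
# Illustrative zero-shot video scoring: mean-pooled frame embeddings
# matched against class-label text embeddings.
import torch

@torch.no_grad()
def classify_video(frame_emb, class_text_emb):
    """frame_emb: (T, d) per-frame visual embeddings; class_text_emb: (C, d)."""
    video = frame_emb.mean(dim=0)                             # pool frames
    video = video / video.norm()
    text = class_text_emb / class_text_emb.norm(dim=-1, keepdim=True)
    scores = text @ video                                     # (C,) similarities
    return scores.argmax().item(), scores.softmax(dim=0)      # label, confidences
```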
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors). This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings of each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
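The SLR-AVD pipeline's final step is concrete enough to sketch: given a matrix of image-to-descriptor similarity scores, an L1-penalized logistic regression selects a sparse, relevant subset of descriptors. The hyperparameters below are illustrative, not the paper's.

```python
# Sketch of the sparse-selection step with scikit-learn (illustrative values).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_sparse_descriptor_classifier(features, labels, C=0.1):
    """features: (N, D) image-to-descriptor similarity matrix; labels: (N,)."""
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    clf.fit(features, labels)
    # Descriptors with any nonzero weight survive the L1 penalty.
    selected = np.flatnonzero(np.any(clf.coef_ != 0, axis=0))
    return clf, selected
```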
- Waffling around for Performance: Visual Classification with Random Words and Broad Concepts [121.60918966567657]
WaffleCLIP is a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors.
We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors.
arXiv Detail & Related papers (2023-06-12T17:59:48Z)
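The random-descriptor baseline is simple to reproduce in spirit; the sketch below pads each class name with random character strings, though the paper's exact prompt template may differ.

```python
# Illustrative WaffleCLIP-style prompts built from random character strings.
import random
import string

def random_descriptors(classname, n=8, length=6, seed=0):
    rng = random.Random(seed)
    waffles = ["".join(rng.choices(string.ascii_lowercase, k=length))
               for _ in range(n)]
    return [f"a photo of a {classname}, which has {w}." for w in waffles]
```

Averaging the text embeddings of such prompts gives a descriptor ensemble that can be compared directly against LLM-generated alternatives.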
- DisCLIP: Open-Vocabulary Referring Expression Generation [37.789850573203694]
We build on CLIP, a large-scale visual-semantic model, to guide an LLM to generate a contextual description of a target concept in an image.
We measure the quality of the generated text by evaluating the capability of a receiver model to accurately identify the described object within the scene.
Our results highlight the potential of using pre-trained visual-semantic models for generating high-quality contextual descriptions.
arXiv Detail & Related papers (2023-05-30T15:13:17Z)
- OvarNet: Towards Open-vocabulary Object Attribute Recognition [42.90477523238336]
We propose a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr.
The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes.
We show that recognition of semantic category and attributes is complementary for visual scene understanding.
arXiv Detail & Related papers (2023-01-23T15:59:29Z)
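The two-stage recipe in this entry (offline proposals, then open-vocabulary classification) can be sketched as scoring each region crop against category and attribute text embeddings; the threshold and shapes below are assumptions for illustration, not CLIP-Attr's actual design.

```python
# Illustrative second stage of a CLIP-Attr-style pipeline: classify
# precomputed RPN crops by category and multi-label attributes.
import torch

@torch.no_grad()
def classify_proposals(crop_emb, category_emb, attribute_emb, attr_thresh=0.25):
    """crop_emb: (R, d) embeddings of region crops;
    category_emb: (K, d); attribute_emb: (A, d)."""
    def norm(t):
        return t / t.norm(dim=-1, keepdim=True)
    crop, cat, attr = norm(crop_emb), norm(category_emb), norm(attribute_emb)
    categories = (crop @ cat.T).argmax(dim=1)        # one semantic category per box
    attributes = (crop @ attr.T) > attr_thresh       # multi-label attribute flags
    return categories, attributes
```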
- DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection [118.36746273425354]
This paper presents a paralleled visual-concept pre-training method for open-world detection by resorting to knowledge enrichment from a designed concept dictionary.
By enriching the concepts with their descriptions, we explicitly build relationships among various concepts to facilitate open-domain learning.
The proposed framework demonstrates strong zero-shot detection performance; e.g., on the LVIS dataset, our DetCLIP-T outperforms GLIP-T by 9.9% mAP and obtains a 13.5% improvement on rare categories.
arXiv Detail & Related papers (2022-09-20T02:01:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.