Related papers: ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

URL: http://arxiv.org/abs/2408.04102v3
Date: Wed, 2 Oct 2024 12:48:44 GMT
Title: ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling
Authors: William Yicheng Zhu, Keren Ye, Junjie Ke, Jiahui Yu, Leonidas Guibas, Peyman Milanfar, Feng Yang,
Abstract summary: We propose a sentence generation-based retrieval formulation for attribute recognition. For each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets.
Score: 32.55352435358949
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recognizing and disentangling visual attributes from objects is a foundation to many computer vision applications. While large vision language representations like CLIP had largely resolved the task of zero-shot object recognition, zero-shot visual attribute recognition remains a challenge because CLIP's contrastively-learned vision-language representation cannot effectively capture object-attribute dependencies. In this paper, we target this weakness and propose a sentence generation-based retrieval formulation for attribute recognition that is novel in 1) explicitly modeling a to-be-measured and retrieved object-attribute relation as a conditional probability graph, which converts the recognition problem into a dependency-sensitive language-modeling problem, and 2) applying a large pretrained Vision-Language Model (VLM) on this reformulation and naturally distilling its knowledge of image-object-attribute relations to use towards attribute recognition. Specifically, for each attribute to be recognized on an image, we measure the visual-conditioned probability of generating a short sentence encoding the attribute's relation to objects on the image. Unlike contrastive retrieval, which measures likelihood by globally aligning elements of the sentence to the image, generative retrieval is sensitive to the order and dependency of objects and attributes in the sentence. We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets, Visual Attribute in the Wild (VAW), and our newly-proposed Visual Genome Attribute Ranking (VGARank).

Related papers

Omni-Attribute: Open-vocabulary Attribute Encoder for Visual Concept Personalization [82.31106470150844]
We introduce Omni-Attribute, the first open-vocabulary image attribute encoder to learn attribute-specific representations.<n>We use a dual-objective training paradigm that balances generative fidelity with contrastive disentanglement.<n>The resulting embeddings prove effective for open-vocabulary attribute retrieval, personalization, and compositional generation.
arXiv Detail & Related papers (2025-12-11T18:59:56Z)
LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification [78.73711446918814]
We propose a novel framework named LATex for AG-ReID, which adopts prompt-tuning strategies to leverage attribute-based text knowledge.<n>Our framework can fully leverage attribute-based text knowledge to improve AGReID performance.
arXiv Detail & Related papers (2025-03-31T04:47:05Z)
Compositional Caching for Training-free Open-vocabulary Attribute Detection [65.46250297408974]
We present Compositional Caching (ComCa), a training-free method for open-vocabulary attribute detection. ComCa requires only the list of target attributes and objects as input, using them to populate an auxiliary cache of images. Experiments on public datasets demonstrate that ComCa significantly outperforms zero-shot and cache-based baselines.
arXiv Detail & Related papers (2025-03-24T21:00:37Z)
Modeling Visual Memorability Assessment with Autoencoders Reveals Characteristics of Memorable Images [2.4861619769660637]
Image memorability refers to the phenomenon where certain images are more likely to be remembered than others. We modeled the subjective experience of visual memorability using an autoencoder based on VGG16 Convolutional Neural Networks (CNNs) We investigated the relationship between memorability and reconstruction error, assessed latent space representations distinctiveness, and developed a Gated Recurrent Unit (GRU) model to predict memorability likelihood.
arXiv Detail & Related papers (2024-10-19T22:58:33Z)
ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes. ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation. This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
A Generative Approach for Wikipedia-Scale Visual Entity Recognition [56.55633052479446]
We address the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. We introduce a novel Generative Entity Recognition framework, which learns to auto-regressively decode a semantic and discriminative code'' identifying the target entity.
arXiv Detail & Related papers (2024-03-04T13:47:30Z)
Multi-modal Attribute Prompting for Vision-Language Models [40.39559705414497]
Pre-trained Vision-Language Models (VLMs) exhibit strong generalization ability to downstream tasks but struggle in few-shot scenarios. Existing prompting techniques primarily focus on global text and image representations, yet overlooking multi-modal attribute characteristics. We propose a Multi-modal Attribute Prompting method (MAP) by jointly exploring textual attribute prompting, visual attribute prompting, and attribute-level alignment.
arXiv Detail & Related papers (2024-03-01T01:28:10Z)
Improving Generalization of Image Captioning with Unsupervised Prompt Learning [63.26197177542422]
Generalization of Image Captioning (GeneIC) learns a domain-specific prompt vector for the target domain without requiring annotated data. GeneIC aligns visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model.
arXiv Detail & Related papers (2023-08-05T12:27:01Z)
Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection [33.77415850289717]
Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context. It is unclear how methods use this context in learning, as well as whether models succeed when tasks require attribute and object understanding. Our results show that attribute context can be wasted when learning alignment for detection, attribute meaning is not adequately considered in embeddings, and describing classes by only their attributes is ineffective.
arXiv Detail & Related papers (2023-03-17T16:14:37Z)
OvarNet: Towards Open-vocabulary Object Attribute Recognition [42.90477523238336]
We propose a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr. The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes. We show that recognition of semantic category and attributes is complementary for visual scene understanding.
arXiv Detail & Related papers (2023-01-23T15:59:29Z)
Perceptual Grouping in Contrastive Vision-Language Models [59.1542019031645]
We show how vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information.
arXiv Detail & Related papers (2022-10-18T17:01:35Z)
Exploring CLIP for Assessing the Look and Feel of Images [87.97623543523858]
We introduce Contrastive Language-Image Pre-training (CLIP) models for assessing both the quality perception (look) and abstract perception (feel) of images in a zero-shot manner. Our results show that CLIP captures meaningful priors that generalize well to different perceptual assessments.
arXiv Detail & Related papers (2022-07-25T17:58:16Z)
Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks. We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
arXiv Detail & Related papers (2022-04-04T02:25:40Z)

This list is automatically generated from the titles and abstracts of the papers in this site.