Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection
- URL: http://arxiv.org/abs/2303.10093v2
- Date: Mon, 6 Nov 2023 20:58:11 GMT
- Title: Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection
- Authors: Kyle Buettner, Adriana Kovashka
- Abstract summary: Methods are mostly evaluated in terms of how well object class names are learned, but captions also contain rich attribute context.
It is unclear how methods use this context in learning, as well as whether models succeed when tasks require attribute and object understanding.
Our results show that attribute context can be wasted when learning alignment for detection, attribute meaning is not adequately considered in embeddings, and describing classes by only their attributes is ineffective.
- Score: 33.77415850289717
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language alignment learned from image-caption pairs has been shown to
benefit tasks like object recognition and detection. Methods are mostly
evaluated in terms of how well object class names are learned, but captions
also contain rich attribute context that should be considered when learning
object alignment. It is unclear how methods use this context in learning, as
well as whether models succeed when tasks require attribute and object
understanding. To address this gap, we conduct extensive analysis of the role
of attributes in vision-language models. We specifically measure model
sensitivity to the presence and meaning of attribute context, gauging influence
on object embeddings through unsupervised phrase grounding and classification
via description methods. We further evaluate the utility of attribute context
in training for open-vocabulary object detection, fine-grained text-region
retrieval, and attribution tasks. Our results show that attribute context can
be wasted when learning alignment for detection, attribute meaning is not
adequately considered in embeddings, and describing classes by only their
attributes is ineffective. A viable strategy that we find to increase benefits
from attributes is contrastive training with adjective-based negative captions.
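To make the adjective-based negative-caption strategy concrete, below is a minimal Python sketch (not the authors' code) of how such negatives might be constructed and folded into a CLIP-style contrastive loss. The adjective swap table, function names, and the choice to add one hard-negative column per image are illustrative assumptions rather than the paper's exact procedure.

# Illustrative sketch, assuming CLIP-style encoders that produce L2-normalized
# image/text embeddings. The adjective swap rule is a simplified stand-in for
# whatever negative-caption construction the paper uses.
import random
import torch
import torch.nn.functional as F

ADJECTIVE_SWAPS = {  # hypothetical attribute vocabulary, for illustration only
    "red": "blue", "blue": "red", "large": "small", "small": "large",
    "wooden": "metal", "metal": "wooden", "striped": "plain", "plain": "striped",
}

def make_negative_caption(caption: str) -> str:
    """Replace one attribute adjective to create a hard negative caption."""
    tokens = caption.split()
    swappable = [i for i, t in enumerate(tokens) if t.lower() in ADJECTIVE_SWAPS]
    if not swappable:
        return caption  # no adjective found; caller may skip this sample
    i = random.choice(swappable)
    tokens[i] = ADJECTIVE_SWAPS[tokens[i].lower()]
    return " ".join(tokens)

def contrastive_loss_with_adj_negatives(img_emb, pos_txt_emb, neg_txt_emb,
                                         temperature=0.07):
    """InfoNCE where each image sees the usual in-batch caption negatives plus
    one adjective-swapped negative caption. All inputs: (B, D), L2-normalized."""
    logits_pos = img_emb @ pos_txt_emb.t() / temperature                      # (B, B)
    logits_neg = (img_emb * neg_txt_emb).sum(-1, keepdim=True) / temperature  # (B, 1)
    logits = torch.cat([logits_pos, logits_neg], dim=1)                       # (B, B+1)
    targets = torch.arange(img_emb.size(0), device=img_emb.device)            # diagonal positives
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    print(make_negative_caption("a large red car parked on the street"))
    B, D = 4, 512
    img = F.normalize(torch.randn(B, D), dim=-1)
    pos = F.normalize(torch.randn(B, D), dim=-1)
    neg = F.normalize(torch.randn(B, D), dim=-1)
    print(contrastive_loss_with_adj_negatives(img, pos, neg).item())

In this sketch each image contributes its adjective-swapped caption as a single extra hard negative alongside the in-batch negatives; whether swaps are sampled per training step or precomputed offline is an implementation choice the abstract does not specify.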
Related papers
- ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling [32.55352435358949]
We propose a sentence generation-based retrieval formulation for attribute recognition.
For each attribute to be recognized on an image, we measure the visually conditioned probability of generating a short sentence.
We demonstrate through experiments that generative retrieval consistently outperforms contrastive retrieval on two visual reasoning datasets.
arXiv Detail & Related papers (2024-08-07T21:44:29Z)
- TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features, requiring no data formats other than image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments show an average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z)
- LOWA: Localize Objects in the Wild with Attributes [8.922263691331912]
We present LOWA, a novel method for localizing objects with attributes effectively in the wild.
It aims to address the insufficiency of current open-vocabulary object detectors, which lack instance-level attribute classification and struggle with rare class names.
arXiv Detail & Related papers (2023-05-31T17:21:24Z)
- Open-vocabulary Attribute Detection [38.5017012867974]
This paper introduces the Open-Vocabulary Attribute Detection task and the corresponding OVAD benchmark.
The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models.
Overall, the benchmark consists of 1.4 million annotations.
arXiv Detail & Related papers (2022-11-23T12:34:43Z)
- Attribute Prototype Network for Any-Shot Learning [113.50220968583353]
We argue that an image representation with integrated attribute localization ability would be beneficial for any-shot, i.e. zero-shot and few-shot, image classification tasks.
We propose a novel representation learning framework that jointly learns global and local features using only class-level attributes.
arXiv Detail & Related papers (2022-04-04T02:25:40Z)
- Context-LGM: Leveraging Object-Context Relation for Context-Aware Object Recognition [48.5398871460388]
We propose a novel Contextual Latent Generative Model (Context-LGM), which considers the object-context relation and models it in a hierarchical manner.
To infer contextual features, we reformulate the objective function of the Variational Auto-Encoder (VAE), where contextual features are learned as a posterior distribution conditioned on the object.
The effectiveness of our method is verified by state-of-the-art performance on two context-aware object recognition tasks.
arXiv Detail & Related papers (2021-10-08T11:31:58Z)
- Understanding Synonymous Referring Expressions via Contrastive Features [105.36814858748285]
We develop an end-to-end trainable framework to learn contrastive features on the image and object instance levels.
We conduct extensive experiments to evaluate the proposed algorithm on several benchmark datasets.
arXiv Detail & Related papers (2021-04-20T17:56:24Z)
- Learning Object Detection from Captions via Textual Scene Attributes [70.90708863394902]
We argue that captions contain much richer information about the image, including attributes of objects and their relations.
We present a method that uses the attributes in this "textual scene graph" to train object detectors.
We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets.
arXiv Detail & Related papers (2020-09-30T10:59:20Z)
- CompGuessWhat?!: A Multi-task Evaluation Framework for Grounded Language Learning [78.3857991931479]
We present GROLLA, an evaluation framework for Grounded Language Learning with Attributes.
We also propose a new dataset CompGuessWhat?! as an instance of this framework for evaluating the quality of learned neural representations.
arXiv Detail & Related papers (2020-06-03T11:21:42Z)
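Several of the probes discussed above (classification via description in the main paper, attribute retrieval in ArtVLM and OVAD) reduce to scoring attribute-augmented text prompts against an image embedding. The following is a minimal, hypothetical sketch of that scoring loop; the DummyCLIP class, prompt template, and attribute list are placeholders standing in for whichever vision-language model is actually probed.

# Illustrative sketch of description-based attribute scoring; all names below
# are assumptions, not an API from the papers listed here.
import torch
import torch.nn.functional as F

class DummyCLIP:
    """Stand-in for a real CLIP-style model; returns random embeddings so the
    sketch runs end to end. Replace with the actual model being probed."""
    dim = 512
    def encode_image(self, image):
        return torch.randn(1, self.dim)
    def encode_text(self, tokens):
        return torch.randn(len(tokens), self.dim)

def tokenize(prompts):
    return prompts  # a real model would tokenize here; passthrough for the dummy

@torch.no_grad()
def score_attribute_prompts(model, image, obj, attributes):
    """Rank attribute-augmented prompts for one image (higher = better match)."""
    prompts = [f"a photo of a {attr} {obj}" for attr in attributes]
    img_emb = F.normalize(model.encode_image(image), dim=-1)             # (1, D)
    txt_emb = F.normalize(model.encode_text(tokenize(prompts)), dim=-1)  # (A, D)
    sims = (img_emb @ txt_emb.t()).squeeze(0)                            # (A,)
    return dict(zip(attributes, sims.tolist()))

if __name__ == "__main__":
    scores = score_attribute_prompts(DummyCLIP(), image=None, obj="car",
                                     attributes=["red", "blue", "striped", "wooden"])
    print(max(scores, key=scores.get), scores)

If attribute meaning were fully captured in the embeddings, the correctly attributed prompt would consistently score highest; the paper's findings suggest this often fails in practice.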