Leveraging Textures in Zero-shot Understanding of Fine-Grained Domains
- URL: http://arxiv.org/abs/2203.11449v1
- Date: Tue, 22 Mar 2022 04:07:20 GMT
- Title: Leveraging Textures in Zero-shot Understanding of Fine-Grained Domains
- Authors: Chenyun Wu and Subhransu Maji
- Abstract summary: We study the effectiveness of large-scale language and vision models (e.g., CLIP) at recognizing texture attributes in natural images.
We first conduct a systematic study of CLIP on texture datasets where we find that it has good coverage for a wide range of texture terms.
We then show how these attributes allow for zero-shot fine-grained categorization on existing datasets.
- Score: 34.848408203825194
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Textures can be used to describe the appearance of objects in a wide range of
fine-grained domains. Textures are localized and one can often refer to their
properties in a manner that is independent of the object identity. Moreover,
there is a rich vocabulary to describe textures corresponding to properties
such as their color, pattern, structure, periodicity, stochasticity, and
others. Motivated by this, we study the effectiveness of large-scale language
and vision models (e.g., CLIP) at recognizing texture attributes in natural
images. We first conduct a systematic study of CLIP on texture datasets where
we find that it has good coverage for a wide range of texture terms. CLIP can
also handle compositional phrases that consist of color and pattern terms
(e.g., red dots or yellow stripes). We then show how these attributes allow for
zero-shot fine-grained categorization on existing datasets.
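The zero-shot mechanism the abstract relies on is CLIP-style scoring: embed the image and a set of candidate texture phrases, then rank phrases by cosine similarity. A minimal sketch with toy stand-in embeddings (real scores would come from CLIP's image and text encoders; the phrases and vectors here are illustrative, not from the paper):

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs):
    """Rank candidate texture phrases for one image by cosine similarity,
    the same mechanism CLIP uses for zero-shot classification."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img  # one cosine score per phrase

# Toy stand-ins for CLIP embeddings. In practice each row would be the
# text-encoder output for a prompt such as "a photo of yellow stripes".
rng = np.random.default_rng(0)
phrases = ["red dots", "yellow stripes", "plain white"]
text_embs = rng.normal(size=(3, 8))
# Simulate an image whose embedding lies near the "yellow stripes" phrase.
image_emb = text_embs[1] + 0.1 * rng.normal(size=8)

scores = zero_shot_scores(image_emb, text_embs)
best = phrases[int(np.argmax(scores))]
print(best)
```

With real CLIP features, the same ranking over compositional phrases (color + pattern terms) yields the zero-shot attribute predictions studied in the paper.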
Related papers
- Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space.
We first demonstrate that our approach yields significantly improved capabilities in ranking images by a certain attribute.
We also illustrate that the resulting embeddings exhibit a greater degree of geometric structure in embedding space.
arXiv Detail & Related papers (2024-09-15T13:02:14Z)
- Are we describing the same sound? An analysis of word embedding spaces of expressive piano performance [4.867952721052875]
We investigate the uncertainty in characterizations of expressive piano performance.
We test five embedding models and their similarity structure for correspondence with the ground truth.
The quality of embedding models shows great variability with respect to this task.
arXiv Detail & Related papers (2023-12-31T12:20:03Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary semantics, to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features with no need for additional data formats other than image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image.
Experiments demonstrate an average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z)
- Text2Scene: Text-driven Indoor Scene Stylization with Part-aware Details [12.660352353074012]
We propose Text2Scene, a method to automatically create realistic textures for virtual scenes composed of multiple objects.
Our pipeline adds detailed texture to labeled 3D geometries in the room such that the generated colors respect the hierarchical structure of semantic parts, which are often composed of similar materials.
arXiv Detail & Related papers (2023-08-31T17:37:23Z)
- Referring Image Matting [85.77905619102802]
We introduce a new task named Referring Image Matting (RIM) in this paper.
RIM aims to extract the meticulous alpha matte of the specific object that best matches the given natural language description.
RefMatte consists of 230 object categories, 47,500 images, 118,749 expression-region entities, and 474,996 expressions.
arXiv Detail & Related papers (2022-06-10T14:44:43Z)
- Topological Semantic Mapping by Consolidation of Deep Visual Features [0.0]
This work introduces a topological semantic mapping method that uses deep visual features extracted by a CNN (GoogLeNet) from 2D images captured from multiple views of the environment as the robot operates.
The experiments, performed using a real-world indoor dataset, showed that the method is able to consolidate the visual features of regions and use them to recognize objects and place categories as semantic properties.
arXiv Detail & Related papers (2021-06-24T01:10:03Z)
- Learning Statistical Texture for Semantic Segmentation [53.7443670431132]
We propose a novel Statistical Texture Learning Network (STLNet) for semantic segmentation.
For the first time, STLNet analyzes the distribution of low-level information and efficiently utilizes it for the task.
Based on the Quantization and Counting Operator (QCO), two modules are introduced: (1) the Texture Enhance Module (TEM), which captures texture-related information and enhances texture details, and (2) the Pyramid Texture Feature Extraction Module (PTFEM), which extracts statistical texture features at multiple scales.
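The core idea of statistical texture learning, quantizing low-level feature intensities into discrete levels and counting their occurrences, can be illustrated with a toy 1-D analogue (this is a simplified sketch, not the paper's actual QCO, which operates on learned deep feature maps):

```python
import numpy as np

def quantize_and_count(features, num_levels=8):
    """Toy analogue of statistical texture extraction: quantize feature
    intensities into discrete levels and count occurrences, yielding a
    normalized histogram that summarizes their distribution."""
    lo, hi = features.min(), features.max()
    # Map each value to an integer level in [0, num_levels - 1].
    levels = ((features - lo) / (hi - lo + 1e-8) * num_levels).astype(int)
    levels = np.clip(levels, 0, num_levels - 1)
    counts = np.bincount(levels.ravel(), minlength=num_levels)
    return counts / counts.sum()  # normalized level histogram

hist = quantize_and_count(np.array([0.0, 0.1, 0.5, 0.9, 1.0, 0.95]))
print(hist)
```

A histogram like this captures distributional (statistical) texture cues that per-pixel semantics alone miss, which is the intuition behind extracting such statistics at multiple scales.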
arXiv Detail & Related papers (2021-03-06T15:05:35Z)
- Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents [17.672677325827454]
Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations.
We present a simple unsupervised clustering-based method that increases precision and recall beyond object detection and image tagging baselines.
The proposed method is particularly effective for local contextual meanings of a word, for example associating "granite" with countertops in the real estate dataset and with rocky landscapes in a Wikipedia dataset.
arXiv Detail & Related papers (2020-10-30T16:39:49Z)
- Describing Textures using Natural Language [32.076605062485605]
Textures in natural images can be characterized by color, shape, periodicity of elements within them, and other attributes that can be described using natural language.
We study the problem of describing visual attributes of texture on a novel dataset containing rich descriptions of textures.
We present visualizations of several fine-grained domains and show that texture attributes learned on our dataset offer improvements over expert-designed attributes on the Caltech-UCSD Birds dataset.
arXiv Detail & Related papers (2020-08-03T20:37:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.