Conditional Cross Attention Network for Multi-Space Embedding without
Entanglement in Only a SINGLE Network
- URL: http://arxiv.org/abs/2307.13254v1
- Date: Tue, 25 Jul 2023 04:48:03 GMT
- Title: Conditional Cross Attention Network for Multi-Space Embedding without
Entanglement in Only a SINGLE Network
- Authors: Chull Hwan Song, Taebaek Hwang, Jooyoung Yoon, Shunghyun Choi, Yeong
Hyeon Gu
- Abstract summary: We propose a Conditional Cross-Attention Network that induces disentangled multi-space embeddings for various specific attributes with only a single backbone.
Our proposed method achieved consistent state-of-the-art performance on the FashionAI, DARN, DeepFashion, and Zappos50K benchmark datasets.
- Score: 1.8899300124593648
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Many studies in vision tasks have aimed to create effective embedding spaces
for single-label object prediction within an image. However, in reality, most
objects possess multiple specific attributes, such as shape, color, and length,
with each attribute composed of various classes. To apply models in real-world
scenarios, it is essential to be able to distinguish between the granular
components of an object. Conventional approaches to embedding multiple specific
attributes into a single network often result in entanglement, where
fine-grained features of each attribute cannot be identified separately. To
address this problem, we propose a Conditional Cross-Attention Network that
induces disentangled multi-space embeddings for various specific attributes
with only a single backbone. Firstly, we employ a cross-attention mechanism to
fuse and switch the information of conditions (specific attributes), and we
demonstrate its effectiveness through diverse visualization examples.
Secondly, we are the first to apply the vision transformer to a fine-grained
image retrieval task, and we present a simple yet effective framework compared
to existing methods. Unlike previous studies where performance varied
depending on the benchmark dataset, our proposed method achieved consistent
state-of-the-art performance on the FashionAI, DARN, DeepFashion, and Zappos50K
benchmark datasets.
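To make the mechanism concrete, here is a minimal PyTorch sketch of a conditional cross-attention head over shared ViT patch tokens, assuming one learned query per condition; the class and parameter names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a per-condition learned query
# attends over shared ViT patch tokens, so changing the condition id
# changes which embedding space the image is projected into.
import torch
import torch.nn as nn

class ConditionalCrossAttention(nn.Module):
    def __init__(self, embed_dim=768, num_heads=8, num_conditions=8):
        super().__init__()
        # One learned query per condition (e.g. shape, color, length).
        self.condition_queries = nn.Embedding(num_conditions, embed_dim)
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, patch_tokens, condition_idx):
        # patch_tokens: (B, N, D) from the single shared ViT backbone
        # condition_idx: (B,) integer id of the attribute to embed
        query = self.condition_queries(condition_idx).unsqueeze(1)  # (B, 1, D)
        # The condition query fuses and switches information from the patch
        # tokens, yielding one disentangled embedding per attribute.
        out, _ = self.cross_attn(query, patch_tokens, patch_tokens)
        return self.norm(out.squeeze(1))  # (B, D) condition-specific embedding
```

In use, the same patch tokens can be embedded under different condition ids, e.g. `model(tokens, color_id)` versus `model(tokens, shape_id)` (ids illustrative), giving one embedding space per attribute from a single network.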
Related papers
- HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification [15.129037250680582]
Tight visual-linguistic interactions play a vital role in improving classification performance.
Recent Transformer-based methods have achieved great success in multi-label image classification.
We propose a Hierarchical Scale-Aware Vision-Language Transformer (HSVLT) with two appealing designs.
arXiv Detail & Related papers (2024-07-23T07:31:42Z)
- Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization [61.64304227831361]
Single-domain generalization aims to learn a model from single source domain data to achieve generalized performance on other unseen target domains.
We propose a dynamic object-centric perception network based on prompt learning, aiming to adapt to the variations in image complexity.
arXiv Detail & Related papers (2024-02-28T16:16:51Z)
- Leveraging Off-the-shelf Diffusion Model for Multi-attribute Fashion Image Manipulation [27.587905673112473]
Fashion attribute editing is a task that aims to convert the semantic attributes of a given fashion image while preserving the irrelevant regions.
Previous works typically employ conditional GANs in which the generator explicitly learns the target attributes and directly executes the conversion.
We explore classifier-guided diffusion, which leverages an off-the-shelf diffusion model pretrained on general visual semantics such as ImageNet.
arXiv Detail & Related papers (2022-10-12T02:21:18Z)
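For context, classifier guidance itself fits in a few lines: the gradient of a noisy-image attribute classifier shifts the frozen diffusion model's noise prediction toward the target attribute. The sketch below is generic, with assumed names (`unet`, `classifier`, `sqrt_one_minus_abar`), not the paper's code.

```python
# Generic classifier-guidance sketch (assumed names, not the paper's code).
import torch

def guided_noise(unet, classifier, x_t, t, target_attr, sqrt_one_minus_abar,
                 scale=3.0):
    # Noise prediction from the frozen, off-the-shelf diffusion model.
    eps = unet(x_t, t)
    # Gradient of log p(target_attr | x_t) from a classifier trained on
    # noisy images at timestep t.
    x_in = x_t.detach().requires_grad_(True)
    log_probs = classifier(x_in, t).log_softmax(dim=-1)
    selected = log_probs[torch.arange(x_t.size(0)), target_attr].sum()
    grad = torch.autograd.grad(selected, x_in)[0]
    # Shift the noise estimate toward images the classifier assigns to
    # target_attr; sqrt_one_minus_abar is the schedule scalar for step t.
    return eps - sqrt_one_minus_abar * scale * grad
```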
- Dynamic Prototype Mask for Occluded Person Re-Identification [88.7782299372656]
Existing methods mainly address occlusion by employing body clues provided by an extra network to distinguish the visible parts.
We propose a novel Dynamic Prototype Mask (DPM) based on two self-evident prior knowledge.
Under this condition, the occluded representation can be spontaneously well aligned in a selected subspace.
arXiv Detail & Related papers (2022-07-19T03:31:13Z)
- Disentangling Visual Embeddings for Attributes and Objects [38.27308243429424]
We study the problem of compositional zero-shot learning for object-attribute recognition.
Prior works use visual features extracted with a backbone network, pre-trained for object classification.
We propose a novel architecture that can disentangle attribute and object features in the visual space.
arXiv Detail & Related papers (2022-05-17T17:59:36Z)
- Diverse Instance Discovery: Vision-Transformer for Instance-Aware Multi-Label Image Recognition [24.406654146411682]
The Vision Transformer (ViT) serves as the foundation of this work.
Our goal is to leverage ViT's patch tokens and self-attention mechanism to mine rich instances in multi-label images.
We propose a weakly supervised object localization-based approach to extract multi-scale local features.
arXiv Detail & Related papers (2022-04-22T14:38:40Z)
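As a rough illustration of mining instances from patch tokens (hypothetical names, not the paper's method verbatim), the CLS-to-patch attention map can act as a free localization signal that pools patch features into multi-label logits:

```python
# Hypothetical sketch: CLS-to-patch attention as a localization prior
# for multi-label recognition.
import torch

def attention_pooled_logits(patch_tokens, cls_attn, head):
    # patch_tokens: (B, N, D) final-layer ViT patch embeddings
    # cls_attn:     (B, N) attention weights from the CLS token to each patch
    weights = cls_attn.softmax(dim=-1).unsqueeze(-1)   # (B, N, 1)
    pooled = (weights * patch_tokens).sum(dim=1)       # (B, D) attention pooling
    return head(pooled)                                # (B, num_classes) logits
```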
- Disentangled Unsupervised Image Translation via Restricted Information Flow [61.44666983942965]
Many state-of-the-art methods hard-code the desired shared-vs-specific split into their architecture.
We propose a new method that does not rely on inductive architectural biases.
We show that the proposed method achieves consistently high manipulation accuracy across two synthetic and one natural dataset.
arXiv Detail & Related papers (2021-11-26T00:27:54Z)
- Multi-dataset Pretraining: A Unified Model for Semantic Segmentation [97.61605021985062]
We propose a unified framework, termed Multi-Dataset Pretraining, to take full advantage of the fragmented annotations of different datasets.
This is achieved by first pretraining the network via the proposed pixel-to-prototype contrastive loss over multiple datasets.
In order to better model the relationship among images and classes from different datasets, we extend the pixel level embeddings via cross dataset mixing.
arXiv Detail & Related papers (2021-06-08T06:13:11Z)
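The pixel-to-prototype idea admits a compact generic sketch (assumed shapes and names, not the authors' exact loss): each pixel embedding is scored against one prototype per class and trained with cross-entropy over scaled cosine similarities.

```python
# Generic pixel-to-prototype contrastive loss sketch (not the exact paper loss).
import torch
import torch.nn.functional as F

def pixel_to_prototype_loss(pixel_emb, labels, prototypes, tau=0.1):
    # pixel_emb:  (P, D) pixel embeddings sampled across datasets
    # labels:     (P,)   unified class index of each pixel
    # prototypes: (C, D) one learned (or EMA) prototype per class
    pixel_emb = F.normalize(pixel_emb, dim=-1)
    prototypes = F.normalize(prototypes, dim=-1)
    logits = pixel_emb @ prototypes.t() / tau  # (P, C) scaled cosine similarities
    # Pull each pixel toward its class prototype, push it from the others.
    return F.cross_entropy(logits, labels)
```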
- SMILE: Semantically-guided Multi-attribute Image and Layout Editing [154.69452301122175]
Attribute image manipulation has been a very active topic since the introduction of Generative Adversarial Networks (GANs).
We present a multimodal representation that handles all attributes, whether guided by random noise or by images, while using only the underlying domain information of the target domain.
Our method is capable of adding, removing or changing either fine-grained or coarse attributes by using an image as a reference or by exploring the style distribution space.
arXiv Detail & Related papers (2020-10-05T20:15:21Z)
- Multiple instance learning on deep features for weakly supervised object detection with extreme domain shifts [1.9336815376402716]
Weakly supervised object detection (WSOD) using only image-level annotations has attracted growing attention over the past few years.
We show that a simple multiple-instance approach applied to pre-trained deep features yields excellent performance on non-photographic datasets.
arXiv Detail & Related papers (2020-08-03T20:36:01Z)
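The recipe reduces to a small head on frozen features; the sketch below uses max-pooling over region proposals as the instance aggregator (an assumption; all names are illustrative).

```python
# Illustrative MIL head over frozen deep features (names are hypothetical).
import torch
import torch.nn as nn

class MILHead(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=20):
        super().__init__()
        self.score = nn.Linear(feat_dim, num_classes)  # per-region class scores

    def forward(self, region_feats):
        # region_feats: (B, R, D) pre-trained features for R proposals per image
        scores = self.score(region_feats)     # (B, R, C) instance scores
        # A bag (image) is positive for a class if its best region is:
        return scores.max(dim=1).values       # (B, C) image-level logits
```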
- Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification [91.67977602992657]
We propose a new strategy based on feature selection, which is both simpler and more effective than previous feature adaptation approaches.
We show that a simple non-parametric classifier built on top of such features produces high accuracy and generalizes to domains never seen during training.
arXiv Detail & Related papers (2020-03-20T15:44:17Z)
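A minimal sketch of the select-then-classify recipe, with a simple variance heuristic standing in for the paper's selection criterion (an assumption) and a nearest-centroid rule as the non-parametric classifier:

```python
# Select-then-classify sketch; the variance-based selection is an assumption.
import torch
import torch.nn.functional as F

def few_shot_predict(support, support_y, query, num_classes, k=256):
    # support: (S, D) multi-domain features for the labeled support set
    # query:   (Q, D) features for the unlabeled queries
    keep = support.var(dim=0).topk(k).indices            # k most-varying dims
    support, query = support[:, keep], query[:, keep]
    centroids = torch.stack([support[support_y == c].mean(dim=0)
                             for c in range(num_classes)])  # (C, k)
    sims = F.normalize(query, dim=-1) @ F.normalize(centroids, dim=-1).t()
    return sims.argmax(dim=-1)                           # nearest-centroid labels
```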
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.