Hierarchical Visual Primitive Experts for Compositional Zero-Shot
Learning
- URL: http://arxiv.org/abs/2308.04016v1
- Date: Tue, 8 Aug 2023 03:24:21 GMT
- Title: Hierarchical Visual Primitive Experts for Compositional Zero-Shot
Learning
- Authors: Hanjae Kim, Jiyoung Lee, Seongheon Park, Kwanghoon Sohn
- Abstract summary: Compositional zero-shot learning (CZSL) aims to recognize unseen compositions with prior knowledge of known primitives (attribute and object).
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
- Score: 52.506434446439776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compositional zero-shot learning (CZSL) aims to recognize unseen compositions
with prior knowledge of known primitives (attribute and object). Previous works
on CZSL often struggle to capture the contextuality between attribute and
object, to learn discriminative visual features, and to handle the long-tailed
distribution of real-world compositional data. We propose a simple and scalable
framework called Composition Transformer (CoT) to address these issues. CoT
employs object and attribute experts in distinctive manners to generate
representative embeddings, using the visual network hierarchically. The object
expert extracts representative object embeddings from the final layer in a
bottom-up manner, while the attribute expert makes attribute embeddings in a
top-down manner with a proposed object-guided attention module that models
contextuality explicitly. To remedy biased prediction caused by imbalanced data
distribution, we develop a simple minority attribute augmentation (MAA) that
synthesizes virtual samples by mixing two images and oversampling minority
attribute classes. Our method achieves SoTA performance on several benchmarks,
including MIT-States, C-GQA, and VAW-CZSL. We also demonstrate the
effectiveness of CoT in improving visual discrimination and addressing the
model bias from the imbalanced data distribution. The code is available at
https://github.com/HanjaeKim98/CoT.
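The two mechanisms in the abstract can be illustrated with short sketches. The first is a minimal, hypothetical sketch of the object-guided attention used by the attribute expert, assuming a PyTorch-style module in which the object embedding queries intermediate feature tokens from an earlier layer of the visual backbone; the module name, dimensions, and layer choice are assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class ObjectGuidedAttention(nn.Module):
    """Hypothetical sketch: the object embedding acts as a query over intermediate
    visual tokens, and the attended features form the attribute embedding."""
    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.q = nn.Linear(embed_dim, embed_dim)
        self.k = nn.Linear(feat_dim, embed_dim)
        self.v = nn.Linear(feat_dim, embed_dim)

    def forward(self, obj_embed: torch.Tensor, feat_tokens: torch.Tensor) -> torch.Tensor:
        # obj_embed: (B, embed_dim); feat_tokens: (B, N, feat_dim) from an earlier layer.
        q = self.q(obj_embed).unsqueeze(1)               # (B, 1, D)
        k, v = self.k(feat_tokens), self.v(feat_tokens)  # (B, N, D)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.size(-1) ** 0.5, dim=-1)
        return (attn @ v).squeeze(1)                     # (B, D) attribute embedding
```

The second sketch illustrates minority attribute augmentation (MAA) as described in the abstract: virtual samples are synthesized by mixing two images while oversampling minority attribute classes. The inverse-frequency partner sampling and the mixup-style interpolation below are assumptions chosen to match that description, not the paper's exact recipe.

```python
import torch
from collections import Counter

def attribute_sampling_weights(attr_labels: torch.Tensor) -> torch.Tensor:
    """Inverse-frequency weights so that rare attribute classes are drawn more often."""
    counts = Counter(attr_labels.tolist())
    return torch.tensor([1.0 / counts[a] for a in attr_labels.tolist()])

def minority_attribute_augmentation(images, attr_labels, obj_labels, alpha=2.0):
    """Mix each image with a partner drawn in proportion to the rarity of the
    partner's attribute class, producing mixup-style virtual samples."""
    weights = attribute_sampling_weights(attr_labels)
    partner = torch.multinomial(weights, num_samples=images.size(0), replacement=True)
    lam = torch.distributions.Beta(alpha, alpha).sample((images.size(0),))
    lam_img = lam.view(-1, 1, 1, 1)
    mixed = lam_img * images + (1.0 - lam_img) * images[partner]
    # Both label sets are returned so the losses can be interpolated with the same lambda.
    return mixed, (attr_labels, attr_labels[partner]), (obj_labels, obj_labels[partner]), lam

# Toy usage on random tensors.
imgs = torch.randn(8, 3, 224, 224)
attrs = torch.randint(0, 5, (8,))
objs = torch.randint(0, 10, (8,))
virtual, attr_pair, obj_pair, lam = minority_attribute_augmentation(imgs, attrs, objs)
```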
Related papers
- Cross-composition Feature Disentanglement for Compositional Zero-shot Learning [49.919635694894204]
Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL).
We propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions.
arXiv Detail & Related papers (2024-08-19T08:23:09Z) - CREST: Cross-modal Resonance through Evidential Deep Learning for Enhanced Zero-Shot Learning [48.46511584490582]
Zero-shot learning (ZSL) enables the recognition of novel classes by leveraging semantic knowledge transfer from known to unknown categories.
Real-world challenges such as distribution imbalances and attribute co-occurrence hinder the discernment of local variances in images.
We propose a bidirectional cross-modal ZSL approach CREST to overcome these challenges.
arXiv Detail & Related papers (2024-04-15T10:19:39Z) - Attribute-Aware Representation Rectification for Generalized Zero-Shot
Learning [19.65026043141699]
Generalized Zero-shot Learning (GZSL) has yielded remarkable performance through the design of a series of unbiased visual-semantic mappings.
We propose a simple yet effective Attribute-Aware Representation Rectification framework for GZSL, dubbed $\mathbf{(AR)^2}$.
arXiv Detail & Related papers (2023-11-23T11:30:32Z) - Dual Feature Augmentation Network for Generalized Zero-shot Learning [14.410978100610489]
Zero-shot learning (ZSL) aims to infer novel classes without training samples by transferring knowledge from seen classes.
Existing embedding-based approaches for ZSL typically employ attention mechanisms to locate attributes on an image.
We propose a novel Dual Feature Augmentation Network (DFAN), which comprises two feature augmentation modules.
arXiv Detail & Related papers (2023-09-25T02:37:52Z) - Learning Conditional Attributes for Compositional Zero-Shot Learning [78.24309446833398]
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts.
One of the challenges is to model attributes that interact with different objects, e.g., the attribute "wet" in "wet apple" and "wet cat" is different.
We argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings.
arXiv Detail & Related papers (2023-05-29T08:04:05Z) - Exploiting Semantic Attributes for Transductive Zero-Shot Learning [97.61371730534258]
Zero-shot learning aims to recognize unseen classes by generalizing the relation between visual features and semantic attributes learned from the seen classes.
We present a novel transductive ZSL method that produces semantic attributes of the unseen data and imposes them on the generative process.
Experiments on five standard benchmarks show that our method yields state-of-the-art results for zero-shot learning.
arXiv Detail & Related papers (2023-03-17T09:09:48Z) - DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning [37.48292304239107]
We present a transformer-based end-to-end ZSL method named DUET.
We develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images.
We find that DUET can often achieve state-of-the-art performance, that its components are effective, and that its predictions are interpretable.
arXiv Detail & Related papers (2022-07-04T11:12:12Z) - Learning Invariant Visual Representations for Compositional Zero-Shot
Learning [30.472541551048508]
Compositional Zero-Shot Learning (CZSL) aims to recognize novel compositions using knowledge learned from seen attribute-object compositions in the training set.
We propose an invariant feature learning framework to align different domains at the representation and gradient levels.
Experiments on two CZSL benchmarks demonstrate that the proposed method significantly outperforms the previous state-of-the-art.
arXiv Detail & Related papers (2022-06-01T11:33:33Z) - Disentangling Visual Embeddings for Attributes and Objects [38.27308243429424]
We study the problem of compositional zero-shot learning for object-attribute recognition.
Prior works use visual features extracted with a backbone network, pre-trained for object classification.
We propose a novel architecture that can disentangle attribute and object features in the visual space.
arXiv Detail & Related papers (2022-05-17T17:59:36Z) - Cross-modal Representation Learning for Zero-shot Action Recognition [67.57406812235767]
We present a cross-modal Transformer-based framework, which jointly encodes video data and text labels for zero-shot action recognition (ZSAR).
Our model employs a conceptually new pipeline by which visual representations are learned in conjunction with visual-semantic associations in an end-to-end manner.
Experiment results show our model considerably improves upon the state of the art in ZSAR, reaching encouraging top-1 accuracy on the UCF101, HMDB51, and ActivityNet benchmark datasets.
arXiv Detail & Related papers (2022-05-03T17:39:27Z)