Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning
- URL: http://arxiv.org/abs/2411.12584v2
- Date: Sun, 08 Jun 2025 06:36:03 GMT
- Title: Leveraging MLLM Embeddings and Attribute Smoothing for Compositional Zero-Shot Learning
- Authors: Xudong Yan, Songhe Feng, Yang Zhang, Jian Yang, Yueguan Lin, Haojun Fei,
- Abstract summary: Compositional zero-shot learning aims to recognize novel compositions of attributes and objects learned from seen compositions.<n>Previous works disentangle attributes and objects by extracting shared and exclusive parts between the image pair sharing the same attribute (object)<n>We propose a novel framework named multimodal large language model (MLLM) embeddings and attribute smoothing guided disentanglement for CZSL.
- Score: 21.488599805772054
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compositional zero-shot learning (CZSL) aims to recognize novel compositions of attributes and objects learned from seen compositions. Previous works disentangle attributes and objects by extracting shared and exclusive parts between the image pair sharing the same attribute (object), as well as aligning them with pretrained word embeddings to improve unseen attribute-object recognition. Despite the significant achievements of existing efforts, they are hampered by three limitations: (1) The efficacy of disentanglement is compromised due to the influence of the background and the intricate entanglement of attributes with objects in the same parts. (2) Existing word embeddings fail to capture complex multimodal semantic information. (3) Overconfidence exhibited by existing models in seen compositions hinders their generalization to novel compositions. Being aware of these, we propose a novel framework named multimodal large language model (MLLM) embeddings and attribute smoothing guided disentanglement for CZSL. First, we leverage feature adaptive aggregation modules to mitigate the impact of background, and utilize learnable condition masks to capture multi-granularity features for disentanglement. Moreover, the last hidden states of MLLM are employed as word embeddings for their superior representation capabilities. Furthermore, we propose attribute smoothing with auxiliary attributes generated by the large language model (LLM) for seen compositions to address the overconfidence challenge. Extensive experiments demonstrate that our method achieves state-of-the-art performance on three challenging datasets. The source code will be available at https://github.com/xud-yan/Trident .
Related papers
- A Conditional Probability Framework for Compositional Zero-shot Learning [86.86063926727489]
Compositional Zero-Shot Learning (CZSL) aims to recognize unseen combinations of known objects and attributes by leveraging knowledge from previously seen compositions.<n>Traditional approaches primarily focus on disentangling attributes and objects, treating them as independent entities during learning.<n>We adopt a Conditional Probability Framework (CPF) to explicitly model attribute-object dependencies.
arXiv Detail & Related papers (2025-07-23T10:20:52Z) - LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching [25.883546163390957]
We endow CLIP with fine-grained action-level understanding by incorporating action-related external knowledge generated by large language models (LLMs)<n>We propose an adaptive interaction module to aggregate attentive visual features conditioned on action-aware prompted knowledge for establishing discriminative and action-aware visual representations.
arXiv Detail & Related papers (2025-06-30T03:49:08Z) - IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.
We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.
We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z) - LiMoE: Mixture of LiDAR Representation Learners from Automotive Scenes [55.33167217384738]
LiMoE is a framework that integrates the Mixture of Experts (MoE) paradigm into LiDAR data representation learning.<n>Our approach consists of three stages: Image-to-LiDAR Pretraining, Contrastive Mixture Learning (CML), and Semantic Mixture Supervision (SMS)
arXiv Detail & Related papers (2025-01-07T18:59:58Z) - Compositional Zero-Shot Learning with Contextualized Cues and Adaptive Contrastive Training [17.893694262999826]
This paper introduces a novel framework, Understanding and Linking Attributes and Objects (ULAO) in Compositional Zero-Shot Learning (CZSL)
ULAO comprises two innovative modules. The Understanding Attributes and Objects (UAO) module improves primitive understanding by sequential primitive prediction and leveraging recognized objects as contextual hints for attribute classification.
The Linking Attributes and Objects (LAO) module improves the attribute-object linkage understanding through a new contrastive learning strategy that incorporates tailored hard negative generation and adaptive loss adjustments.
arXiv Detail & Related papers (2024-12-10T03:41:20Z) - CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP [56.199779065855004]
We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations.
Experiments on the CIFAR-100 and Flickr30K datasets demonstrate that CLIPErase effectively forgets designated associations in zero-shot tasks for multimodal samples.
arXiv Detail & Related papers (2024-10-30T17:51:31Z) - Cross-composition Feature Disentanglement for Compositional Zero-shot Learning [49.919635694894204]
Disentanglement of visual features of primitives (i.e., attributes and objects) has shown exceptional results in Compositional Zero-shot Learning (CZSL)
We propose the solution of cross-composition feature disentanglement, which takes multiple primitive-sharing compositions as inputs and constrains the disentangled primitive features to be general across these compositions.
arXiv Detail & Related papers (2024-08-19T08:23:09Z) - RAR: Retrieving And Ranking Augmented MLLMs for Visual Recognition [78.97487780589574]
Multimodal Large Language Models (MLLMs) excel at classifying fine-grained categories.
This paper introduces a Retrieving And Ranking augmented method for MLLMs.
Our proposed approach not only addresses the inherent limitations in fine-grained recognition but also preserves the model's comprehensive knowledge base.
arXiv Detail & Related papers (2024-03-20T17:59:55Z) - Hierarchical Visual Primitive Experts for Compositional Zero-Shot
Learning [52.506434446439776]
Compositional zero-shot learning (CZSL) aims to recognize compositions with prior knowledge of known primitives (attribute and object)
We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues.
Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL.
arXiv Detail & Related papers (2023-08-08T03:24:21Z) - Language-free Compositional Action Generation via Decoupling Refinement [67.50452446686725]
We introduce a novel framework to generate compositional actions without reliance on language auxiliaries.
Our approach consists of three main components: Action Coupling, Conditional Action Generation, and Decoupling Refinement.
arXiv Detail & Related papers (2023-07-07T12:00:38Z) - Learning Conditional Attributes for Compositional Zero-Shot Learning [78.24309446833398]
Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts.
One of the challenges is to model attributes interacted with different objects, e.g., the attribute wet" in wet apple" and wet cat" is different.
We argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings.
arXiv Detail & Related papers (2023-05-29T08:04:05Z) - Decomposed Soft Prompt Guided Fusion Enhancing for Compositional
Zero-Shot Learning [15.406125901927004]
We propose a novel framework termed Decomposed Fusion with Soft Prompt (DFSP)1, by involving vision-language models (VLMs) for unseen composition recognition.
Specifically, DFSP constructs a vector combination of learnable soft prompts with state and object to establish the joint representation of them.
In addition, a cross-modal fusion module is designed between the language and image branches, which decomposes state and object among language features instead of image features.
arXiv Detail & Related papers (2022-11-19T12:29:12Z) - DUET: Cross-modal Semantic Grounding for Contrastive Zero-shot Learning [37.48292304239107]
We present a transformer-based end-to-end ZSL method named DUET.
We develop a cross-modal semantic grounding network to investigate the model's capability of disentangling semantic attributes from the images.
We find that DUET can often achieve state-of-the-art performance, its components are effective and its predictions are interpretable.
arXiv Detail & Related papers (2022-07-04T11:12:12Z) - Learning Invariant Visual Representations for Compositional Zero-Shot
Learning [30.472541551048508]
Compositional Zero-Shot Learning (CZSL) aims to recognize novel compositions using knowledge learned from seen-object compositions in the training set.
We propose an invariant feature learning framework to align different domains at the representation and gradient levels.
Experiments on two CZSL benchmarks demonstrate that the proposed method significantly outperforms the previous state-of-the-art.
arXiv Detail & Related papers (2022-06-01T11:33:33Z) - Disentangling Visual Embeddings for Attributes and Objects [38.27308243429424]
We study the problem of compositional zero-shot learning for object-attribute recognition.
Prior works use visual features extracted with a backbone network, pre-trained for object classification.
We propose a novel architecture that can disentangle attribute and object features in the visual space.
arXiv Detail & Related papers (2022-05-17T17:59:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.