TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning
- URL: http://arxiv.org/abs/2510.20162v1
- Date: Thu, 23 Oct 2025 03:20:29 GMT
- Title: TOMCAT: Test-time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning
- Authors: Xudong Yan, Songhe Feng
- Abstract summary: Compositional Zero-Shot Learning aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time. We propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities to update multimodal prototypes at test time.
- Score: 35.14123452166428
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. Existing methods suffer from performance degradation caused by the distribution shift of label space at test time, which stems from the inclusion of unseen compositions recombined from attributes and objects. To overcome the challenge, we propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities from unsupervised data to update multimodal prototypes at test time. Building on this, we further design an adaptive update weight to control the degree of prototype adjustment, enabling the model to flexibly adapt to distribution shift during testing. Moreover, a dynamic priority queue is introduced that stores high-confidence images to acquire visual knowledge from historical images for inference. Considering the semantic consistency of multimodal knowledge, we align textual and visual prototypes by multimodal collaborative representation learning. Extensive experiments indicate that our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings. Code will be available at https://github.com/xud-yan/TOMCAT .
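The abstract names three mechanisms: test-time prototype updates from unlabeled data, an adaptive update weight, and a priority queue of high-confidence images. A minimal PyTorch sketch of how such pieces could fit together is below; the softmax pseudo-labeling, the EMA-style update, and the fixed scalar standing in for the paper's adaptive weight are all assumptions for illustration, not TOMCAT's actual formulation.

```python
import heapq
import torch
import torch.nn.functional as F

class ConfidenceQueue:
    """Fixed-capacity priority queue keeping the most confident test images."""
    def __init__(self, capacity=64):
        self.capacity = capacity
        self.heap = []   # min-heap of (confidence, insertion_id, feature)
        self.n = 0       # insertion counter: breaks ties before tensor compare

    def push(self, confidence, feature):
        item = (confidence, self.n, feature)
        self.n += 1
        if len(self.heap) < self.capacity:
            heapq.heappush(self.heap, item)
        elif confidence > self.heap[0][0]:       # evict least confident entry
            heapq.heapreplace(self.heap, item)

    def features(self):
        """Stacked features of everything currently stored."""
        return torch.stack([f for _, _, f in self.heap])

def update_prototypes(prototypes, feats, weight=0.01, temperature=0.07):
    """Pull composition prototypes toward softly-assigned test features.

    prototypes: (C, D) L2-normalized multimodal prototypes
    feats:      (B, D) L2-normalized features of unlabeled test images
    weight:     update weight; a fixed scalar here, adaptive in the paper
    """
    assign = (feats @ prototypes.T / temperature).softmax(dim=-1)  # (B, C)
    mass = assign.sum(dim=0).clamp(min=1e-6)                       # (C,)
    batch_proto = (assign.T @ feats) / mass[:, None]               # (C, D)
    return F.normalize((1 - weight) * prototypes + weight * batch_proto, dim=-1)
```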
Related papers
- WARM-CAT: Warm-Started Test-Time Comprehensive Knowledge Accumulation for Compositional Zero-Shot Learning [41.10398503450224]
Compositional Zero-Shot Learning (CZSL) aims to recognize novel attribute-object compositions based on the knowledge learned from seen ones. We propose a novel approach that accumulates comprehensive knowledge in both textual and visual modalities to update multimodal prototypes at test time. Our approach achieves state-of-the-art performance on four benchmark datasets under both closed-world and open-world settings.
arXiv Detail & Related papers (2026-02-26T15:27:17Z)
- Dynamic Multimodal Prototype Learning in Vision-Language Models [44.84161970425967]
We introduce ProtoMM, a training-free framework that constructs multimodal prototypes to adapt vision-language models at test time. By viewing the prototype as a discrete distribution over textual descriptions and visual particles, ProtoMM can combine multimodal features for comprehensive prototype learning.
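Reading the summary literally, a ProtoMM prototype is a discrete distribution whose support mixes textual descriptions with visual particles gathered at test time. A toy scoring rule under that reading, with hypothetical names and plain expected similarity standing in for whatever combination rule the paper actually uses:

```python
import torch

def proto_score(image_feat, text_feats, visual_particles, mass):
    """Expected similarity between a query and one prototype distribution.

    image_feat:       (D,)   L2-normalized query feature
    text_feats:       (T, D) embeddings of the class's textual descriptions
    visual_particles: (P, D) visual features accumulated for the class
    mass:             (T+P,) non-negative weights over all support elements
    """
    support = torch.cat([text_feats, visual_particles], dim=0)  # (T+P, D)
    sims = support @ image_feat                                  # (T+P,)
    return (mass / mass.sum() * sims).sum()
```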
arXiv Detail & Related papers (2025-07-04T15:31:47Z)
- Cross-Modal Prototype Allocation: Unsupervised Slide Representation Learning via Patch-Text Contrast in Computational Pathology [10.811667603360041]
ProAlign is a cross-modal unsupervised slide representation learning framework. We leverage a large language model (LLM) to generate descriptive text for the prototype types present in a whole slide image. We propose a parameter-free attention aggregation strategy that utilizes the similarity between patches and these prototypes to form unsupervised slide embeddings.
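A "parameter-free attention aggregation" suggests similarity-weighted pooling with no learned projections. One plausible instantiation is sketched below; the temperature and the per-prototype output shape are guesses, not ProAlign's definition:

```python
import torch

def aggregate_slide(patch_feats, proto_feats, temperature=0.07):
    """Pool patch embeddings into prototype-conditioned slide summaries.

    patch_feats: (N, D) embeddings of the N patches of one whole-slide image
    proto_feats: (K, D) prototype embeddings (e.g., from LLM-generated text)
    Returns (K, D): an attention-weighted mean of patches per prototype,
    using only similarities -- no learnable parameters.
    """
    attn = (proto_feats @ patch_feats.T / temperature).softmax(dim=-1)  # (K, N)
    return attn @ patch_feats                                           # (K, D)
```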
arXiv Detail & Related papers (2025-03-26T03:31:07Z)
- Adapting Vision-Language Models to Open Classes via Test-Time Prompt Tuning [50.26965628047682]
Adapting pre-trained models to open classes is a challenging problem in machine learning.
In this paper, we consider combining the advantages of both and propose a test-time prompt tuning approach.
Our proposed method outperforms all comparison methods on average across both base and new classes.
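The summary does not spell out the objective, but test-time prompt tuning methods in this line of work commonly minimize the entropy of the prediction averaged over augmented views of the test image. The sketch below assumes that loss and uses a toy frozen encoder; the offset-style prompt, dimensions, and learning rate are purely illustrative:

```python
import torch

def entropy_of_mean(logits):
    """Entropy of the class distribution averaged over views; logits: (V, C)."""
    probs = logits.softmax(dim=-1).mean(dim=0)
    return -(probs * probs.clamp_min(1e-12).log()).sum()

torch.manual_seed(0)
view_feats = torch.randn(8, 512)    # features of 8 augmented views of one image
text_feats = torch.randn(10, 512)   # frozen text embeddings for 10 classes
prompt = torch.zeros(512, requires_grad=True)  # learnable prompt offset

opt = torch.optim.AdamW([prompt], lr=1e-2)
for _ in range(10):                 # a few gradient steps on this one image
    logits = view_feats @ (text_feats + prompt).T   # (8, 10)
    loss = entropy_of_mean(logits)
    opt.zero_grad(); loss.backward(); opt.step()
```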
arXiv Detail & Related papers (2024-08-29T12:34:01Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompting techniques, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has achieved great success in integrating information from multiple perspectives of a dataset to improve downstream task performance.
This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
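One compact reading of "matches regions of images with corresponding semantic embeddings" is to score each class by its best-matching region; the max-pooling aggregation below is an assumption, and the paper may combine region evidence differently:

```python
import torch

def class_scores(region_feats, class_embeds):
    """Score classes by their best-matching image region.

    region_feats: (R, D) features of R candidate regions in one image
    class_embeds: (C, D) class semantic embeddings, e.g., refined by
                  propagation over a knowledge graph
    """
    sims = region_feats @ class_embeds.T   # (R, C) region-class similarities
    return sims.max(dim=0).values          # (C,) best region per class
```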
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- Generalized Zero-Shot Learning using Multimodal Variational Auto-Encoder with Semantic Concepts [0.9054540533394924]
Recent techniques try to learn a cross-modal mapping between the semantic space and the image space.
We propose a Multimodal Variational Auto-Encoder (M-VAE) that learns a latent space shared between image features and the semantic space.
Our results show that our proposed model outperforms the current state-of-the-art approaches for generalized zero-shot learning.
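A shared latent space across modalities is typically realized with one encoder/decoder pair per modality plus cross-modal reconstruction. The sketch below follows that recipe; the dimensions, cross-reconstruction terms, and KL weight are assumptions rather than the M-VAE architecture as published:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoModalityVAE(nn.Module):
    """Illustrative VAE with a latent space shared by two modalities."""
    def __init__(self, img_dim=2048, sem_dim=300, z_dim=64):
        super().__init__()
        self.img_enc = nn.Linear(img_dim, 2 * z_dim)   # outputs (mu, logvar)
        self.sem_enc = nn.Linear(sem_dim, 2 * z_dim)
        self.img_dec = nn.Linear(z_dim, img_dim)
        self.sem_dec = nn.Linear(z_dim, sem_dim)

    @staticmethod
    def sample(mu, logvar):
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, img, sem):
        mu_i, lv_i = self.img_enc(img).chunk(2, dim=-1)
        mu_s, lv_s = self.sem_enc(sem).chunk(2, dim=-1)
        z_i, z_s = self.sample(mu_i, lv_i), self.sample(mu_s, lv_s)
        # Cross-reconstruction ties both posteriors to one shared space.
        recon = (F.mse_loss(self.img_dec(z_i), img)
                 + F.mse_loss(self.sem_dec(z_s), sem)
                 + F.mse_loss(self.img_dec(z_s), img)
                 + F.mse_loss(self.sem_dec(z_i), sem))
        kl = sum(-0.5 * (1 + lv - mu.pow(2) - lv.exp()).sum(-1).mean()
                 for mu, lv in [(mu_i, lv_i), (mu_s, lv_s)])
        return recon + 0.1 * kl

loss = TwoModalityVAE()(torch.randn(4, 2048), torch.randn(4, 300))
```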
arXiv Detail & Related papers (2021-06-26T20:08:37Z)
- Distribution Alignment: A Unified Framework for Long-tail Visual Recognition [52.36728157779307]
We propose a unified distribution alignment strategy for long-tail visual recognition.
We then introduce a generalized re-weighting method in the two-stage learning scheme to balance the class prior.
Our approach achieves the state-of-the-art results across all four recognition tasks with a simple and unified framework.
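A common family of "generalized" re-weighting rules scales per-class loss weights by inverse class frequency raised to an exponent, recovering no re-weighting at one extreme and full inverse-frequency weighting at the other. The sketch assumes that rule; it is not necessarily the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def generalized_reweight(class_counts, gamma=1.0):
    """Per-class weights w_c proportional to (1 / n_c)**gamma, mean-normalized.

    gamma=0 means uniform weights; gamma=1 means inverse-frequency weighting.
    """
    counts = torch.as_tensor(class_counts, dtype=torch.float)
    w = counts.pow(-gamma)
    return w * (len(w) / w.sum())

# Usage in a second-stage, class-balanced fine-tuning step:
weights = generalized_reweight([5000, 500, 50])   # long-tailed counts
logits, labels = torch.randn(8, 3), torch.randint(0, 3, (8,))
loss = F.cross_entropy(logits, labels, weight=weights)
```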
arXiv Detail & Related papers (2021-03-30T14:09:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.