COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
- URL: http://arxiv.org/abs/2503.23388v1
- Date: Sun, 30 Mar 2025 10:34:45 GMT
- Title: COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation
- Authors: Fanding Huang, Jingyan Jiang, Qinting Jiang, Hebei Li, Faisal Nadeem Khan, Zhi Wang
- Abstract summary: Recent vision-language models (VLMs) face significant challenges in test-time adaptation to novel domains. We propose COSMIC, a robust test-time adaptation framework that enhances adaptability through multi-granular, cross-modal semantic caching. Our framework introduces two key innovations: Dual Semantics Graph (DSG) and Clique Guided Hyper-class (CGH).
- Score: 3.8212877227467548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent vision-language models (VLMs) face significant challenges in test-time adaptation to novel domains. While cache-based methods show promise by leveraging historical information, they struggle with both caching unreliable feature-label pairs and indiscriminately using single-class information during querying, significantly compromising adaptation accuracy. To address these limitations, we propose COSMIC (Clique-Oriented Semantic Multi-space Integration for CLIP), a robust test-time adaptation framework that enhances adaptability through multi-granular, cross-modal semantic caching and graph-based querying mechanisms. Our framework introduces two key innovations: Dual Semantics Graph (DSG) and Clique Guided Hyper-class (CGH). The Dual Semantics Graph constructs complementary semantic spaces by incorporating textual features, coarse-grained CLIP features, and fine-grained DINOv2 features to capture rich semantic relationships. Building upon these dual graphs, the Clique Guided Hyper-class component leverages structured class relationships to enhance prediction robustness through correlated class selection. Extensive experiments demonstrate COSMIC's superior performance across multiple benchmarks, achieving significant improvements over state-of-the-art methods: 15.81% gain on out-of-distribution tasks and 5.33% on cross-domain generation with CLIP RN-50. Code is available at github.com/hf618/COSMIC.
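The idea of grouping correlated classes through cliques, as the abstract describes for the Clique Guided Hyper-class component, can be illustrated with a small sketch. This is not the authors' implementation (see github.com/hf618/COSMIC for that). It assumes, purely for illustration, that classes are linked when the cosine similarity of their prototype features exceeds a threshold, and that each maximal clique of the resulting graph is treated as one hyper-class. The function names, the toy 2-D features, and the 0.95 threshold are all hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_graph(features, threshold):
    """Link classes i and j when their cosine similarity is >= threshold (assumed rule)."""
    adj = {i: set() for i in range(len(features))}
    for i, j in combinations(range(len(features)), 2):
        if cosine(features[i], features[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidates, x: already-processed vertices
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)


# Hypothetical class prototypes; real features would come from an encoder.
class_names = ["cat", "dog", "wolf", "car", "truck"]
features = [
    [1.0, 0.1],    # cat
    [0.9, 0.2],    # dog
    [0.95, 0.15],  # wolf
    [0.1, 1.0],    # car
    [0.2, 0.9],    # truck
]

adj = similarity_graph(features, threshold=0.95)
hyper_classes = [[class_names[i] for i in c] for c in maximal_cliques(adj)]
print(hyper_classes)
```

With these toy values the three animal classes fall into one clique and the two vehicle classes into another, so a query could consult a group of correlated classes rather than a single class.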
Related papers
- FineCLIPER: Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs [5.35588281968644]
We propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER).
Our FineCLIPER achieves SOTA performance on the DFEW, FERV39k, and MAFW datasets with few tunable parameters.
arXiv Detail & Related papers (2024-07-02T10:55:43Z)
- A Pure Transformer Pretraining Framework on Text-attributed Graphs [50.833130854272774]
We introduce a feature-centric pretraining perspective by treating graph structure as a prior.
Our framework, Graph Sequence Pretraining with Transformer (GSPT), samples node contexts through random walks.
GSPT can be easily adapted to both node classification and link prediction, demonstrating promising empirical success on various datasets.
arXiv Detail & Related papers (2024-06-19T22:30:08Z)
- CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification [3.594351309950969]
CapS-Adapter is an innovative method that harnesses both image and caption features to exceed existing state-of-the-art techniques in training-free scenarios.
Our method achieves outstanding zero-shot classification results across 19 benchmark datasets, improving accuracy by 2.19% over the previous leading method.
arXiv Detail & Related papers (2024-05-26T14:50:40Z)
- CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing [66.6712018832575]
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains.
We make use of large-scale VLMs like CLIP and leverage the textual feature to dynamically adjust the classifier's weights for exploring generalizable visual features.
arXiv Detail & Related papers (2024-03-21T11:58:50Z)
- SEER-ZSL: Semantic Encoder-Enhanced Representations for Generalized Zero-Shot Learning [0.6792605600335813]
Zero-Shot Learning (ZSL) presents the challenge of identifying categories not seen during training. We introduce Semantic-Enhanced Representations for Zero-Shot Learning (SEER-ZSL). First, we aim to distill meaningful semantic information using a probabilistic encoder, enhancing the semantic consistency and robustness. Second, we distill the visual space by exploiting the learned data distribution through an adversarially trained generator. Third, we align the distilled information, enabling a mapping of unseen categories onto the true data manifold.
arXiv Detail & Related papers (2023-12-20T15:18:51Z)
- Generalized Few-Shot Continual Learning with Contrastive Mixture of Adapters [59.82088750033897]
We set up a Generalized FSCL (GFSCL) protocol involving both class- and domain-incremental situations.
We find that common continual learning methods have poor generalization ability on unseen domains.
To address this, we propose a rehearsal-free framework based on Vision Transformer (ViT) named Contrastive Mixture of Adapters (CMoA).
arXiv Detail & Related papers (2023-02-12T15:18:14Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- Incremental Few-Shot Semantic Segmentation via Embedding Adaptive-Update and Hyper-class Representation [30.558312809285905]
Experiments on PASCAL-5i and COCO datasets show that EHNet achieves new state-of-the-art performance with remarkable advantages.
arXiv Detail & Related papers (2022-07-26T15:20:07Z)
- Learning towards Synchronous Network Memorizability and Generalizability for Continual Segmentation across Multiple Sites [52.84959869494459]
In clinical practice, a segmentation network is often required to continually learn on a sequential data stream from multiple sites.
Existing methods are usually restricted in either network memorizability on previous sites or generalizability on unseen sites.
This paper aims to tackle the problem of Synchronous Memorizability and Generalizability with a newly proposed SMG-learning framework.
arXiv Detail & Related papers (2022-06-14T13:04:36Z)
- Hybrid Relation Guided Set Matching for Few-shot Action Recognition [51.3308583226322]
We propose a novel Hybrid Relation guided Set Matching (HyRSM) approach that incorporates two key components.
The purpose of the hybrid relation module is to learn task-specific embeddings by fully exploiting associated relations within and across videos in an episode.
We evaluate HyRSM on six challenging benchmarks, and the experimental results show its superiority over the state-of-the-art methods by a convincing margin.
arXiv Detail & Related papers (2022-04-28T11:43:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.