The Finer the Better: Towards Granular-aware Open-set Domain Generalization
- URL: http://arxiv.org/abs/2511.16979v1
- Date: Fri, 21 Nov 2025 06:19:19 GMT
- Title: The Finer the Better: Towards Granular-aware Open-set Domain Generalization
- Authors: Yunyun Wang, Zheng Duan, Xinyue Liao, Ke-Jia Chen, Songcan Chen
- Abstract summary: Open-Set Domain Generalization tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Existing methods still fall into a dilemma between the structural risk of known classes and the open-space risk from unknown classes. We propose a Semantic-enhanced CLIP framework that explicitly addresses this dilemma through fine-grained semantic enhancement.
- Score: 31.197204515055756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-Set Domain Generalization (OSDG) tackles the realistic scenario where deployed models encounter both domain shifts and novel object categories. Despite impressive progress with vision-language models like CLIP, existing methods still fall into a dilemma between the structural risk of known classes and the open-space risk from unknown classes, and easily suffer from over-confidence, especially when distinguishing "hard unknowns" that share fine-grained visual similarities with known classes. To this end, we propose a Semantic-enhanced CLIP (SeeCLIP) framework that explicitly addresses this dilemma through fine-grained semantic enhancement. In SeeCLIP, we propose a semantic-aware prompt enhancement module that decomposes images into discriminative semantic tokens, enabling nuanced vision-language alignment beyond coarse category labels. To position unknown prompts effectively, we introduce duplex contrastive learning with complementary objectives: repulsion to maintain separability from known classes, and cohesion to preserve semantic proximity. Further, our semantic-guided diffusion module synthesizes pseudo-unknowns by perturbing the extracted semantic tokens, generating challenging samples that are visually similar to known classes yet exhibit key local differences. These hard negatives force the model to learn finer decision boundaries. Extensive experiments across five benchmarks demonstrate consistent improvements of 3% in accuracy and 5% in H-score over state-of-the-art methods.
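The duplex objective from the abstract can be pictured concretely. Below is a minimal PyTorch sketch of one plausible repulsion/cohesion loss for placing an unknown prompt; the function name, margin values, and weighting are illustrative assumptions, not SeeCLIP's actual implementation.

```python
import torch
import torch.nn.functional as F

def duplex_contrastive_loss(unknown_emb, known_embs,
                            repel_margin=0.5, cohere_floor=0.2, lam=1.0):
    """Hypothetical duplex objective for placing an unknown prompt.

    Repulsion: the unknown embedding should not be too similar to any
    known-class embedding. Cohesion: it should not drift so far that it
    loses semantic proximity to the known-class region.
    """
    u = F.normalize(unknown_emb, dim=-1)             # (d,)
    k = F.normalize(known_embs, dim=-1)              # (C, d)
    sims = k @ u                                     # cosine similarity to each known class
    repulsion = F.relu(sims - repel_margin).mean()   # penalize being too close to any class
    cohesion = F.relu(cohere_floor - sims.max())     # penalize drifting too far from all classes
    return repulsion + lam * cohesion

# Toy usage with random prompt embeddings.
loss = duplex_contrastive_loss(torch.randn(512), torch.randn(10, 512))
```

The two hinge terms act in opposite directions: repulsion caps the unknown prompt's similarity to every known class, while cohesion keeps its best similarity from falling below a floor.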
Related papers
- In defense of the two-stage framework for open-set domain adaptive semantic segmentation [114.08201544572546]
Open-Set Domain Adaptive Semantic Segmentation (OSDA-SS) requires both domain adaptation for known classes and the distinction of unknowns. We propose SATS, a Separating-then-Adapting Training Strategy, which addresses OSDA-SS through two sequential steps: known/unknown separation and unknown-aware domain adaptation. Our method ensures balanced learning of discriminative features for both known and unknown classes, steering the model toward discovering truly unknown objects.
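As a rough illustration of the first step, the separation could be as simple as confidence thresholding; in the sketch below the threshold, names, and reserved unknown index are all assumptions, not details from the paper.

```python
import torch

def separate_known_unknown(logits, tau=0.7):
    """Sketch of a known/unknown separation step: pixels whose maximum
    softmax probability falls below tau become candidate unknowns."""
    probs = logits.softmax(dim=1)          # (B, C, H, W) class probabilities
    conf, pseudo = probs.max(dim=1)        # per-pixel confidence and pseudo-label
    unknown_mask = conf < tau
    pseudo[unknown_mask] = logits.size(1)  # reserve index C for "unknown"
    return pseudo, unknown_mask

pseudo, mask = separate_known_unknown(torch.randn(2, 19, 64, 64))
```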
arXiv Detail & Related papers (2026-01-04T08:58:03Z)
- Generalized Decoupled Learning for Enhancing Open-Vocabulary Dense Perception [71.26728044621458]
DeCLIP is a novel framework that enhances CLIP by decoupling the self-attention module to obtain "content" and "context" features respectively. It consistently achieves state-of-the-art performance across a broad spectrum of tasks, including 2D detection and segmentation, 3D instance segmentation, video instance segmentation, and 6D object pose estimation.
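One plausible reading of this decoupling, sketched below: two attention maps share the same values, with query-query self-correlation producing "content" features and the standard query-key map producing "context" features. This is an interpretation of the abstract, not DeCLIP's exact formulation.

```python
import torch
import torch.nn.functional as F

def decoupled_self_attention(q, k, v):
    """Sketch: split one self-attention layer into two parallel maps."""
    scale = q.size(-1) ** -0.5
    content_attn = F.softmax(q @ q.transpose(-2, -1) * scale, dim=-1)  # self-correlation
    context_attn = F.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)  # standard attention
    return content_attn @ v, context_attn @ v

content, context = decoupled_self_attention(*torch.randn(3, 8, 196, 64))
```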
arXiv Detail & Related papers (2025-08-15T06:43:51Z)
- Vocabulary-free Fine-grained Visual Recognition via Enriched Contextually Grounded Vision-Language Model [52.01031460230826]
Traditional approaches rely heavily on fixed vocabularies and closed-set classification paradigms. Recent research has demonstrated that combining large language models with vision-language models (VLMs) makes open-set recognition possible. We propose our training-free method, Enriched-FineR, which demonstrates state-of-the-art results in fine-grained visual recognition.
arXiv Detail & Related papers (2025-07-30T20:06:01Z)
- CCL-LGS: Contrastive Codebook Learning for 3D Language Gaussian Splatting [53.15827818829865]
Methods that rely on 2D priors are prone to a critical challenge: cross-view semantic inconsistencies. We propose CCL-LGS, a novel framework that enforces view-consistent semantic supervision by integrating multi-view semantic cues. Our framework explicitly resolves semantic conflicts while preserving category discriminability.
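The codebook component can be pictured as nearest-neighbor quantization of per-view semantic features into shared discrete entries, which then supply view-consistent supervision targets. The sketch below is purely illustrative; shapes and names are assumed.

```python
import torch

def quantize_to_codebook(feats, codebook):
    """Sketch: assign each semantic feature to its nearest codebook entry."""
    dists = torch.cdist(feats, codebook)   # (N, K) pairwise distances
    ids = dists.argmin(dim=1)              # discrete, view-shareable semantic IDs
    return codebook[ids], ids

quantized, ids = quantize_to_codebook(torch.randn(4096, 64), torch.randn(128, 64))
```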
arXiv Detail & Related papers (2025-05-26T19:09:33Z)
- Unknown Prompt, the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization [12.126495847808803]
We introduce ODG-CLIP, harnessing the semantic prowess of the vision-language model, CLIP.
We conceptualize ODG as a multi-class classification challenge encompassing both known and novel categories.
We infuse images with class-discriminative knowledge derived from the prompt space to augment the fidelity of CLIP's visual embeddings.
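The net effect resembles a (K+1)-way CLIP-style classifier: K known-class prompts plus one dedicated unknown prompt. The sketch below assumes precomputed embeddings; the temperature and function name are hypothetical.

```python
import torch
import torch.nn.functional as F

def classify_with_unknown_prompt(image_feats, known_text_feats, unknown_text_feat,
                                 temperature=0.01):
    """Sketch: score images against K known-class prompts plus one unknown
    prompt by cosine similarity; argmax == K reads as 'unknown'."""
    text = torch.cat([known_text_feats, unknown_text_feat.unsqueeze(0)], dim=0)
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text, dim=-1)
    return img @ txt.t() / temperature     # (B, K + 1) logits

logits = classify_with_unknown_prompt(torch.randn(8, 512),
                                      torch.randn(10, 512), torch.randn(512))
```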
arXiv Detail & Related papers (2024-03-31T15:03:31Z)
- Language-Driven Visual Consensus for Zero-Shot Semantic Segmentation [114.72734384299476]
We propose a Language-Driven Visual Consensus (LDVC) approach, fostering improved alignment of semantic and visual information.
We leverage class embeddings as anchors due to their discrete and abstract nature, steering vision features toward class embeddings.
Our approach significantly boosts the capacity of segmentation models for unseen classes.
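Read literally, anchoring amounts to pulling each visual feature toward the embedding of its class. The cosine-distance loss below is a minimal sketch under that reading, with all names hypothetical.

```python
import torch
import torch.nn.functional as F

def anchor_alignment_loss(pixel_feats, labels, class_embs):
    """Sketch: steer each pixel feature toward its class-embedding anchor."""
    anchors = class_embs[labels]                   # (N, d) one anchor per pixel
    cos = F.cosine_similarity(pixel_feats, anchors, dim=-1)
    return (1 - cos).mean()                        # 0 when perfectly aligned

loss = anchor_alignment_loss(torch.randn(1024, 256),
                             torch.randint(0, 21, (1024,)), torch.randn(21, 256))
```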
arXiv Detail & Related papers (2024-03-13T11:23:55Z)
- Cluster-based Contrastive Disentangling for Generalized Zero-Shot Learning [25.92340532509084]
Generalized Zero-Shot Learning (GZSL) aims to recognize both seen and unseen classes by training only on seen classes.
We propose a Cluster-based Contrastive Disentangling (CCD) method to improve GZSL by alleviating the semantic gap and domain shift problems.
arXiv Detail & Related papers (2022-03-05T02:50:12Z)
- Learning Aligned Cross-Modal Representation for Generalized Zero-Shot Classification [17.177622259867515]
We propose an innovative autoencoder network by learning Aligned Cross-Modal Representations (dubbed ACMR) for Generalized Zero-Shot Classification (GZSC).
Specifically, we propose a novel Vision-Semantic Alignment (VSA) method to strengthen the alignment of cross-modal latent features on the latent subspaces guided by a learned classifier.
In addition, we propose a novel Information Enhancement Module (IEM) to reduce the possibility of latent-variable collapse while encouraging the discriminative ability of the latent variables.
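A rough sketch of classifier-guided cross-modal alignment in the spirit of VSA (names and loss composition are assumed): both modality latents are classified by a shared classifier while also being pulled together directly.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def vsa_style_loss(z_vis, z_sem, labels, classifier):
    """Sketch: a shared classifier guides alignment of the two latent
    spaces, plus a direct cross-modal cosine term."""
    ce = F.cross_entropy(classifier(z_vis), labels) \
       + F.cross_entropy(classifier(z_sem), labels)
    align = (1 - F.cosine_similarity(z_vis, z_sem, dim=-1)).mean()
    return ce + align

clf = nn.Linear(128, 40)  # hypothetical latent dim and class count
loss = vsa_style_loss(torch.randn(32, 128), torch.randn(32, 128),
                      torch.randint(0, 40, (32,)), clf)
```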
arXiv Detail & Related papers (2021-12-24T03:35:37Z)
- Exploiting a Joint Embedding Space for Generalized Zero-Shot Semantic Segmentation [25.070027668717422]
Generalized zero-shot semantic segmentation (GZS3) predicts pixel-wise semantic labels for seen and unseen classes.
Most GZS3 methods adopt a generative approach that synthesizes visual features of unseen classes from corresponding semantic ones.
We propose a discriminative approach to address these limitations in a unified framework.
arXiv Detail & Related papers (2021-08-14T13:33:58Z)
- Deep Clustering by Semantic Contrastive Learning [67.28140787010447]
We introduce a novel variant called Semantic Contrastive Learning (SCL).
It explores the characteristics of both conventional contrastive learning and deep clustering.
It can amplify the strengths of contrastive learning and deep clustering in a unified approach.
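One common way to unify the two ideas, shown below as an assumed illustration rather than SCL's exact objective: treat the columns of soft cluster assignments from two augmented views as per-cluster representations and contrast them, so clustering structure and instance discrimination reinforce each other.

```python
import torch
import torch.nn.functional as F

def cluster_contrastive_loss(p_a, p_b, temperature=0.5):
    """Sketch: contrast per-cluster assignment columns across two views."""
    a = F.normalize(p_a.t(), dim=-1)       # (K, N): one row per cluster
    b = F.normalize(p_b.t(), dim=-1)
    logits = a @ b.t() / temperature       # (K, K) cluster-to-cluster similarity
    targets = torch.arange(a.size(0))      # matching clusters are positives
    return F.cross_entropy(logits, targets)

p_a = torch.randn(256, 10).softmax(dim=1)  # soft assignments for two views
p_b = torch.randn(256, 10).softmax(dim=1)
loss = cluster_contrastive_loss(p_a, p_b)
```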
arXiv Detail & Related papers (2021-03-03T20:20:48Z)