Hierarchical Semantic Alignment for Image Clustering
- URL: http://arxiv.org/abs/2512.00904v1
- Date: Sun, 30 Nov 2025 14:14:51 GMT
- Title: Hierarchical Semantic Alignment for Image Clustering
- Authors: Xingyu Zhu, Beier Zhu, Yunfan Li, Junfeng Fang, Shuo Wang, Kesen Zhao, Hanwang Zhang
- Abstract summary: We propose a hierarChical semAntic alignmEnt method for image clustering, dubbed CAE, which improves clustering performance in a training-free manner. We first select relevant nouns from WordNet and descriptions from caption datasets to construct a semantic space aligned with image features. Then, we align image features with selected nouns and captions via optimal transport to obtain a more discriminative semantic space.
- Score: 59.277605709780524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image clustering is a classic problem in computer vision, which categorizes images into different groups. Recent studies utilize nouns as external semantic knowledge to improve clustering performance. However, these methods often overlook the inherent ambiguity of nouns, which can distort semantic representations and degrade clustering quality. To address this issue, we propose a hierarChical semAntic alignmEnt method for image clustering, dubbed CAE, which improves clustering performance in a training-free manner. In our approach, we incorporate two complementary types of textual semantics: caption-level descriptions, which convey fine-grained attributes of image content, and noun-level concepts, which represent high-level object categories. We first select relevant nouns from WordNet and descriptions from caption datasets to construct a semantic space aligned with image features. Then, we align image features with selected nouns and captions via optimal transport to obtain a more discriminative semantic space. Finally, we combine the enhanced semantic and image features to perform clustering. Extensive experiments across 8 datasets demonstrate the effectiveness of our method, notably surpassing the state-of-the-art training-free approach with a 4.2% improvement in accuracy and a 2.9% improvement in adjusted rand index (ARI) on the ImageNet-1K dataset.
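The pipeline described in the abstract (align image features with text embeddings via optimal transport, then cluster the fused features) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the random embeddings, dimensions, and the plain Sinkhorn solver are all stand-ins.

```python
# Rough sketch of an OT-alignment-then-cluster pipeline (illustrative only):
# align image embeddings with a bank of text (noun/caption) embeddings via
# entropic optimal transport, then cluster the fused features with k-means.
import numpy as np
from sklearn.cluster import KMeans

def sinkhorn(cost, reg=0.05, n_iters=200):
    """Entropic OT plan between uniform marginals (basic Sinkhorn iterations)."""
    n, m = cost.shape
    K = np.exp(-cost / reg)                  # Gibbs kernel
    a, b = np.ones(n) / n, np.ones(m) / m    # uniform marginals
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iters):
        u = a / (K @ v)                      # scale rows toward marginal a
        v = b / (K.T @ u)                    # scale columns toward marginal b
    return u[:, None] * K * v[None, :]       # transport plan, sums to 1

rng = np.random.default_rng(0)
images = rng.normal(size=(100, 32))          # stand-in image embeddings
texts = rng.normal(size=(40, 32))            # stand-in noun/caption embeddings
images /= np.linalg.norm(images, axis=1, keepdims=True)
texts /= np.linalg.norm(texts, axis=1, keepdims=True)

cost = 1.0 - images @ texts.T                # cosine distance as the OT cost
plan = sinkhorn(cost)
# OT-weighted mixture of text embeddings gives each image a semantic feature.
semantic = (plan / plan.sum(axis=1, keepdims=True)) @ texts

fused = np.concatenate([images, semantic], axis=1)
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(fused)
print(labels.shape)  # one cluster id per image
```

The key design point mirrored here is that the transport plan, rather than a hard nearest-neighbor assignment, weights each image's semantic feature, which is what makes the enhanced semantic space smoother and more discriminative in the paper's framing.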
Related papers
- AdaptaGen: Domain-Specific Image Generation through Hierarchical Semantic Optimization Framework [0.0]
Domain-specific image generation aims to produce high-quality visual content for specialized fields. Current approaches overlook the inherent dependence between semantic understanding and visual representation in specialized domains. We propose AdaptaGen, a hierarchical semantic optimization framework that integrates matrix-based prompt optimization with multi-perspective understanding.
arXiv Detail & Related papers (2025-07-08T03:04:08Z) - Finetuning CLIP to Reason about Pairwise Differences [52.028073305958074]
We propose an approach to train vision-language models such as CLIP in a contrastive manner to reason about differences in embedding space. We finetune CLIP so that text descriptions of differences between images correspond to their difference in image embedding space. Our approach yields significantly improved capabilities in ranking images by a certain attribute, and improved zero-shot classification performance on many downstream image classification tasks.
arXiv Detail & Related papers (2024-09-15T13:02:14Z) - Dual-Level Cross-Modal Contrastive Clustering [4.083185193413678]
We propose a novel image clustering framework, named Dual-level Cross-Modal Contrastive Clustering (DXMC).
External textual information is introduced to construct a semantic space, which is adopted to generate image-text pairs.
The image-text pairs are respectively sent to pre-trained image and text encoders to obtain image and text embeddings, which are subsequently fed into four well-designed networks.
arXiv Detail & Related papers (2024-09-06T18:49:45Z) - Vocabulary-free Image Classification and Semantic Segmentation [71.78089106671581]
We introduce the Vocabulary-free Image Classification (VIC) task, which aims to assign a class from an unconstrained language-induced semantic space to an input image without needing a known vocabulary.
VIC is challenging due to the vastness of the semantic space, which contains millions of concepts, including fine-grained categories.
We propose Category Search from External Databases (CaSED), a training-free method that leverages a pre-trained vision-language model and an external database.
arXiv Detail & Related papers (2024-04-16T19:27:21Z) - Vocabulary-free Image Classification [75.38039557783414]
We formalize a novel task, termed Vocabulary-free Image Classification (VIC).
VIC aims to assign to an input image a class that resides in an unconstrained language-induced semantic space, without the prerequisite of a known vocabulary.
CaSED is a method that exploits a pre-trained vision-language model and an external vision-language database to address VIC in a training-free manner.
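The training-free recipe this entry describes (retrieve captions from an external database, pool candidate categories, score them against the image) can be sketched as below. Everything here is a hypothetical stand-in: the random vectors replace real vision-language embeddings, and `db_nouns` replaces noun extraction from retrieved captions; this is not the CaSED codebase.

```python
# Illustrative sketch of a retrieval-then-score, training-free classification
# step: find the captions nearest to an image in a shared embedding space,
# pool candidate category names from them, and pick the best-matching one.
import numpy as np

rng = np.random.default_rng(1)
dim = 16
image = rng.normal(size=dim)
image /= np.linalg.norm(image)               # stand-in image embedding

# Stand-in external database: caption embeddings plus the nouns each contains.
db_embeds = rng.normal(size=(50, dim))
db_embeds /= np.linalg.norm(db_embeds, axis=1, keepdims=True)
db_nouns = [[f"noun_{i % 7}", f"noun_{(i + 3) % 7}"] for i in range(50)]

# 1) Retrieve the top-k captions by cosine similarity to the image.
k = 5
top = np.argsort(db_embeds @ image)[-k:]

# 2) Pool candidate categories from the retrieved captions.
candidates = sorted({n for i in top for n in db_nouns[i]})

# 3) Score each candidate with a (stand-in) text embedding; pick the best.
cand_embeds = rng.normal(size=(len(candidates), dim))
cand_embeds /= np.linalg.norm(cand_embeds, axis=1, keepdims=True)
predicted = candidates[int(np.argmax(cand_embeds @ image))]
print(predicted)
```

No parameters are trained at any step, which is what "training-free" means in this setting: the pre-trained embedding space and the external database carry all the knowledge.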
arXiv Detail & Related papers (2023-06-01T17:19:43Z) - ISLE: A Framework for Image Level Semantic Segmentation Ensemble [5.137284292672375]
Conventional semantic segmentation networks require massive pixel-wise annotated labels to reach state-of-the-art prediction quality.
We propose ISLE, which employs an ensemble of the "pseudo-labels" for a given set of different semantic segmentation techniques on a class-wise level.
We reach up to 2.4% improvement over ISLE's individual components.
arXiv Detail & Related papers (2023-03-14T13:36:36Z) - Semantic-Enhanced Image Clustering [6.218389227248297]
We propose to investigate the task of image clustering with the help of a visual-language pre-training model.
How to map images to a proper semantic space and how to cluster images from both image and semantic spaces are two key problems.
We first propose a method to map the given images to a proper semantic space, together with efficient methods to generate pseudo-labels from the relationships between images and semantics.
arXiv Detail & Related papers (2022-08-21T09:04:21Z) - Attention-Guided Supervised Contrastive Learning for Semantic Segmentation [16.729068267453897]
In per-pixel prediction tasks such as segmentation, more than one label can exist in a single image.
We propose an attention-guided supervised contrastive learning approach to highlight a single semantic object every time as the target.
arXiv Detail & Related papers (2021-06-03T05:01:11Z) - Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z) - Mining Cross-Image Semantics for Weakly Supervised Semantic Segmentation [128.03739769844736]
Two neural co-attentions are incorporated into the classifier to capture cross-image semantic similarities and differences.
In addition to boosting object pattern learning, the co-attention can leverage context from other related images to improve localization map inference.
Our algorithm sets a new state of the art in all these settings, demonstrating its efficacy and generalizability.
arXiv Detail & Related papers (2020-07-03T21:53:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.