Text-Guided Image Clustering
- URL: http://arxiv.org/abs/2402.02996v2
- Date: Mon, 19 Feb 2024 12:36:13 GMT
- Title: Text-Guided Image Clustering
- Authors: Andreas Stephan, Lukas Miklautz, Kevin Sidak, Jan Philip Wahle, Bela Gipp, Claudia Plant, Benjamin Roth
- Abstract summary: We propose Text-Guided Image Clustering, i.e., generating text using image captioning and visual question-answering (VQA) models.
Across eight diverse image clustering datasets, our results show that the obtained text representations often outperform image features.
- Score: 15.217924518131268
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Image clustering divides a collection of images into meaningful groups,
typically interpreted post-hoc via human-given annotations. Those are usually
in the form of text, raising the question of whether text can serve as an abstraction for
image clustering. Current image clustering methods, however, neglect the use of
generated textual descriptions. We, therefore, propose Text-Guided Image
Clustering, i.e., generating text using image captioning and visual
question-answering (VQA) models and subsequently clustering the generated text.
Further, we introduce a novel approach to inject task- or domain-specific knowledge for
clustering by prompting VQA models. Across eight diverse image clustering
datasets, our results show that the obtained text representations often
outperform image features. Additionally, we propose a counting-based cluster
explainability method. Our evaluations show that the derived keyword-based
explanations describe clusters better than the respective cluster accuracy
suggests. Overall, this research challenges traditional approaches and paves
the way for a paradigm shift in image clustering, using generated text.
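The abstract fully specifies the shape of the pipeline: generate text per image (captioning, optionally VQA with a domain prompt), embed and cluster the generated text, then explain clusters by counting keywords. Below is a minimal sketch of that pipeline; the specific models (a BLIP captioner, a ViLT VQA model, a MiniLM sentence embedder), the use of k-means, the file paths, and the example question are illustrative assumptions, not the paper's exact setup.
```python
# Minimal sketch of text-guided image clustering: caption each image,
# optionally append a VQA answer to inject domain knowledge, embed the text,
# cluster the embeddings, and explain clusters by keyword counts.
# Model names, paths, and the question are illustrative assumptions.
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from transformers import pipeline

image_paths = ["img_0001.jpg", "img_0002.jpg"]  # hypothetical dataset
n_clusters = 2

# Step 1: one textual description per image from an off-the-shelf captioner.
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
captions = [captioner(p)[0]["generated_text"] for p in image_paths]

# Step 1b (optional): inject task/domain knowledge by prompting a VQA model
# and appending its answer to each caption.
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
question = "What kind of bird is shown?"  # hypothetical domain prompt
captions = [
    f'{cap} {vqa(image=p, question=question)[0]["answer"]}'
    for cap, p in zip(captions, image_paths)
]

# Step 2: embed the generated text and cluster the embeddings.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(
    embedder.encode(captions)
)

# Step 3: counting-based explanation -- most frequent non-stopword keywords
# among the captions assigned to each cluster.
stopwords = {"a", "an", "the", "of", "on", "in", "with", "and", "is", "are"}
for c in range(n_clusters):
    words = [
        w
        for cap, lab in zip(captions, labels)
        if lab == c
        for w in cap.lower().split()
        if w not in stopwords
    ]
    print(f"cluster {c}:", [w for w, _ in Counter(words).most_common(5)])
```
Changing the question is the knowledge-injection lever the abstract describes: clustering then groups images by the prompted attribute rather than by generic caption content.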
Related papers
- InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method for semantic segmentation.
We introduce Contrastive Soft Clustering to align masks with the image's structural information.
InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
arXiv Detail & Related papers (2024-10-15T10:20:31Z)
- Text-Guided Alternative Image Clustering [11.103514372355088]
This work explores the potential of large vision-language models to facilitate alternative image clustering.
We propose Text-Guided Alternative Image Consensus Clustering (TGAICC), a novel approach that leverages user-specified interests via prompts.
TGAICC outperforms image- and text-based baselines on four alternative image clustering benchmark datasets.
arXiv Detail & Related papers (2024-06-07T08:37:57Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Clustering-based Image-Text Graph Matching for Domain Generalization [13.277406473107721]
Domain-invariant visual representations are important to train a model that can generalize well to unseen target task domains.
Recent works demonstrate that text descriptions contain high-level class-discriminative information.
We advocate for the use of local alignment between image regions and corresponding textual descriptions.
arXiv Detail & Related papers (2023-10-04T10:03:07Z)
- Shatter and Gather: Learning Referring Image Segmentation with Text Supervision [52.46081425504072]
We present a new model that discovers semantic entities in the input image and then combines those entities relevant to the text query to predict the mask of the referent.
Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed existing methods for the same task and recent open-vocabulary segmentation models on all benchmarks.
arXiv Detail & Related papers (2023-08-29T15:39:15Z)
- CoC-GAN: Employing Context Cluster for Unveiling a New Pathway in Image Generation [12.211795836214112]
We propose a unique image generation process premised on the perspective of converting images into a set of point clouds.
Our methodology leverages a simple clustering method named Context Clustering (CoC) to generate images from unordered point sets.
We introduce this model, with its novel structure, as the Context Clustering Generative Adversarial Network (CoC-GAN).
arXiv Detail & Related papers (2023-08-23T01:19:58Z)
- Image as Set of Points [60.30495338399321]
Context clusters (CoCs) view an image as a set of unorganized points and extract features via a simplified clustering algorithm.
Our CoCs are convolution- and attention-free, relying only on a clustering algorithm for spatial interaction (a toy sketch of this point-set view follows this entry).
arXiv Detail & Related papers (2023-03-02T18:56:39Z)
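As flagged in the entry above, here is a toy sketch of the point-set view of an image: flatten pixels into unordered 5-D points (normalized position plus color) and extract a crude feature set by clustering them. Everything here (the 5-D pixel features, the choice of k-means) is an assumption for illustration; this is the underlying intuition only, not the CoC architecture, whose details the summary does not give.
```python
# Toy illustration of the "image as a set of points" idea: flatten an image
# into unordered 5-D points (x, y, r, g, b), cluster them, and use the
# cluster centers as a crude feature set. NOT the CoC architecture itself.
import numpy as np
from sklearn.cluster import KMeans

def point_set_features(image: np.ndarray, n_clusters: int = 8) -> np.ndarray:
    """image: (H, W, 3) uint8 array -> (n_clusters, 5) array of cluster centers."""
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Each pixel becomes one unordered point: normalized position + color.
    points = np.column_stack([
        xs.ravel() / w,
        ys.ravel() / h,
        image.reshape(-1, 3) / 255.0,
    ])
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(points)
    return km.cluster_centers_  # one 5-D descriptor per spatial/color cluster

# Example: features for a random 32x32 image.
rng = np.random.default_rng(0)
feats = point_set_features(rng.integers(0, 256, (32, 32, 3), dtype=np.uint8))
print(feats.shape)  # (8, 5)
```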
- Adaptively Clustering Neighbor Elements for Image-Text Generation [78.82346492527425]
We propose a novel Transformer-based image-to-text generation model termed ACF.
ACF adaptively clusters vision patches into object regions and language words into phrases to implicitly learn object-phrase alignments.
Experiment results demonstrate the effectiveness of ACF, which outperforms most SOTA captioning and VQA models.
arXiv Detail & Related papers (2023-01-05T08:37:36Z)
- Semantic-Enhanced Image Clustering [6.218389227248297]
We propose to investigate the task of image clustering with the help of a vision-language pre-trained model.
How to map images to a proper semantic space and how to cluster images from both image and semantic spaces are two key problems.
We propose a method that first maps the given images to a proper semantic space, together with efficient methods to generate pseudo-labels according to the relationships between images and semantics.
arXiv Detail & Related papers (2022-08-21T09:04:21Z)
- Text-to-Image Generation Grounded by Fine-Grained User Attention [62.94737811887098]
Localized Narratives is a dataset with detailed natural language descriptions of images paired with mouse traces.
We propose TReCS, a sequential model that exploits this grounding to generate images.
arXiv Detail & Related papers (2020-11-07T13:23:31Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.