Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering
- URL: http://arxiv.org/abs/2404.15655v1
- Date: Wed, 24 Apr 2024 05:20:42 GMT
- Title: Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering
- Authors: Jiawei Yao, Qi Qian, Juhua Hu
- Abstract summary: Multi-MaP is a novel method employing a multi-modal proxy learning process.
It not only captures a user's interest via a keyword but also facilitates identifying the relevant clusterings.
Our experiments show that Multi-MaP consistently outperforms state-of-the-art methods in all benchmark multi-clustering vision tasks.
- Score: 8.447067012487866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiple clustering has gained significant attention in recent years due to its potential to reveal multiple hidden structures of data from different perspectives. The advent of deep multiple clustering techniques has notably advanced the performance by uncovering complex patterns and relationships within large datasets. However, a major challenge arises as users often do not need all the clusterings that algorithms generate, and figuring out the one needed requires a substantial understanding of each clustering result. Traditionally, aligning a user's brief keyword of interest with the corresponding vision components was challenging, but the emergence of multi-modal and large language models (LLMs) has begun to bridge this gap. In response, given unlabeled target visual data, we propose Multi-MaP, a novel method employing a multi-modal proxy learning process. It leverages CLIP encoders to extract coherent text and image embeddings, with GPT-4 integrating users' interests to formulate effective textual contexts. Moreover, a reference word constraint and a concept-level constraint are designed to learn the optimal text proxy according to the user's interest. Multi-MaP not only adeptly captures a user's interest via a keyword but also facilitates identifying relevant clusterings. Our extensive experiments show that Multi-MaP consistently outperforms state-of-the-art methods in all benchmark multi-clustering vision tasks. Our code is available at https://github.com/Alexander-Yao/Multi-MaP.
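To make the pipeline concrete, here is a minimal sketch of the proxy-learning idea from the abstract, with random tensors standing in for the CLIP embeddings and illustrative loss weights; the paper's actual constraints and optimization details differ:
```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d = 512                                             # CLIP-style embedding dimension
img = F.normalize(torch.randn(100, d), dim=1)       # stand-ins for CLIP image embeddings
ref = F.normalize(torch.randn(d), dim=0)            # the user's reference keyword embedding
concepts = F.normalize(torch.randn(10, d), dim=1)   # candidate concept embeddings (e.g., from GPT-4)

proxy = ref.clone().requires_grad_(True)            # text proxy, initialized at the reference word
opt = torch.optim.Adam([proxy], lr=1e-2)

for step in range(200):
    p = F.normalize(proxy, dim=0)
    sim_img = (img @ p).mean()        # agree with the target image embeddings
    sim_ref = ref @ p                 # reference word constraint: stay near the keyword
    sim_con = (concepts @ p).max()    # concept-level constraint: match some candidate concept
    loss = -(sim_img + 0.5 * sim_ref + 0.5 * sim_con)   # weights are illustrative assumptions
    opt.zero_grad(); loss.backward(); opt.step()
```
The learned proxy can then be compared against the data to surface the clustering that reflects the user's keyword.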
Related papers
- Customized Multiple Clustering via Multi-Modal Subspace Proxy Learning [8.447067012487866]
We introduce Multi-Sub, a novel end-to-end multiple clustering approach that incorporates a multi-modal subspace proxy learning framework.
Our method consistently outperforms existing baselines across a broad set of datasets in visual multiple clustering tasks.
arXiv Detail & Related papers (2024-11-06T15:14:27Z)
- CDIMC-net: Cognitive Deep Incomplete Multi-view Clustering Network [53.72046586512026]
We propose a novel incomplete multi-view clustering network, called Cognitive Deep Incomplete Multi-view Clustering Network (CDIMC-net).
It captures the high-level features and local structure of each view by incorporating the view-specific deep encoders and graph embedding strategy into a framework.
Based on human cognition, i.e., learning from easy to hard, it introduces a self-paced strategy that selects the most confident samples for model training (see the sketch below).
arXiv Detail & Related papers (2024-03-28T15:45:03Z)
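A minimal sketch of such an easy-to-hard schedule, assuming per-sample losses as the confidence signal; `self_paced_mask` is a hypothetical helper, not CDIMC-net's actual implementation:
```python
import numpy as np

def self_paced_mask(losses: np.ndarray, epoch: int, n_epochs: int) -> np.ndarray:
    """Select the most confident (lowest-loss) samples, admitting more each epoch."""
    frac = min(1.0, 0.3 + 0.7 * epoch / n_epochs)   # start with the easiest 30%, grow to all
    k = max(1, int(frac * len(losses)))
    thresh = np.partition(losses, k - 1)[k - 1]     # k-th smallest loss
    return losses <= thresh                         # mask of samples to train on this epoch

losses = np.random.rand(1000)                       # stand-in per-sample clustering losses
mask = self_paced_mask(losses, epoch=0, n_epochs=50)
```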
- Incomplete Contrastive Multi-View Clustering with High-Confidence Guiding [7.305817202715752]
We propose a novel Incomplete Contrastive Multi-View Clustering method with high-confidence guiding (ICMVC).
First, a multi-view consistency relation transfer scheme combined with a graph convolutional network is proposed to tackle the missing-value problem.
Second, instance-level attention fusion and high-confidence guiding are proposed to exploit the complementary information (see the sketch below).
arXiv Detail & Related papers (2023-12-14T07:28:41Z)
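Instance-level attention fusion can be sketched as a learned softmax weighting over per-view embeddings; the single-linear scoring head and the shapes below are illustrative assumptions:
```python
import torch
import torch.nn.functional as F

def attention_fuse(views, score_head):
    """Fuse per-view embeddings with instance-level attention weights."""
    z = torch.stack(views, dim=1)                        # (batch, n_views, dim)
    alpha = F.softmax(score_head(z).squeeze(-1), dim=1)  # one weight per view per instance
    return (alpha.unsqueeze(-1) * z).sum(dim=1)          # weighted sum over views

views = [torch.randn(32, 64) for _ in range(3)]          # three views, 64-d embeddings
fused = attention_fuse(views, torch.nn.Linear(64, 1))    # (32, 64)
```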
- M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition [80.21796574234287]
M$^3$Net is a matching-based framework for few-shot fine-grained (FS-FG) action recognition.
It incorporates multi-view encoding, multi-view matching, and multi-view fusion to facilitate embedding encoding, similarity matching, and decision making.
Explainable visualizations and experimental results demonstrate the superiority of M$^3$Net in capturing fine-grained action details.
arXiv Detail & Related papers (2023-08-06T09:15:14Z)
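A toy sketch of per-view similarity matching with a simple averaging fusion; the prototype-based setup below is an assumption, not M$^3$Net's actual matching functions:
```python
import torch
import torch.nn.functional as F

def match_views(query, support):
    """Cosine-similarity matching of a query against class prototypes, per view."""
    q = F.normalize(query, dim=-1)               # (n_views, dim)
    s = F.normalize(support, dim=-1)             # (n_views, n_classes, dim)
    sims = torch.einsum('vd,vcd->vc', q, s)      # per-view class similarities
    return sims.mean(dim=0)                      # fuse views by simple averaging

logits = match_views(torch.randn(3, 64), torch.randn(3, 5, 64))
pred = logits.argmax().item()                    # predicted class for the query
```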
- Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has achieved great success in integrating information from multiple perspectives of a dataset to improve downstream task performance.
This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z)
- One-step Multi-view Clustering with Diverse Representation [47.41455937479201]
We propose a one-step multi-view clustering method with diverse representation, which incorporates multi-view learning and $k$-means into a unified framework.
We develop an efficient optimization algorithm with proven convergence to solve the resultant problem.
arXiv Detail & Related papers (2023-06-08T02:52:24Z)
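A crude stand-in for such a unified formulation: alternating between a weighted multi-view representation and $k$-means, re-weighting views by within-cluster scatter; the paper's actual objective and solver differ:
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
views = [rng.normal(size=(500, d)) for d in (20, 30)]  # two synthetic views
w = np.ones(len(views)) / len(views)                   # view weights

for _ in range(5):  # alternate weighting and k-means
    z = np.hstack([wi * v for wi, v in zip(w, views)])
    labels = KMeans(n_clusters=4, n_init=10).fit_predict(z)
    # score each view by its within-cluster scatter under the shared labels
    scatter = np.array([
        sum(((v[labels == c] - v[labels == c].mean(axis=0)) ** 2).sum() for c in range(4))
        for v in views
    ])
    w = (1.0 / scatter) / (1.0 / scatter).sum()        # favor tightly clustering views
```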
- Multi-view Semantic Consistency based Information Bottleneck for Clustering [13.589996737740208]
We introduce a novel Multi-view Semantic Consistency based Information Bottleneck for clustering (MSCIB).
MSCIB pursues semantic consistency to improve the learning process of information bottleneck for different views.
It aligns multiple views in the semantic space and jointly extracts the valuable consistent information of multi-view data.
arXiv Detail & Related papers (2023-02-28T02:01:58Z)
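The semantic-consistency idea can be sketched as pulling each view's embedding toward a cross-view consensus; this toy loss is an assumption and omits the information-bottleneck terms:
```python
import torch

def consistency_loss(view_embs):
    """Pull each view's semantic embedding toward the cross-view consensus."""
    z = torch.stack(view_embs)            # (n_views, batch, dim)
    consensus = z.mean(dim=0)             # shared semantic representation
    return ((z - consensus) ** 2).mean()  # mean squared deviation from consensus

embs = [torch.randn(16, 32, requires_grad=True) for _ in range(3)]
loss = consistency_loss(embs)
loss.backward()
```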
- Fast Multi-view Clustering via Ensembles: Towards Scalability, Superiority, and Simplicity [63.85428043085567]
We propose a fast multi-view clustering via ensembles (FastMICE) approach.
The concept of random view groups is presented to capture the versatile view-wise relationships.
FastMICE has almost linear time and space complexity and requires no dataset-specific tuning.
arXiv Detail & Related papers (2022-03-22T09:51:24Z)
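A sketch of the random-view-group ensemble idea using a co-association consensus; the quadratic co-association matrix here is for illustration only, whereas FastMICE itself is designed to stay near-linear:
```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
views = [rng.normal(size=(300, d)) for d in (10, 15, 20)]
n, n_base = 300, 20

co = np.zeros((n, n))                                       # co-association across base clusterings
for _ in range(n_base):
    group = rng.choice(len(views), size=2, replace=False)   # a random view group
    z = np.hstack([views[g] for g in group])
    labels = KMeans(n_clusters=5, n_init=3).fit_predict(z)
    co += labels[:, None] == labels[None, :]
co /= n_base   # co[i, j]: fraction of base clusterings grouping i and j together
# a final consensus could be obtained, e.g., by spectral clustering on `co`
```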
- Face, Body, Voice: Video Person-Clustering with Multiple Modalities [85.0282742801264]
Previous methods focus on the narrower task of face clustering, and most current datasets likewise evaluate only face clustering rather than person clustering.
We introduce a Video Person-Clustering dataset for evaluating multi-modal person clustering.
arXiv Detail & Related papers (2021-05-20T17:59:40Z)
- Deep Incomplete Multi-View Multiple Clusterings [41.43164409639238]
We introduce a deep incomplete multi-view multiple clusterings (DiMVMC) framework, which simultaneously completes missing data views and learns multiple shared representations.
Experiments on benchmark datasets confirm that DiMVMC outperforms the state-of-the-art competitors in generating multiple clusterings with high diversity and quality.
arXiv Detail & Related papers (2020-10-02T08:01:24Z)
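View completion can be sketched as decoding every view from a shared per-sample code and reconstructing the missing ones; the linear decoders below are a hypothetical stand-in for the paper's generators:
```python
import torch
import torch.nn as nn

n, code_dim, view_dims = 64, 16, [20, 30]
codes = nn.Parameter(torch.randn(n, code_dim))             # shared per-sample codes
decoders = nn.ModuleList([nn.Linear(code_dim, d) for d in view_dims])
observed = [torch.randn(n, view_dims[0]), None]            # view 2 is missing

opt = torch.optim.Adam([codes, *decoders.parameters()], lr=1e-2)
for _ in range(100):
    # reconstruct only the observed views; the missing one is inferred implicitly
    loss = sum(((dec(codes) - x) ** 2).mean()
               for dec, x in zip(decoders, observed) if x is not None)
    opt.zero_grad(); loss.backward(); opt.step()

completed = decoders[1](codes).detach()                    # reconstruction of the missing view
```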