Multi-modal Dynamic Proxy Learning for Personalized Multiple Clustering
- URL: http://arxiv.org/abs/2511.07274v1
- Date: Mon, 10 Nov 2025 16:21:46 GMT
- Title: Multi-modal Dynamic Proxy Learning for Personalized Multiple Clustering
- Authors: Jinfeng Xu, Zheyu Chen, Shuo Yang, Jinze Li, Ziyue Peng, Zewei Liu, Hewei Wang, Jiayi Zhang, Edith C. H. Ngai
- Abstract summary: Multiple clustering aims to discover diverse latent structures from different perspectives. Existing methods generate exhaustive clusterings without discerning user interest. We propose Multi-DProxy, a novel multi-modal dynamic proxy learning framework.
- Score: 19.73004884573164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple clustering aims to discover diverse latent structures from different perspectives, yet existing methods generate exhaustive clusterings without discerning user interest, necessitating laborious manual screening. Current multi-modal solutions suffer from static semantic rigidity: predefined candidate words fail to adapt to dataset-specific concepts, and fixed fusion strategies ignore evolving feature interactions. To overcome these limitations, we propose Multi-DProxy, a novel multi-modal dynamic proxy learning framework that leverages cross-modal alignment through learnable textual proxies. Multi-DProxy introduces (1) gated cross-modal fusion, which synthesizes discriminative joint representations by adaptively modeling feature interactions; (2) dual-constraint proxy optimization, where user-interest constraints enforce semantic consistency with domain concepts while concept constraints employ hard example mining to enhance cluster discrimination; and (3) dynamic candidate management, which refines textual proxies through iterative clustering feedback. Multi-DProxy thus not only captures a user's interest through proxies but also identifies relevant clusterings with greater precision. Extensive experiments demonstrate state-of-the-art performance with significant improvements over existing methods across a broad set of multi-clustering benchmarks.
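The first two components of Multi-DProxy map naturally onto a small amount of code. Below is a minimal, hypothetical sketch of the gated fusion and the dual-constraint proxy loss, assuming CLIP-style image/text features of a common dimension; all names, shapes, and the margin value are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCrossModalFusion(nn.Module):
    """Hypothetical sketch: a learnable gate decides, per feature dimension,
    how much of the image feature versus the textual proxy to keep."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, img_feat, proxy_feat):
        g = self.gate(torch.cat([img_feat, proxy_feat], dim=-1))  # (B, dim), values in [0, 1]
        return g * img_feat + (1.0 - g) * proxy_feat              # adaptive joint representation

def dual_constraint_proxy_loss(proxy, interest_emb, concept_embs, margin=0.2):
    """Sketch of the two constraints from the abstract: (a) keep the learnable
    proxy close to the user-interest embedding; (b) push it a margin away from
    the hardest (most similar) distractor concept. The margin value is assumed."""
    proxy = F.normalize(proxy, dim=-1)
    interest = F.normalize(interest_emb, dim=-1)
    concepts = F.normalize(concept_embs, dim=-1)     # (K, dim) candidate concept words
    pos_sim = (proxy * interest).sum(-1)             # semantic consistency with user interest
    hard_sim = (concepts @ proxy).max()              # hard example mining over candidates
    return (1.0 - pos_sim) + F.relu(hard_sim - pos_sim + margin)
```

On this reading, dynamic candidate management would periodically re-score the candidate concept words against the current cluster assignments and refresh `concept_embs`; the paper's exact feedback rule is not reproduced here.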
Related papers
- Multi-Paradigm Collaborative Adversarial Attack Against Multi-Modal Large Language Models [67.45032003041399]
We propose a novel Multi-Paradigm Collaborative Attack (MPCAttack) framework to boost the transferability of adversarial examples against MLLMs.
MPCO adaptively balances the importance of different paradigm representations and guides the global optimisation.
Our solution consistently outperforms state-of-the-art methods in both targeted and untargeted attacks on open-source and closed-source MLLMs.
arXiv Detail & Related papers (2026-03-05T06:01:26Z)
- MARTI-MARS$^2$: Scaling Multi-Agent Self-Search via Reinforcement Learning for Code Generation [64.2621682259008]
We propose a Multi-Agent Reinforced Training and Inference Framework with Self-Search Scaling (MARTI-MARS$^2$) to integrate policy learning with multi-agent tree search.
We show that MARTI-MARS$^2$ achieves 77.7%, outperforming strong baselines like GPT-5.1 on challenging code generation benchmarks.
arXiv Detail & Related papers (2026-02-08T07:28:44Z)
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces.
By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process.
We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z)
- Federated Multi-Task Clustering [44.73672172790804]
This paper proposes a novel framework named Federated Multi-Task Clustering (FMTC).
It is composed of two main components: a client-side personalized clustering module and a server-side tensorial correlation module.
We derive an efficient, privacy-preserving distributed algorithm based on the Alternating Direction Method of Multipliers.
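The summary names the Alternating Direction Method of Multipliers but gives no details. The following is a generic consensus-ADMM sketch for keeping client-side cluster centroids near a server-side consensus; every name and update rule here is an assumption, and the paper's tensorial correlation module is not modeled.

```python
import numpy as np

def federated_admm_round(local_centroids, z, duals, rho=1.0):
    """One hedged consensus-ADMM round: each client pulls its centroids toward
    the server consensus z, the server averages, and dual variables track the
    remaining disagreement. Purely illustrative, not the paper's algorithm."""
    updated = []
    for c, u in zip(local_centroids, duals):
        # client step: compromise between the local solution and the consensus
        updated.append((c + rho * (z - u)) / (1.0 + rho))
    z = np.mean([x + u for x, u in zip(updated, duals)], axis=0)  # server step
    duals = [u + x - z for x, u in zip(updated, duals)]           # dual ascent
    return updated, z, duals
```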
arXiv Detail & Related papers (2025-12-28T12:02:32Z)
- UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception [54.53657134205492]
UniAlignment is a unified multimodal generation framework within a single diffusion transformer.
It incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness.
We present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions.
arXiv Detail & Related papers (2025-09-28T09:11:30Z)
- MMQ: Multimodal Mixture-of-Quantization Tokenization for Semantic ID Generation and User Behavioral Adaptation [16.81485354427923]
We propose Multimodal Mixture-of-Quantization (MMQ), a two-stage framework that trains a novel multimodal tokenizer.
MMQ unifies multimodal synergy, specificity, and behavioral adaptation, providing a scalable and versatile solution for both generative retrieval and discriminative ranking tasks.
arXiv Detail & Related papers (2025-08-21T06:15:49Z)
- Maximal Matching Matters: Preventing Representation Collapse for Robust Cross-Modal Retrieval [0.5999777817331317]
Cross-modal image-text retrieval is challenging because of the diverse possible associations between content from different modalities.
Traditional methods learn a single-vector embedding to represent the semantics of each sample.
Set-based approaches, which represent each sample with multiple embeddings, offer a promising alternative.
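To make the set-based idea concrete, here is a hedged sketch in which each sample carries K embeddings and similarity is scored by a maximal one-to-one matching, echoing the paper's title; the exact objective and any collapse-prevention terms are not reproduced.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def set_similarity(img_set: np.ndarray, txt_set: np.ndarray) -> float:
    """Similarity between two samples represented as *sets* of embeddings
    (K x d each), scored by a maximal one-to-one matching rather than a
    single-vector dot product. An illustrative sketch only."""
    img = img_set / np.linalg.norm(img_set, axis=1, keepdims=True)
    txt = txt_set / np.linalg.norm(txt_set, axis=1, keepdims=True)
    sim = img @ txt.T                          # pairwise cosine similarities (K x K)
    rows, cols = linear_sum_assignment(-sim)   # maximize total matched similarity
    return float(sim[rows, cols].mean())
```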
arXiv Detail & Related papers (2025-06-26T17:55:34Z)
- MCFNet: A Multimodal Collaborative Fusion Network for Fine-Grained Semantic Classification [2.7936465461948945]
The Multimodal Collaborative Fusion Network (MCFNet) is designed for fine-grained classification.
The MCFNet architecture incorporates a regularized integrated fusion module that improves intra-modal feature representation.
A multimodal decision classification module exploits inter-modal correlations and unimodal discriminative features.
arXiv Detail & Related papers (2025-05-29T11:42:57Z)
- Multi Activity Sequence Alignment via Implicit Clustering [50.3168866743067]
We propose a novel framework that overcomes these limitations by performing sequence alignment via implicit clustering.
Specifically, our key idea is to perform implicit clip-level clustering while aligning frames in sequences.
Our experiments show that the proposed method outperforms the state of the art.
arXiv Detail & Related papers (2025-03-16T14:28:46Z)
- Customized Multiple Clustering via Multi-Modal Subspace Proxy Learning [8.447067012487866]
We introduce Multi-Sub, a novel end-to-end multiple clustering approach that incorporates a multi-modal subspace proxy learning framework.
Our method consistently outperforms existing baselines across a broad set of datasets in visual multiple clustering tasks.
arXiv Detail & Related papers (2024-11-06T15:14:27Z)
- Unified Generative and Discriminative Training for Multi-modal Large Language Models [88.84491005030316]
Generative training has enabled Vision-Language Models (VLMs) to tackle various complex tasks.
Discriminative training, exemplified by models like CLIP, excels in zero-shot image-text classification and retrieval.
This paper proposes a unified approach that integrates the strengths of both paradigms.
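One plausible reading of integrating both paradigms is a weighted sum of a CLIP-style contrastive loss and a generative next-token loss; the sketch below illustrates that combination, with the weighting scheme and tensor shapes being assumptions rather than the paper's recipe.

```python
import torch
import torch.nn.functional as F

def unified_loss(img_emb, txt_emb, lm_logits, lm_targets, temperature=0.07, alpha=0.5):
    """Hedged sketch of joint training: a CLIP-style contrastive (discriminative)
    term plus a next-token (generative) term. alpha and temperature are assumed."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature                   # (B, B) similarity matrix
    labels = torch.arange(img.size(0), device=img.device)  # matched pairs on the diagonal
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))
    generative = F.cross_entropy(lm_logits.flatten(0, 1), lm_targets.flatten())
    return alpha * contrastive + (1 - alpha) * generative
```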
arXiv Detail & Related papers (2024-11-01T01:51:31Z)
- Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering [8.447067012487866]
Multi-MaP is a novel method employing a multi-modal proxy learning process.
It not only captures a user's interest via a keyword but also facilitates identifying relevant clusterings.
Our experiments show that Multi-MaP consistently outperforms state-of-the-art methods in all benchmark multi-clustering vision tasks.
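The keyword-driven mechanism can be pictured as scoring images against a text embedding of the user's keyword. The sketch below shows that baseline step using a CLIP-style interface; Multi-MaP itself learns a proxy rather than using the raw keyword embedding, and the function and parameter names here are hypothetical.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def keyword_proxy_scores(clip_model, clip_tokenize, image_feats, keyword: str):
    """Encode a user keyword with a CLIP-style text encoder and score images
    against it. clip_model/clip_tokenize are assumed to expose the usual CLIP
    interface (encode_text, tokenize); an illustrative baseline, not Multi-MaP."""
    tokens = clip_tokenize([f"a photo with {keyword}"])
    proxy = F.normalize(clip_model.encode_text(tokens), dim=-1)   # (1, d)
    feats = F.normalize(image_feats, dim=-1)                      # (N, d)
    return (feats @ proxy.t()).squeeze(-1)   # per-image relevance to the keyword
```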
arXiv Detail & Related papers (2024-04-24T05:20:42Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
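An early-fusion, single-stream encoder of the kind the summary describes can be sketched as: project each modality's token sequence to a shared width, concatenate along the sequence axis, and run one encoder. The dimensions and transformer configuration below are illustrative guesses, not UmURL's architecture.

```python
import torch
import torch.nn as nn

class EarlyFusionEncoder(nn.Module):
    """Single-stream early fusion: per-modality projections into a shared width,
    concatenation along the sequence axis, then one shared encoder."""
    def __init__(self, dims=(75, 150), width=256, heads=4, layers=2):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(d, width) for d in dims)
        layer = nn.TransformerEncoderLayer(width, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, modalities):
        # modalities: list of (B, T_i, dims[i]) tensors, e.g. joint and motion streams
        tokens = torch.cat([p(x) for p, x in zip(self.proj, modalities)], dim=1)
        return self.encoder(tokens).mean(dim=1)   # one joint representation per sample
```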
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Preserving Modality Structure Improves Multi-Modal Learning [64.10085674834252]
Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings without relying on human annotations.
These methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings.
We propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space.
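A minimal sketch of the consistency idea, assuming the goal is that pairwise similarities computed in a modality-specific space survive in the joint space; the paper's actual anchors and regularization are not reproduced.

```python
import torch
import torch.nn.functional as F

def structure_preserving_loss(modal_emb, joint_emb):
    """Match the relational structure (pairwise cosine similarities) of a
    modality-specific space in the joint space, rather than the vectors
    themselves. An illustrative reading of the consistency objective."""
    m = F.normalize(modal_emb, dim=-1)
    j = F.normalize(joint_emb, dim=-1)
    return F.mse_loss(j @ j.t(), (m @ m.t()).detach())
```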
arXiv Detail & Related papers (2023-08-24T20:46:48Z)
- Learning Deep Multimodal Feature Representation with Asymmetric Multi-layer Fusion [63.72912507445662]
We propose a compact and effective framework to fuse multimodal features at multiple layers in a single network.
We verify that multimodal features can be learnt within a shared single network by merely maintaining modality-specific batch normalization layers in the encoder.
We further propose a bidirectional multi-layer fusion scheme in which multimodal features can be exploited progressively.
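The first claim, modality-specific batch normalization inside an otherwise shared encoder, is easy to sketch; everything below (channel sizes, activation) is illustrative rather than the paper's network.

```python
import torch
import torch.nn as nn

class SharedEncoderModalityBN(nn.Module):
    """One shared conv encoder, but a separate BatchNorm per modality so each
    modality keeps its own statistics. A sketch of the summary's claim."""
    def __init__(self, channels=64, num_modalities=2):
        super().__init__()
        self.conv = nn.Conv2d(3, channels, 3, padding=1)   # weights shared across modalities
        self.bns = nn.ModuleList(nn.BatchNorm2d(channels) for _ in range(num_modalities))

    def forward(self, x: torch.Tensor, modality: int) -> torch.Tensor:
        return torch.relu(self.bns[modality](self.conv(x)))
```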
arXiv Detail & Related papers (2021-08-11T03:42:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.