Related papers: From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts

URL: http://arxiv.org/abs/2511.22897v1
Date: Fri, 28 Nov 2025 06:03:35 GMT
Title: From Points to Clouds: Learning Robust Semantic Distributions for Multi-modal Prompts
Authors: Weiran Li, Yeqiang Liu, Yijie Wei, Mina Han, Xin Liu, Zhenbo Li
Abstract summary: Multimodal Prompt Learning (MPL) has emerged as a pivotal technique for adapting large-scale Visual Language Models (VLMs). We introduce Points-to-Clouds (P2C), a novel framework inspired by diffusion models that reframes prompt learning as a dynamic denoising task. P2C consistently outperforms strong baselines in experiments across 11 datasets.
Score: 11.693848445032259
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Prompt Learning (MPL) has emerged as a pivotal technique for adapting large-scale Visual Language Models (VLMs). However, current MPL methods are fundamentally limited by their optimization of a single, static point representation. This paradigm is inherently brittle, leads to overfitting on base classes, and generalizes poorly to novel or ambiguous categories. We challenge this point paradigm, proposing that robust generalization requires learning a semantic cloud (i.e., a distribution over the embedding space). To achieve this, we introduce Points-to-Clouds (P2C), a novel framework inspired by diffusion models that reframes prompt learning as a dynamic denoising task. At the core of P2C is a dual denoising mechanism: a Dynamic Prompt Denoising (DPD) mechanism perturbs text prompts with sophisticated, annealed noise to learn a smoother semantic landscape, while an auxiliary V-L Mapper denoising loss re-tasks the mapper as a denoising autoencoder. This forces the mapper to reconstruct clean visual prompts from noisy text inputs, ensuring robust cross-modal alignment. Extensive experiments across 11 datasets demonstrate that P2C consistently outperforms strong baselines. On the base-to-novel generalization benchmark, our method achieves a Harmonic Mean of 79.7%, representing a relative improvement of 1.4% over the baseline. The code and models are available at https://vranlee.github.io/P2C/.
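
The dual denoising idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the cosine annealing schedule, the isotropic Gaussian noise, the linear mapper, the mean-squared-error loss, and every function and parameter name below are assumptions chosen for illustration, since the abstract does not specify them. The sketch perturbs a text prompt vector with noise whose scale decays over training, then measures how far a hypothetical V-L mapper's output on the noisy prompt drifts from its output on the clean prompt.

```python
import math
import random


def annealed_sigma(step, total_steps, sigma_max=0.1, sigma_min=0.0):
    """Assumed cosine schedule: sigma_max at step 0, sigma_min at the last step."""
    t = step / max(total_steps - 1, 1)
    return sigma_min + 0.5 * (sigma_max - sigma_min) * (1.0 + math.cos(math.pi * t))


def perturb(prompt, sigma, rng):
    """Add isotropic Gaussian noise with standard deviation sigma to a prompt vector."""
    return [x + rng.gauss(0.0, sigma) for x in prompt]


def linear_mapper(weights, text_prompt):
    """Hypothetical V-L mapper: a plain linear map from text to visual prompt space."""
    return [sum(w * x for w, x in zip(row, text_prompt)) for row in weights]


def denoising_loss(weights, clean_text, noisy_text):
    """MSE between the mapper output on noisy text and on clean text.

    Assumption: the "clean visual prompt" target is the mapper applied
    to the unperturbed text prompt.
    """
    target = linear_mapper(weights, clean_text)
    pred = linear_mapper(weights, noisy_text)
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # toy 2-d text to 3-d visual map
    clean = [0.2, -0.4]
    total_steps = 5
    for step in range(total_steps):
        sigma = annealed_sigma(step, total_steps)
        noisy = perturb(clean, sigma, rng)
        loss = denoising_loss(weights, clean, noisy)
        print(f"step={step} sigma={sigma:.4f} loss={loss:.6f}")
```

In a real training loop this auxiliary loss would presumably be added to the usual classification objective and minimized with respect to the mapper weights and prompts; the weighting between the two terms is not given in the abstract.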

Related papers

E2PL: Effective and Efficient Prompt Learning for Incomplete Multi-view Multi-Label Class Incremental Learning [23.648354515768734]
We introduce E2PL, an effective and efficient prompt learning framework for IMvMLCIL. We show that E2PL consistently outperforms state-of-the-art methods in both effectiveness and efficiency.
arXiv Detail & Related papers (2026-01-23T03:30:47Z)
DynaPURLS: Dynamic Refinement of Part-aware Representations for Skeleton-based Zero-Shot Action Recognition [51.80782323686666]
We introduce DynaPURLS, a unified framework that establishes robust, multi-scale visual-semantic correspondences. Our framework leverages a large language model to generate hierarchical textual descriptions that encompass both global movements and local body-part dynamics. Experiments on three large-scale benchmark datasets, including NTU RGB+D 60/120 and PKU-MMD, demonstrate that DynaPURLS significantly outperforms prior art.
arXiv Detail & Related papers (2025-12-12T10:39:10Z)
DeepInsert: Early Layer Bypass for Efficient and Performant Multimodal Understanding [26.39397960987363]
We propose a simple modification to pretrained transformer models. Instead of concatenation with the language prompt at the start, we insert multimodal tokens directly into the middle. Our results indicate that our method reduces computational costs during both training and inference.
arXiv Detail & Related papers (2025-04-27T18:56:26Z)
DepthMaster: Taming Diffusion Models for Monocular Depth Estimation [41.81343543266191]
We propose a single-step diffusion model designed to adapt generative features for the discriminative depth estimation task. We adopt a two-stage training strategy to fully leverage the potential of the two modules. Our model achieves state-of-the-art performance in terms of generalization and detail preservation, outperforming other diffusion-based methods across various datasets.
arXiv Detail & Related papers (2025-01-05T15:18:32Z)
Enhance Vision-Language Alignment with Noise [59.2608298578913]
We investigate whether the frozen model can be fine-tuned by customized noise. We propose Positive-incentive Noise (PiNI) which can fine-tune CLIP via injecting noise into both visual and text encoders.
arXiv Detail & Related papers (2024-12-14T12:58:15Z)
Policy Gradient-Driven Noise Mask [3.69758875412828]
We propose a novel pretraining pipeline that learns to generate conditional noise masks specifically tailored to improve performance on multi-modal and multi-organ datasets. A key aspect is that the policy network's role is limited to obtaining an intermediate (or heated) model before fine-tuning. Results demonstrate that fine-tuning the intermediate models consistently outperforms conventional training algorithms on both classification and generalization to unseen concept tasks.
arXiv Detail & Related papers (2024-04-29T23:53:42Z)
V2A-Mapper: A Lightweight Solution for Vision-to-Audio Generation by Connecting Foundation Models [14.538853403226751]
Building artificial intelligence systems on top of a set of foundation models (FMs) is becoming a new paradigm in AI research. We propose a lightweight solution to this problem by leveraging foundation models, specifically CLIP, CLAP, and AudioLDM. Our method only requires a quick training of the V2A-Mapper to produce high-fidelity and visually-aligned sound.
arXiv Detail & Related papers (2023-08-18T04:49:38Z)
LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models [67.19124099815645]
We propose a novel Language-Aware Soft Prompting (LASP) learning method to alleviate base class overfitting. LASP is inherently amenable to including, during training, virtual classes, i.e. class names for which no visual samples are available. LASP matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets.
arXiv Detail & Related papers (2022-10-03T17:56:35Z)
Decoupled Multi-task Learning with Cyclical Self-Regulation for Face Parsing [71.19528222206088]
We propose a novel Decoupled Multi-task Learning with Cyclical Self-Regulation for face parsing. Specifically, DML-CSR designs a multi-task model which comprises face parsing, binary edge, and category edge detection. Our method achieves the new state-of-the-art performance on the Helen, CelebA-HQ, and LapaMask datasets.
arXiv Detail & Related papers (2022-03-28T02:12:30Z)
Virtual Data Augmentation: A Robust and General Framework for Fine-tuning Pre-trained Models [51.46732511844122]
Powerful pre-trained language models (PLM) can be fooled by small perturbations or intentional attacks. We present Virtual Data Augmentation (VDA), a general framework for robustly fine-tuning PLMs. Our approach is able to improve the robustness of PLMs and alleviate the performance degradation under adversarial attacks.
arXiv Detail & Related papers (2021-09-13T09:15:28Z)
ANIMC: A Soft Framework for Auto-weighted Noisy and Incomplete Multi-view Clustering [59.77141155608009]
We propose a novel Auto-weighted Noisy and Incomplete Multi-view Clustering framework (ANIMC) via a soft auto-weighted strategy and a doubly soft regular regression model. ANIMC has three unique advantages: 1) it is a soft algorithm to adjust our framework in different scenarios, thereby improving its generalization ability; 2) it automatically learns a proper weight for each view, thereby reducing the influence of noises; and 3) it aligns the same instances in different views, thereby decreasing the impact of missing instances.
arXiv Detail & Related papers (2020-11-20T10:37:27Z)
Prior Guided Feature Enrichment Network for Few-Shot Segmentation [64.91560451900125]
State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results. Few-shot segmentation is proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples. These frameworks still face the challenge of generalization ability reduction on unseen classes due to inappropriate use of high-level semantic information.
arXiv Detail & Related papers (2020-08-04T10:41:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.