Related papers: Cross-Domain Few-Shot Learning via Multi-View Collaborative Optimization with Vision-Language Models

Cross-Domain Few-Shot Learning via Multi-View Collaborative Optimization with Vision-Language Models

URL: http://arxiv.org/abs/2508.12861v1
Date: Mon, 18 Aug 2025 12:00:09 GMT
Title: Cross-Domain Few-Shot Learning via Multi-View Collaborative Optimization with Vision-Language Models
Authors: Dexia Chen, Wentao Zhang, Qianjie Zhu, Ping Hu, Weibing Li, Tong Zhang, Ruixuan Wang,
Abstract summary: Vision-language models (VLMs) pre-trained on natural image and language data, such as CLIP, have exhibited significant potential in few-shot image recognition tasks.<n>We propose Consistency-guided Multi-view Collaborative Optimization (CoMuCo), a novel fine-tuning strategy for VLMs.
Score: 37.63573703440172
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language models (VLMs) pre-trained on natural image and language data, such as CLIP, have exhibited significant potential in few-shot image recognition tasks, leading to development of various efficient transfer learning methods. These methods exploit inherent pre-learned knowledge in VLMs and have achieved strong performance on standard image datasets. However, their effectiveness is often limited when confronted with cross-domain tasks where imaging domains differ from natural images. To address this limitation, we propose Consistency-guided Multi-view Collaborative Optimization (CoMuCo), a novel fine-tuning strategy for VLMs. This strategy employs two functionally complementary expert modules to extract multi-view features, while incorporating prior knowledge-based consistency constraints and information geometry-based consensus mechanisms to enhance the robustness of feature learning. Additionally, a new cross-domain few-shot benchmark is established to help comprehensively evaluate methods on imaging domains distinct from natural images. Extensive empirical evaluations on both existing and newly proposed benchmarks suggest CoMuCo consistently outperforms current methods in few-shot tasks. The code and benchmark will be released.

Related papers

Reconstructing Content via Collaborative Attention to Improve Multimodal Embedding Quality [59.651410243721045]
CoCoA is a Content reconstruction pre-training paradigm based on Collaborative Attention for multimodal embedding optimization.<n>We introduce an EOS-based reconstruction task, encouraging the model to reconstruct input from the corresponding EOS> embeddings.<n>Experiments on MMEB-V1 demonstrate that CoCoA built upon Qwen2-VL and Qwen2.5-VL significantly improves embedding quality.
arXiv Detail & Related papers (2026-03-02T05:34:45Z)
More Images, More Problems? A Controlled Analysis of VLM Failure Modes [80.64323947730905]
Large Vision Language Models (LVLMs) have demonstrated remarkable capabilities, yet their proficiency in understanding and reasoning over multiple images remains largely unexplored.<n>We introduce MIMIC, a new benchmark designed to rigorously evaluate the multi-image capabilities of LVLMs.
arXiv Detail & Related papers (2026-01-12T18:45:13Z)
DCSCR: A Class-Specific Collaborative Representation based Network for Image Set Classification [14.11016012242278]
This paper proposes a novel few-shot ISC approach called Deep Class-specific Collaborative Representation (DCSCR) network.<n>DCSCR consists of a fully convolutional deep feature extractor module, a global feature learning module, and a class-specific collaborative representation-based metric learning module.<n>Experiments on several well-known few-shot ISC datasets demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2025-08-18T09:09:55Z)
PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning [50.21619363035618]
We propose a general reinforcement learning approach PeRL tailored for interleaved multimodal tasks.<n>We introduce permutation of image sequences to simulate varied positional relationships to explore more spatial and positional diversity.<n>Our experiments confirm that PeRL trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin.
arXiv Detail & Related papers (2025-06-17T18:25:56Z)
CMIP-CIL: A Cross-Modal Benchmark for Image-Point Class Incremental Learning [10.936166435599572]
Image-point class incremental learning helps the 3D-points-vision robots continually learn category knowledge from 2D images.<n>We first explore this cross-modal task, proposing a benchmark CMIP-CIL and relieving the cross-modal catastrophic forgetting problem.<n>It employs masked point clouds and rendered multi-view images within a contrastive learning framework in pre-training, empowering the vision model with the generalizations of image-point correspondence.
arXiv Detail & Related papers (2025-04-11T10:28:29Z)
PROMPT-IML: Image Manipulation Localization with Pre-trained Foundation Models Through Prompt Tuning [35.39822183728463]
We present a novel Prompt-IML framework for detecting tampered images. Humans tend to discern authenticity of an image based on semantic and high-frequency information. Our model can achieve better performance on eight typical fake image datasets.
arXiv Detail & Related papers (2024-01-01T03:45:07Z)
Improving Human-Object Interaction Detection via Virtual Image Learning [68.56682347374422]
Human-Object Interaction (HOI) detection aims to understand the interactions between humans and objects. In this paper, we propose to alleviate the impact of such an unbalanced distribution via Virtual Image Leaning (VIL) A novel label-to-image approach, Multiple Steps Image Creation (MUSIC), is proposed to create a high-quality dataset that has a consistent distribution with real images.
arXiv Detail & Related papers (2023-08-04T10:28:48Z)
MOCA: Self-supervised Representation Learning by Predicting Masked Online Codebook Assignments [72.6405488990753]
Self-supervised learning can be used for mitigating the greedy needs of Vision Transformer networks. We propose a single-stage and standalone method, MOCA, which unifies both desired properties. We achieve new state-of-the-art results on low-shot settings and strong experimental results in various evaluation protocols.
arXiv Detail & Related papers (2023-07-18T15:46:20Z)
Domain Generalization for Mammographic Image Analysis with Contrastive Learning [62.25104935889111]
The training of an efficacious deep learning model requires large data with diverse styles and qualities. A novel contrastive learning is developed to equip the deep learning models with better style generalization capability. The proposed method has been evaluated extensively and rigorously with mammograms from various vendor style domains and several public datasets.
arXiv Detail & Related papers (2023-04-20T11:40:21Z)
Multimodal Contrastive Training for Visual Representation Learning [45.94662252627284]
We develop an approach to learning visual representations that embraces multimodal data. Our method exploits intrinsic data properties within each modality and semantic information from cross-modal correlation simultaneously. By including multimodal training in a unified framework, our method can learn more powerful and generic visual features.
arXiv Detail & Related papers (2021-04-26T19:23:36Z)
Learning Deformable Image Registration from Optimization: Perspective, Modules, Bilevel Training and Beyond [62.730497582218284]
We develop a new deep learning based framework to optimize a diffeomorphic model via multi-scale propagation. We conduct two groups of image registration experiments on 3D volume datasets including image-to-atlas registration on brain MRI data and image-to-image registration on liver CT data.
arXiv Detail & Related papers (2020-04-30T03:23:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.