Fugu-MT Paper Translation (Abstract): Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning

Paper overview: Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning

arxiv url: http://arxiv.org/abs/2603.13341v1
Date: Sat, 07 Mar 2026 03:59:23 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:57.785918
Title: Mind the Discriminability Trap in Source-Free Cross-domain Few-shot Learning
Title (reference translation): Being Aware of the Discriminability Trap in Source-Free Cross-Domain Few-Shot Learning
Authors: Zhenyu Zhang, Yixiong Zou, Yuhua Li, Ruixuan Li, Guangyao Chen
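Abstract summary: Source-Free Cross-Domain Few-Shot Learning focuses on fine-tuning with limited training data from target domains. Strengthening visual discriminability actually suppresses VLM performance. As a first step, the authors propose perturbing the visual learning to guide the model to focus on cross-modal alignment.
Reference score (independently computed attention score): 30.80780619903459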
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Current work on traditional visual models suggests that improving visual discriminability enhances performance. However, in VLM-based SF-CDFSL tasks, we find that strengthening visual-modal discriminability actually suppresses VLMs' performance. In this paper, we investigate this phenomenon to provide an interpretation and a solution. Through both theoretical analysis and experiments, our study reveals that fine-tuning with the typical cross-entropy loss ($\mathcal{L}_{\mathrm{vlm}}$) inherently includes a visual learning part and a cross-modal learning part, where the cross-modal part is crucial for rectifying the heavily disrupted modality alignment in SF-CDFSL. However, we find that the visual learning essentially acts as a shortcut that encourages the model to reduce $\mathcal{L}_{\mathrm{vlm}}$ without considering the cross-modal part, thereby hindering the cross-modal alignment and harming performance. Based on this interpretation, we further propose an approach to address this problem: first, we perturb the visual learning to guide the model to focus on the cross-modal alignment. Then, we use the visual-text semantic relationships to gradually align the visual and textual modalities during fine-tuning. Extensive experiments on various settings, backbones (CLIP, SigLIP, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that we consistently set new state-of-the-art results. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-Mind-the-Discriminability-Trap.
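Abstract (reference translation): Source-Free Cross-Domain Few-Shot Learning (SF-CDFSL) focuses on fine-tuning with limited training data from target domains (e.g., medical or satellite images), where Vision-Language Models (VLMs) such as CLIP and SigLIP have shown promising results. Current work on traditional visual models suggests that improving visual discriminability improves performance. However, in VLM-based SF-CDFSL tasks, strengthening visual-modal discriminability actually suppresses VLM performance. This paper aims to examine this phenomenon in order to interpret and resolve it. Theoretical and experimental evidence shows that fine-tuning with the typical cross-entropy loss ($\mathcal{L}_{\mathrm{vlm}}$) inherently includes a visual learning part and a cross-modal learning part. However, the visual learning essentially acts as a shortcut that reduces $\mathcal{L}_{\mathrm{vlm}}$ without considering the cross-modal part, thereby hindering cross-modal alignment and harming performance. Based on this interpretation, the authors further propose an approach to address this problem: first, the visual learning is perturbed so that the model focuses on cross-modal alignment. Then, visual-text semantic relationships are used to gradually align the visual and textual modalities during fine-tuning. Extensive experiments on various settings, backbones (CLIP, SigLIP, PE-Core), and tasks (4 CDFSL datasets and 11 FSL datasets) show that the method consistently sets new state-of-the-art results. Code is available at https://github.com/zhenyuZ-HUST/CVPR26-Mind-the-Discriminability-Trap.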
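
Illustrative note: the abstract refers to a "typical cross-entropy loss" $\mathcal{L}_{\mathrm{vlm}}$ but does not state its form. The sketch below shows one common CLIP-style formulation, assuming cosine-similarity logits between image features and per-class text features scaled by a fixed temperature. The function names, the `logit_scale` value, and the use of NumPy are assumptions for illustration; this is not taken from the authors' code and does not reproduce the paper's decomposition into visual and cross-modal parts.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit L2 norm (eps guards against zero vectors)."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def vlm_cross_entropy(image_feats, text_feats, labels, logit_scale=100.0):
    """Hypothetical CLIP-style classification loss.

    image_feats: (N, D) image embeddings
    text_feats:  (C, D) one text embedding per class
    labels:      (N,) integer class indices in [0, C)
    logit_scale: assumed temperature multiplier on cosine similarity
    """
    img = l2_normalize(np.asarray(image_feats, dtype=np.float64))
    txt = l2_normalize(np.asarray(text_feats, dtype=np.float64))
    labels = np.asarray(labels)

    # Cosine similarity between every image and every class text embedding.
    logits = logit_scale * (img @ txt.T)  # shape (N, C)

    # Numerically stable log-softmax over classes.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

    # Mean negative log-likelihood of the correct class.
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

In this formulation the image features enter the loss only through their similarity to the text features, which is one way to picture how a single objective could mix image-side and image-text learning signals.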


Related papers