Fugu-MT Paper Translation (Summary): Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment

Paper summary: Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment

arxiv url: http://arxiv.org/abs/2603.17655v1
Date: Wed, 18 Mar 2026 12:20:21 GMT
Status: Translation complete
Last updated in system: 2026-03-19 18:32:57.690532
Title: Interpretable Cross-Domain Few-Shot Learning with Rectified Target-Domain Local Alignment
Title (reference translation): Interpretable Cross-Domain Few-Shot Learning Using Rectified Target-Domain Local Alignment
Authors: Yaze Zhao, Yixiong Zou, Yuhua Li, Ruixuan Li
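Abstract summary: Cross-Domain Few-Shot Learning adapts models trained on large-scale general data (source domain) to downstream target domains using only scarce training data. CLIP models can hardly focus on the fine-grained visual cues needed for interpretable recognition. Because supervision for aligning local visual features with text semantics is lacking, we turn to self-supervision information to address this problem.
Reference score (independently computed attention score): 19.113214017897118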
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cross-Domain Few-Shot Learning (CDFSL) adapts models trained with large-scale general data (source domain) to downstream target domains with only scarce training data, an area in which research on vision-language models (e.g., CLIP) is still in its early stages. Typical downstream domains, such as medical diagnosis, require fine-grained visual cues for interpretable recognition, but we find that current fine-tuned CLIP models can hardly focus on these cues, although they can roughly focus on important regions in source domains. Although current works have demonstrated CLIP's shortcomings in capturing local subtle patterns, in this paper we find that the domain gap and scarce training data further exacerbate these shortcomings, much more than they do for holistic patterns; we call this the local misalignment problem in CLIP-based CDFSL. Because supervision for aligning local visual features and text semantics is lacking, we address this problem with self-supervision information. Inspired by the translation task, we propose the CC-CDFSL method with cycle consistency, which translates local visual features into text features and then translates them back into visual features (and vice versa), and constrains the original features to stay close to the translated-back features. To reduce the noise introduced by the richer information in the visual modality, we further propose a Semantic Anchor mechanism, which first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mapping. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show that we can (1) effectively improve the local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions by visualizing patches, and (3) achieve state-of-the-art performance.
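Abstract (reference translation): Cross-Domain Few-Shot Learning (CDFSL) adapts models trained on large-scale general data (source domain) to downstream target domains using only limited training data, and research on vision-language models (e.g., CLIP) in this setting is still at an early stage. Typical downstream domains such as medical diagnosis need fine-grained visual cues for interpretable recognition, yet current fine-tuned CLIP models can hardly focus on these cues. CLIP's weakness in capturing subtle local patterns is already known; this paper finds that the domain gap and scarce training data worsen it further, and more so than for holistic patterns, which is called the local misalignment problem in CLIP-based CDFSL. Because supervision for aligning local visual features with text semantics is lacking, the authors turn to self-supervision information. Inspired by the translation task, the CC-CDFSL method translates local visual features into text features and then back into visual features (and vice versa), and constrains the original features to be close to the translated-back features. To reduce the noise introduced by the richer information in the visual modality, a Semantic Anchor mechanism is proposed that first augments visual features to provide a larger corpus for the text-to-image mapping, and then shrinks the image features to filter out irrelevant image-to-text mapping. Extensive experiments on various benchmarks, backbones, and fine-tuning methods show that the approach can (1) effectively improve local vision-language alignment, (2) enhance the interpretability of learned patterns and model decisions through patch visualization, and (3) achieve state-of-the-art performance.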
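The cycle-consistency idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' CC-CDFSL implementation: the soft, similarity-weighted "translation", the cosine-distance penalty, the temperature value, and every function name below are hypothetical choices made for this example, and the Semantic Anchor mechanism is omitted.

```python
import math

def normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores, temperature):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def translate(queries, corpus, temperature=0.1):
    """ASSUMPTION: 'translation' is a similarity-weighted average of the
    other modality's features, re-normalized to unit length."""
    dim = len(corpus[0])
    out = []
    for q in queries:
        w = softmax([dot(q, c) for c in corpus], temperature)
        mixed = [sum(wi * c[d] for wi, c in zip(w, corpus)) for d in range(dim)]
        out.append(normalize(mixed))
    return out

def one_way_cycle_loss(source, target, temperature=0.1):
    """Translate source -> target -> source and penalize the mean cosine
    distance between each original feature and its round-trip version."""
    forward = translate(source, target, temperature)
    back = translate(forward, source, temperature)
    return sum(1.0 - dot(s, b) for s, b in zip(source, back)) / len(source)

def cycle_consistency_loss(patches, texts, temperature=0.1):
    """Sum of both directions: visual -> text -> visual and the reverse."""
    patches = [normalize(p) for p in patches]
    texts = [normalize(t) for t in texts]
    return (one_way_cycle_loss(patches, texts, temperature)
            + one_way_cycle_loss(texts, patches, temperature))
```

In this sketch, `patches` stands for local (patch-level) visual features and `texts` for text features in a shared embedding space. The loss is close to zero when every feature survives the round trip, and it grows when distinct patches collapse onto the same text feature and cannot be recovered.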

Related papers