Fugu-MT Paper Translation (Summary): DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning

Paper summary: DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning

arxiv url: http://arxiv.org/abs/2602.00795v1
Date: Sat, 31 Jan 2026 16:09:37 GMT
Status: Translation complete
Last updated in system: 2026-02-03 19:28:33.406229
Title: DVLA-RL: Dual-Level Vision-Language Alignment with Reinforcement Learning Gating for Few-Shot Learning
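Title (reference translation): DVLA-RL: Dual-Level Vision-Language Alignment Using Reinforcement Learning Gating for Few-Shot Learning
Authors: Wenhao Li, Xianjing Meng, Qiangchang Wang, Zhongyi Han, Zhibin Wu, Yilong Yin
Abstract summary: Few-shot learning aims to generalize to novel categories with only a few samples. Recent approaches incorporate large language models to enrich visual representations with semantic embeddings derived from class names. The authors propose Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL).
Reference score (independently computed attention metric): 53.36809572236361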
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Few-shot learning (FSL) aims to generalize to novel categories with only a few samples. Recent approaches incorporate large language models (LLMs) to enrich visual representations with semantic embeddings derived from class names. However, they overlook progressive and adaptive alignment between vision and language from low-level to high-level semantics, resulting in limited semantic gains. To address these challenges, we propose Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL), which consists of Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA). Specifically, DSC conditions LLMs on both class names and support samples to generate discriminative attributes, progressively selects the most relevant ones, and then synthesizes them into coherent class descriptions. This process provides complementary low-level attributes and high-level descriptions, enabling both fine-grained grounding and holistic class understanding. To dynamically integrate dual-level semantics along with the visual network layers, RLA formulates cross-modal fusion as a sequential decision process. A lightweight policy trained with episodic REINFORCE adaptively adjusts the contributions of self-attention and cross-attention to integrate textual and visual tokens. As a result, shallow layers refine local attributes and deep layers emphasize global semantics, enabling more precise cross-modal alignment. This achieves class-specific discrimination and generalized representations with merely a few support samples. DVLA-RL achieves new state-of-the-art performance across nine benchmarks in three diverse FSL scenarios.
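Abstract (reference translation): Few-shot learning (FSL) aims to generalize to novel categories from only a few samples. Recent approaches incorporate large language models (LLMs) to enrich visual representations with semantic embeddings derived from class names. However, they overlook progressive and adaptive alignment between vision and language from low-level to high-level semantics, so the semantic gains are limited. To address these challenges, the authors propose Dual-level Vision-Language Alignment with Reinforcement Learning gating (DVLA-RL), which consists of Dual-level Semantic Construction (DSC) and RL-gated Attention (RLA). Specifically, DSC conditions LLMs on both class names and support samples to generate discriminative attributes, progressively selects the most relevant ones, and then synthesizes them into coherent class descriptions. This process provides complementary low-level attributes and high-level descriptions, enabling both fine-grained grounding and holistic class understanding. To dynamically integrate the dual-level semantics across the layers of the visual network, RLA formulates cross-modal fusion as a sequential decision process. A lightweight policy trained with episodic REINFORCE adaptively adjusts the contributions of self-attention and cross-attention when integrating textual and visual tokens. As a result, shallow layers refine local attributes and deep layers emphasize global semantics, enabling more precise cross-modal alignment. This yields class-specific discrimination and generalized representations with only a few support samples. DVLA-RL achieves new state-of-the-art performance across nine benchmarks in three diverse FSL scenarios.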
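
The RL-gated attention idea in the abstract (a policy that decides, layer by layer, how much self-attention versus cross-attention to use, trained with REINFORCE) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the discrete `GATE_LEVELS`, the linear-softmax `GatePolicy`, the state built from mean-pooled tokens plus a layer index, the projection-free `attention` helper, and the scalar episode reward with a baseline.

```python
import numpy as np

# Hypothetical discrete gate values: 0.0 = self-attention only, 1.0 = cross-attention only.
GATE_LEVELS = np.array([0.0, 0.25, 0.5, 0.75, 1.0])


def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def attention(queries, keys, values):
    """Scaled dot-product attention (single head, no learned projections)."""
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values


class GatePolicy:
    """Linear-softmax policy that picks one gate level per layer (assumed form)."""

    def __init__(self, state_dim, rng, lr=0.1):
        self.W = np.zeros((len(GATE_LEVELS), state_dim))
        self.rng = rng
        self.lr = lr

    def act(self, state):
        probs = softmax(self.W @ state)
        action = self.rng.choice(len(GATE_LEVELS), p=probs)
        # Gradient of log pi(action | state) with respect to W for a linear-softmax policy.
        one_hot = np.eye(len(GATE_LEVELS))[action]
        grad_logp = np.outer(one_hot - probs, state)
        return action, grad_logp

    def reinforce_update(self, grads, reward, baseline):
        # REINFORCE: step along (reward - baseline) * grad log pi for each decision.
        for g in grads:
            self.W += self.lr * (reward - baseline) * g


def gated_forward(visual, text, policy, num_layers=3):
    """Run num_layers fusion steps; each step samples a gate from the policy."""
    x, gates, grads = visual, [], []
    for layer in range(num_layers):
        # State: mean-pooled visual tokens plus a normalized layer index.
        state = np.concatenate([x.mean(axis=0), [layer / num_layers]])
        action, grad_logp = policy.act(state)
        gate = GATE_LEVELS[action]
        self_out = attention(x, x, x)         # visual tokens attend to themselves
        cross_out = attention(x, text, text)  # visual tokens attend to text tokens
        x = (1.0 - gate) * self_out + gate * cross_out
        gates.append(gate)
        grads.append(grad_logp)
    return x, gates, grads
```

In an episodic setting, one would call `gated_forward` once per few-shot episode, compute a reward from the episode (for example, query-set accuracy, which is again an assumption), and pass the stored gradients to `reinforce_update`. A discrete gate is used here only because it keeps the log-probability gradient simple; a continuous gate would need a different policy distribution.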

Related papers