Fugu-MT Paper Translation (Abstract): Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

Paper overview: Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning

arxiv url: http://arxiv.org/abs/2605.10765v1
Date: Mon, 11 May 2026 15:59:06 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.962443
Title: Dynamic Cross-Modal Prompt Generation for Multimodal Continual Instruction Tuning
Authors: Tao Hu, Da-Wei Zhou
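Abstract summary: DRAPE is a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.
Reference score (attention score computed independently by this site): 13.499744113926505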
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Large Language Models (MLLMs) achieve strong performance through instruction tuning, yet real-world deployment often requires continual capability expansion across sequential tasks. In such scenarios, Multimodal Continual Instruction Tuning (MCIT) aims to acquire new capabilities while limiting catastrophic forgetting. Existing methods mainly follow a module-composition paradigm: they maintain task-level prompts or LoRA experts and dynamically route or aggregate a subset of them at inference. However, samples within the same task can still differ substantially in visual scenes, question intents, and reasoning demands. This motivates instance-level adaptation to individual query-image pairs rather than only selecting or combining task-level modules. To this end, we propose DRAPE (Dynamic Cross-Modal Prompt Generation), a prompt-learning framework that synthesizes continuous instance-specific soft prompts for MCIT. Instead of selecting prompts from a fixed pool, DRAPE derives prompt queries from the textual instruction and cross-attends to visual patch features, producing query-image conditioned prompts that are prepended to the frozen LLM. To mitigate forgetting during sequential updates, DRAPE applies null-space gradient projection to the shared projector and uses CLIP-based prototype routing for task-label-free generator selection at inference. Extensive experiments on MCIT benchmarks show that DRAPE achieves state-of-the-art performance among representative prompt-based and LoRA-based continual-learning baselines.
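The two mechanisms named near the end of the abstract, null-space gradient projection and prototype routing, can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how the null space is estimated or which similarity measure the router uses, so the sketch assumes an exact null space built by Gram-Schmidt over stored old-task features, and cosine similarity against one prototype vector per task. All function names, the toy vectors, and the task ids are hypothetical; in practice the embeddings would come from a CLIP encoder.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for u in basis:
            c = dot(w, u)
            w = [wi - c * ui for wi, ui in zip(w, u)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already in the span
            basis.append([wi / norm for wi in w])
    return basis


def project_to_null_space(grad_rows, old_features):
    """Remove from each gradient row its component in span(old_features).

    Afterwards every row g satisfies g . x = 0 for each old feature x,
    so the update W <- W - lr * G leaves W @ x unchanged for old inputs.
    """
    basis = orthonormal_basis(old_features)
    projected = []
    for g in grad_rows:
        r = list(g)
        for u in basis:
            c = dot(r, u)
            r = [ri - c * ui for ri, ui in zip(r, u)]
        projected.append(r)
    return projected


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def route_to_generator(embedding, prototypes):
    """Pick the task id whose prototype is most similar to `embedding`."""
    return max(prototypes, key=lambda task: cosine(embedding, prototypes[task]))


# Toy data (assumed): two stored old-task features and one gradient row.
old_features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
grad = [[0.5, -2.0, 3.0]]
safe_grad = project_to_null_space(grad, old_features)

# Toy prototypes (assumed): one vector per previously seen task.
prototypes = {"task_a": [1.0, 0.0], "task_b": [0.0, 1.0]}
chosen = route_to_generator([0.2, 0.9], prototypes)
```

In this toy case the old features span the first two coordinates, so only the third component of the gradient survives the projection, and the query embedding is routed to `task_b` without any task label being supplied.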
