Fugu-MT Paper Translation (Abstract): MAny: Merge Anything for Multimodal Continual Instruction Tuning

Paper summary: MAny: Merge Anything for Multimodal Continual Instruction Tuning

arxiv url: http://arxiv.org/abs/2604.14016v1
Date: Wed, 15 Apr 2026 15:57:23 GMT
Status: Translation complete
Last updated in system: 2026-04-16 20:38:32.621775
Title: MAny: Merge Anything for Multimodal Continual Instruction Tuning
Title (reference translation): MAny: Merging for Multimodal Instruction Tuning
Authors: Zijian Gao, Wangwang Jia, Xingxing Zhang, Pengfei Qian, Tao Sun, Bo Ding, Yong Dou, Huaimin Wang, Kele Xu
Abstract summary: MAny (Merge Anything) is a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM).
Reference score (independently computed attention score): 52.50936513604062
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal Continual Instruction Tuning (MCIT) is essential for sequential task adaptation of Multimodal Large Language Models (MLLMs) but is severely restricted by catastrophic forgetting. While existing literature focuses on the reasoning language backbone, in this work, we expose a critical yet neglected dual-forgetting phenomenon across both perception drift in Cross-modal Projection Space and reasoning collapse in Low-rank Parameter Space. To resolve this, we present MAny (Merge Anything), a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). Specifically, CPM recovers perceptual alignment by adaptively merging cross-modal visual representations via visual-prototype guidance, ensuring accurate feature recovery during inference. Simultaneously, LPM eliminates mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. By leveraging recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny operates as a training-free paradigm that achieves knowledge merging via efficient CPU-based algebraic operations, eliminating additional gradient-based optimization beyond initial tuning. Our extensive evaluations confirm the superior performance and robustness of MAny across multiple MLLMs and benchmarks. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57% and 2.85% in final average accuracy over state-of-the-art methods across two different MLLMs, respectively.
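Abstract (reference translation): Multimodal Continual Instruction Tuning (MCIT) is essential for the sequential task adaptation of Multimodal Large Language Models (MLLMs). Whereas the existing literature focuses on the reasoning language backbone, this work exposes a critical yet neglected dual-forgetting phenomenon spanning both perception drift in the cross-modal projection space and reasoning collapse in the low-rank parameter space. To resolve this, we present MAny, a framework that merges task-specific knowledge through Cross-modal Projection Merging (CPM) and Low-rank Parameter Merging (LPM). Specifically, CPM recovers perceptual alignment by adaptively merging visual representations under visual-prototype guidance, ensuring accurate feature recovery during inference. At the same time, LPM removes mutual interference among task-specific low-rank modules by recursively merging low-rank weight matrices. Using recursive least squares, LPM provides a closed-form solution that mathematically guarantees an optimal fusion trajectory for reasoning stability. Notably, MAny operates as a training-free paradigm that achieves knowledge merging through efficient CPU-based algebraic operations, so no gradient-based optimization is needed beyond the initial tuning. Evaluations across multiple MLLMs and benchmarks confirm the superior performance and robustness of MAny. Specifically, on the UCIT benchmark, MAny achieves significant leads of up to 8.57% and 2.85% in final average accuracy over state-of-the-art methods across two different MLLMs.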
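
The abstract credits LPM's closed-form, gradient-free merging to recursive least squares (RLS). As a rough illustration of that general idea only, the sketch below implements textbook RLS for ridge regression in plain Python: each new sample is folded into the running solution using algebraic updates, and the result satisfies the same normal equations as a batch ridge fit. This is not the paper's LPM procedure; the abstract does not say how low-rank weight matrices are mapped onto an RLS problem, and every name here (`rls_init`, `rls_update`, `normal_equation_residual`, `SAMPLES`) is a hypothetical chosen for the example.

```python
# Textbook recursive least squares (RLS) for ridge regression.
# Illustrative assumption only: this is NOT the LPM algorithm from the paper.

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def matvec(M, v):
    return [dot(row, v) for row in M]

def rls_init(dim, ridge=1.0):
    """Start from w = 0 and P = (ridge * I)^-1."""
    w = [0.0] * dim
    P = [[(1.0 / ridge if i == j else 0.0) for j in range(dim)]
         for i in range(dim)]
    return w, P

def rls_update(w, P, x, y):
    """Fold one sample (x, y) into the solution without revisiting old data."""
    Px = matvec(P, x)                 # P stays symmetric, so P x == (x^T P)^T
    denom = 1.0 + dot(x, Px)
    gain = [v / denom for v in Px]
    err = y - dot(w, x)               # prediction error before the update
    w = [wi + gi * err for wi, gi in zip(w, gain)]
    n = len(x)
    # Sherman-Morrison rank-one update of the inverse matrix.
    P = [[P[i][j] - gain[i] * Px[j] for j in range(n)] for i in range(n)]
    return w, P

def normal_equation_residual(w, samples, ridge=1.0):
    """Largest entry of |(X^T X + ridge*I) w - X^T y|; near 0 means batch-optimal."""
    res = [ridge * wi for wi in w]
    for x, y in samples:
        pred_err = dot(w, x) - y
        for i in range(len(w)):
            res[i] += x[i] * pred_err
    return max(abs(r) for r in res)

# Made-up toy data, used only to exercise the functions above.
SAMPLES = [
    ([1.0, 0.0, 2.0], 3.0),
    ([0.0, 1.0, 1.0], 2.0),
    ([1.0, 1.0, 0.0], 1.0),
    ([2.0, 0.5, 1.0], 4.0),
]

w, P = rls_init(3, ridge=1.0)
for x, y in SAMPLES:
    w, P = rls_update(w, P, x, y)
```

After the loop, `normal_equation_residual(w, SAMPLES)` should be close to zero up to floating-point error, which is the sense in which a recursive update can reproduce a closed-form batch solution without gradient steps.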

Related papers