Fugu-MT Paper Translation (Abstract): SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models

Paper overview: SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models

arxiv url: http://arxiv.org/abs/2603.21584v1
Date: Mon, 23 Mar 2026 05:24:52 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.500481
Title: SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models
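Title (reference translation): SSAM: Singular Subspace Alignment for Merging Multimodal Large Language Models
Authors: Md Kaykobad Reza, Ameya Patil, Edward Ayrapetian, M. Salman Asif
Abstract summary: We propose SSAM (Singular Subspace Alignment and Merging), a training-free model merging framework. SSAM unifies independently trained specialist MLLMs into a single model that can handle any combination of input modalities. Without using any multimodal training data, SSAM achieves state-of-the-art performance on four datasets.
Reference score (independently computed attention score): 15.489426398410346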
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models (MLLMs) achieve strong performance by jointly processing inputs from multiple modalities, such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and substantial computational resources. Since many pretrained MLLMs (e.g., vision-language or audio-language) are publicly available, we ask whether we can merge them into a single MLLM that can handle multiple modalities. Merging MLLMs with different input modalities remains challenging, partly because of differences in the learned representations and interference between their parameter spaces. To address these challenges, we propose Singular Subspace Alignment and Merging (SSAM), a training-free model merging framework that unifies independently trained specialist MLLMs into a single model capable of handling any combination of input modalities. SSAM maintains modality-specific parameter updates separately, identifies a shared low-rank subspace for language-related parameter updates, aligns them within this subspace, and merges them to preserve complementary knowledge while minimizing parameter interference. Without using any multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models. These results demonstrate that aligning models in parameter space provides a scalable and resource-efficient alternative to conventional joint multimodal training.
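Abstract (reference translation): Multimodal large language models (MLLMs) achieve high performance by jointly processing inputs from multiple modalities such as vision, audio, and language. However, building such models or extending them to new modalities often requires large paired datasets and considerable computational resources. Because many pretrained MLLMs (for example, vision-language or audio-language models) are publicly available, we ask whether they can be merged into a single MLLM that handles multiple modalities. Merging MLLMs with different input modalities remains difficult because of differences in learned representations and interference between parameter spaces. To address these challenges, we propose SSAM (Singular Subspace Alignment and Merging), a training-free model merging framework that unifies independently trained MLLMs into a single model able to handle any combination of inputs. SSAM keeps modality-specific parameter updates separate, identifies a shared low-rank subspace for language-related parameter updates, aligns the updates within that subspace, and merges them so as to preserve complementary knowledge while minimizing parameter interference. Without using multimodal training data, SSAM achieves state-of-the-art performance across four datasets, surpassing prior training-free merging methods and even jointly trained multimodal models. These results show that aligning models in parameter space offers a scalable, resource-efficient alternative to conventional joint multimodal training.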
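
Illustrative sketch (not from the paper): the abstract describes the merging step only at a high level, so the NumPy code below is a minimal guess at what "identify a shared low-rank subspace, align, and merge" could look like for a single language-related weight matrix. Everything specific here is an assumption: the function names `shared_subspace` and `merge_language_weights`, the use of a truncated SVD over the concatenated parameter updates to find the subspace, orthogonal projection as the alignment step, and summation as the merge rule. The actual SSAM procedure may differ in each of these choices.

```python
import numpy as np


def shared_subspace(deltas, rank):
    """Return an orthonormal basis (d_out x rank) for a subspace shared by the updates.

    Assumption: the shared subspace is taken to be the span of the top-`rank`
    left singular vectors of the updates concatenated along the input axis.
    """
    stacked = np.concatenate(deltas, axis=1)  # shape: (d_out, n_models * d_in)
    u, _, _ = np.linalg.svd(stacked, full_matrices=False)
    return u[:, :rank]


def merge_language_weights(w_base, expert_weights, rank):
    """Merge expert weights that were fine-tuned from a common base matrix.

    Assumption: each update is projected onto the shared subspace and the
    projected updates are summed onto the base weights.
    """
    deltas = [w - w_base for w in expert_weights]  # per-expert parameter updates
    basis = shared_subspace(deltas, rank)
    projected = [basis @ (basis.T @ d) for d in deltas]  # keep only the shared directions
    return w_base + sum(projected)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_base = rng.standard_normal((8, 6))  # hypothetical base language weights
    w_vision = w_base + 0.1 * rng.standard_normal((8, 6))  # hypothetical vision-language expert
    w_audio = w_base + 0.1 * rng.standard_normal((8, 6))  # hypothetical audio-language expert
    merged = merge_language_weights(w_base, [w_vision, w_audio], rank=3)
    print(merged.shape)
```

In this sketch only the language-related matrix is merged. Modality-specific parameters (for example, each expert's encoder or projector) would be kept as they are, matching the abstract's statement that modality-specific updates are maintained separately.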
