Fugu-MT Paper Translation (Abstract): DiM\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

Paper summary: DiM\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

arxiv url: http://arxiv.org/abs/2605.12960v1
Date: Wed, 13 May 2026 03:50:54 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.794114
Title: DiM\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
Title (reference translation): DiM\textsuperscript{3}: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging
Authors: Zijing Wang, Mingyang Wang, Ercong Nie, Yongkang Liu, Shi Feng, Mengjie Zhao, Daling Wang, Xiaocui Yang, Hinrich Schütze
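Abstract summary: We propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3). Experiments on multilingual benchmarks in text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines.
Reference score (attention score computed independently by this site): 60.709970092170074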
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.
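Abstract (reference translation): Toward more general and human-like intelligence, large language models should integrate both multilingual and multimodal capabilities. However, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative that injects multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. In experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. DiM3 can also be applied directly to already trained multilingual multimodal models and still yields additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is at https://github.com/wzj1718/DiM3.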
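
Illustrative sketch (not from the paper): the abstract says the two residual updates are composed selectively at each parameter dimension, but it does not state the rule. The Python example below shows one possible element-wise merge that looks at both the direction (sign) and the magnitude of each update. The function name `merge_updates`, the `ratio` threshold, and the selection rule itself are assumptions made for illustration; the actual DiM3 procedure is in the authors' repository. Only language-backbone tensors would be passed through such a function, since the vision encoder and multimodal projector are kept unchanged.

```python
import numpy as np

def merge_updates(base, multimodal, multilingual, ratio=2.0):
    """Hypothetical element-wise merge of two residual updates.

    base, multimodal, multilingual: same-shaped weights of one backbone
    tensor from the shared base LM, the multimodal model, and the
    multilingual model. `ratio` is an assumed dominance threshold.
    """
    tau_mm = multimodal - base    # multimodal residual update
    tau_ml = multilingual - base  # multilingual residual update

    same_direction = np.sign(tau_mm) == np.sign(tau_ml)
    ml_larger = np.abs(tau_ml) >= np.abs(tau_mm)
    ml_dominates = np.abs(tau_ml) > ratio * np.abs(tau_mm)

    # Assumed rule: when directions agree, keep the larger update;
    # when they conflict, keep the multimodal update unless the
    # multilingual one is clearly larger in magnitude.
    agree = np.where(ml_larger, tau_ml, tau_mm)
    conflict = np.where(ml_dominates, tau_ml, tau_mm)
    return base + np.where(same_direction, agree, conflict)

base = np.zeros(4)
multimodal = np.array([0.2, -0.3, 0.1, 0.0])
multilingual = np.array([0.5, 0.1, -0.4, 0.3])
merged = merge_updates(base, multimodal, multilingual, ratio=2.0)
print(merged)
```

In this toy input, the first dimension agrees in sign, the second and third conflict, and the fourth is untouched by the multimodal update, so each branch of the assumed rule is exercised once.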
