Fugu-MT Paper Translation (Abstract): Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

Paper abstract: Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance

arxiv url: http://arxiv.org/abs/2605.01325v1
Date: Sat, 02 May 2026 08:42:13 GMT
Status: Translation complete
System update date: 2026-05-05 20:33:49.707489
Title: Rethinking Model Selection in VLM Through the Lens of Gromov-Wasserstein Distance
Authors: Muyang Li, Yucheng Liu, Jianbo Ma, Elliot Osborne, Bo Han, Tongliang Liu
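Abstract summary: Vision-Language Models (VLMs) have extended traditional LLMs with visual capabilities through the integration of vision encoders. We show that common practices, such as choosing the encoder with the largest size or the highest zero-shot accuracy, consistently fail to identify the optimal model. Which factors of a vision encoder matter in a VLM?
Reference score (independently computed attention score): 65.36257254806647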
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) have enhanced traditional LLMs with visual capabilities through the integration of vision encoders. While recent works have explored various combinations of vision encoders and LLMs, a principled understanding of what makes a vision encoder suitable for VLM alignment is still lacking. In this paper, we systematically investigate this question via comprehensive experiments on a curated collection of 19 pre-trained vision encoders from diverse sources. We first demonstrate that common practices, such as choosing encoders with the largest size or highest zero-shot accuracy, consistently fail to identify optimal models. In fact, these metrics show only weak to moderate correlation with VLM performance. This intriguing finding raises a fundamental question: What factors of vision encoders matter in VLMs? Through comprehensive analysis, we identify that the structural similarity across modalities plays a crucial but previously overlooked role in vision-encoder selection, which we measure using the Gromov-Wasserstein distance as a proxy. From a theoretical perspective, we show that the learnability of the cross-modality mapping can be provably associated with the Gromov-Wasserstein distance. Empirical verification on 60+ full VLM training runs shows that our proposed inference-only metric performs significantly better than alternative model selection strategies and exhibits a much stronger correlation with final VLM performance, thereby enabling efficient and effective prediction of VLM performance before full training.
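For readers unfamiliar with the metric named in the abstract, the standard discrete Gromov-Wasserstein objective with a squared loss can be written as follows. This is the textbook form, given only as a reference sketch: the abstract does not state which loss, dissimilarity matrices, or normalization the authors use, so the symbols below (C^X, C^Y, p, q, T) are assumptions rather than the paper's notation, and some conventions additionally take a square root or include a factor of 1/2.

```latex
\[
\mathrm{GW}(C^X, C^Y, p, q)
  = \min_{T \in \Pi(p, q)}
    \sum_{i,k=1}^{n} \sum_{j,l=1}^{m}
    \left( C^X_{ik} - C^Y_{jl} \right)^2 T_{ij} \, T_{kl},
\qquad
\Pi(p, q) = \left\{ T \in \mathbb{R}_{\ge 0}^{n \times m} :
  T \mathbf{1}_m = p,\; T^\top \mathbf{1}_n = q \right\}.
\]
```

Here C^X and C^Y hold pairwise dissimilarities within each space (for example, among n vision embeddings and among m text embeddings), and p and q are probability weights on the points. Because only within-space distances are compared, the two spaces need not share a dimension; a small value means that some soft matching T nearly preserves pairwise structure across the two spaces.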
