Fugu-MT Paper Translation (Summary): One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling

Paper summary: One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling

arxiv url: http://arxiv.org/abs/2606.08126v1
Date: Sat, 06 Jun 2026 12:10:33 GMT
Status: Translation complete
System update date: 2026-06-09 14:42:05.858614
Title: One Stone, Three Birds: Self-adaptive Optimal Transport for Multi-VLM Selection, Adaptation, and Ensembling
Authors: Qiyu Xu, Zhanxuan Hu, Yu Duan, Yonghang Tai, Huafeng Li, Quanxue Gao, Xiangyong Cao
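Abstract summary: Vision-language models (VLMs) enable visual recognition from semantic class descriptions. Most deployment pipelines select a single VLM and adapt that model to an unlabeled target set. This single-backbone paradigm hides a critical assumption: that the selected VLM is already compatible with the target domain. We propose One Stone, Three Birds, a training-free framework based on self-adaptive optimal transport.
Reference score (independently computed attention score): 42.03768283063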
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision-language models (VLMs) enable visual recognition from semantic class descriptions, which makes them attractive when target annotations are scarce or unavailable. Most deployment pipelines, however, first choose a single VLM and then adapt that model to the unlabeled target set. This single-backbone paradigm hides a critical assumption: the selected VLM is already compatible with the target domain. In realistic cross-domain deployment, several general-purpose and domain-specialized VLMs may be plausible, yet no instance-level target labels are available to identify the reliable ones. Deployment therefore requires a coupled solution for model selection, target adaptation, and prediction integration. We revisit this problem from a system-level multi-VLM perspective. Our central observation is that the three decisions above depend on the same latent object: a trustworthy sample-class structure in the target set. Different VLMs may encode different transfer biases and produce conflicting predictions, but their outputs can still provide complementary evidence for estimating this structure. We propose One Stone, Three Birds (OSTB), a training-free framework based on self-adaptive optimal transport. Given a pool of frozen candidate VLMs, OSTB estimates a consensus sample-to-class transport plan without updating VLM parameters. The learned transport structure is then reused for all deployment objectives: model selection is performed by ranking the combined semantic and visual reliability induced by the consensus plan; target adaptation is obtained by fitting transport-conditioned visual classifiers; and ensembling is implemented through reliability-aware probabilistic integration. Extensive experiments on natural-image, remote-sensing, and medical-pathology benchmarks show that OSTB improves model ranking, adaptation stability, and ensemble robustness under heterogeneous candidate pools.
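The abstract describes one consensus sample-to-class transport plan being reused for model selection and ensembling. The following is a minimal Python sketch of that general idea only, not the authors' method: it swaps in standard entropic optimal transport (Sinkhorn iterations) with a balanced-class marginal, and it omits the self-adaptive component and the transport-conditioned classifiers, which the abstract does not specify. All function names, the cost definition (mean negative log-probability across models), the reliability score, and the parameters `eps`, `n_iters`, and `temperature` are hypothetical choices made for illustration.

```python
import numpy as np

def sinkhorn_plan(cost, row_marginal, col_marginal, eps=0.1, n_iters=200):
    """Standard entropic-regularized OT solved with Sinkhorn iterations."""
    # Shifting each row by its minimum leaves the plan unchanged (the row
    # scaling absorbs it) and keeps the exponentials well-conditioned.
    cost = cost - cost.min(axis=1, keepdims=True)
    kernel = np.exp(-cost / eps)
    u = np.ones_like(row_marginal)
    for _ in range(n_iters):
        v = col_marginal / (kernel.T @ u)
        u = row_marginal / (kernel @ v)
    return u[:, None] * kernel * v[None, :]

def consensus_plan(probs, eps=0.1):
    """probs: (M, N, C) class probabilities from M frozen models.

    Assumption: the cost is the mean negative log-probability across
    models, and classes are treated as balanced in the target set.
    """
    _, n_samples, n_classes = probs.shape
    cost = -np.log(probs + 1e-12).mean(axis=0)        # (N, C)
    rows = np.full(n_samples, 1.0 / n_samples)        # one unit of mass per sample
    cols = np.full(n_classes, 1.0 / n_classes)        # hypothetical balanced prior
    return sinkhorn_plan(cost, rows, cols, eps=eps)

def model_reliability(probs, plan):
    """Hypothetical reliability score: negative cross-entropy of each
    model's predictions against the consensus plan (higher is better)."""
    return np.array([(plan * np.log(p + 1e-12)).sum() for p in probs])

def reliability_weighted_ensemble(probs, scores, temperature=1.0):
    """Softmax the reliability scores into weights, then average the
    per-model probabilities with those weights."""
    weights = np.exp((scores - scores.max()) / temperature)
    weights /= weights.sum()
    return np.tensordot(weights, probs, axes=1)       # (N, C)
```

Under these assumptions, ranking the candidates amounts to sorting the output of `model_reliability`, and the same scores feed `reliability_weighted_ensemble`, so a single plan serves both decisions. No model parameters are updated at any point, which matches the training-free setting the abstract describes.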


Related papers