Fugu-MT Paper Translation (Abstract): Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning

Paper summary: Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning

arxiv url: http://arxiv.org/abs/2603.12669v1
Date: Fri, 13 Mar 2026 05:25:12 GMT
Status: Translation complete
Last updated in system: 2026-03-16 17:38:11.922291
Title: Vision Verification Enhanced Fusion of VLMs for Efficient Visual Reasoning
Title (reference translation): Fusion of VLMs Enhanced by Vision Verification, for Efficient Visual Reasoning
Authors: Selim Furkan Tekin, Yichang Xu, Gaowen Liu, Ramana Rao Kompella, Margaret L. Loper, Ling Liu
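Abstract summary: The authors introduce focal error diversity to capture complementary reasoning across Vision-Language Models (VLMs). A Genetic Algorithm is applied to prune component VLMs that add no value to the fusion performance. The V3Fusion approach can produce high-performance dual focal-diversity fused predictions for vision-language reasoning.
Reference score (attention score computed independently by this site): 25.009382887048833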
License: http://creativecommons.org/licenses/by/4.0/
Abstract: With the growing number and diversity of Vision-Language Models (VLMs), many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, we address diverse model selection using both vision and language modalities. We introduce focal error diversity to capture complementary reasoning across VLMs and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. On the ensemble surface constructed from a pool of candidate VLMs, we apply a Genetic Algorithm to effectively prune out those component VLMs that do not add value to the fusion performance. We identify the best combination for each task as well as fuse the outputs of each VLM in the model pool, and show that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. Our V3Fusion approach is capable of producing dual focal-diversity fused predictions with high performance for vision-language reasoning, even when there is no majority consensus or the majority of VLMs make incorrect predictions. Extensive experiments validate V3Fusion on four popular VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM by 8.09% in accuracy on MMMU and by 4.87% on MMMU-Pro. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers on both A-OKVQA and OCR-VQA. Our code and datasets are available at https://github.com/sftekin/v3fusion.
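Abstract (reference translation): As the number and diversity of Vision-Language Models (VLMs) grow, many works explore language-based ensemble, collaboration, and routing techniques across multiple VLMs to improve multi-model reasoning. In contrast, this work addresses diverse model selection using both the vision and language modalities. It introduces focal error diversity to capture complementary reasoning across VLMs, and a CKA-based focal diversity metric (CKA-focal) to measure disagreement in their visual embeddings. A Genetic Algorithm is applied on the ensemble surface built from a pool of candidate VLMs to effectively prune component VLMs that add no value to the fusion performance. The work identifies the best combination for each task, fuses the outputs of each VLM in the model pool, and shows that heterogeneous models can capture epistemic uncertainty dynamically and mitigate hallucinations. The V3Fusion approach can produce high-performance dual focal-diversity fused predictions for vision-language reasoning, even when there is no majority consensus or when the majority of VLMs make incorrect predictions. V3Fusion is validated on four VLM benchmarks (A-OKVQA, MMMU, MMMU-Pro, and OCR-VQA). The results show that V3Fusion outperforms the best-performing VLM by 8.09% on MMMU and by 4.87% on MMMU-Pro. For generative tasks, V3Fusion outperforms Intern-VL2-8b and Qwen2.5-VL-7b, the top-2 VLM performers on both A-OKVQA and OCR-VQA. The code and datasets are available at https://github.com/sftekin/v3fusion.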
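
Note: the abstract names a CKA-based metric for measuring disagreement between the visual embeddings of different VLMs, but does not spell out how CKA-focal is computed. As a rough illustration only, the sketch below implements standard linear CKA (centered kernel alignment) between two embedding matrices and treats `1 - CKA` as a disagreement score. The function names, the `1 - CKA` convention, and the random inputs are assumptions made for this example; they are not taken from the paper or its repository.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Standard linear CKA between two embedding matrices.

    x has shape (n_samples, d1) and y has shape (n_samples, d2); the rows
    must correspond to the same inputs (for example, the same images).
    """
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of rows")
    # Center every feature column so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def cka_disagreement(x: np.ndarray, y: np.ndarray) -> float:
    """Hypothetical disagreement score: 1 minus linear CKA (an assumption)."""
    return 1.0 - linear_cka(x, y)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for visual embeddings of the same 50 images from two models.
    emb_a = rng.normal(size=(50, 8))
    emb_b = rng.normal(size=(50, 5))
    print("CKA(a, a):", linear_cka(emb_a, emb_a))
    print("disagreement(a, b):", cka_disagreement(emb_a, emb_b))
```

Linear CKA is invariant to orthogonal transformations and isotropic scaling of either embedding space, which is why it is a common choice for comparing representations whose dimensions differ across models. How V3Fusion restricts or weights the comparison to obtain its focal variant should be checked against the paper itself.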

Related papers