Fugu-MT Paper Translation (Summary): Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs

Paper summary: Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs

arxiv url: http://arxiv.org/abs/2506.11515v1
Date: Fri, 13 Jun 2025 07:16:41 GMT
Status: Translation complete
Last updated in system: 2025-06-16 17:50:49.687453
Title: Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Title (reference translation): Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Authors: Xiao Xu, Libo Qin, Wanxiang Che, Min-Yen Kan
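Abstract summary: Two-Tower Vision-Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. We propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts.
Reference score (independently computed attention score): 61.903626952650605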
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Two-Tower Vision-Language Models (VLMs) have demonstrated strong performance across various downstream VL tasks. While BridgeTower further enhances performance by building bridges between encoders, it (i) suffers from ineffective layer-by-layer utilization of unimodal representations, (ii) restricts the flexible exploitation of different levels of unimodal semantic knowledge, and (iii) is limited to evaluation on traditional low-resolution datasets with the Two-Tower VLM architecture only. In this work, we propose Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts to facilitate more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, we introduce ManagerTower, a novel VLM that introduces the manager in each cross-modal layer. Whether with or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on 4 downstream VL tasks. Moreover, we extend our exploration to the latest Multimodal Large Language Model (MLLM) architecture. We demonstrate that LLaVA-OV-Manager significantly boosts the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether the multi-grid algorithm is enabled or not. In-depth analysis reveals that both our manager and the multi-grid algorithm can be viewed as a plugin that improves the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and further improve performance. Code and models are available at https://github.com/LooperXX/ManagerTower.
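Abstract (reference translation): Two-Tower Vision-Language Models (VLMs) show strong performance on a variety of downstream VL tasks. BridgeTower improves performance further by building bridges between the encoders, but it (i) makes ineffective layer-by-layer use of unimodal representations, (ii) restricts flexible use of the different levels of unimodal semantic knowledge, and (iii) has been evaluated only on traditional low-resolution datasets and only with the Two-Tower VLM architecture. This paper proposes Manager, a lightweight, efficient and effective plugin that adaptively aggregates insights from different levels of pre-trained unimodal experts in order to support more comprehensive VL alignment and fusion. First, under the Two-Tower VLM architecture, the authors present ManagerTower, a new VLM that places a manager in each cross-modal layer. With or without VL pre-training, ManagerTower outperforms previous strong baselines and achieves superior performance on four downstream VL tasks. The authors then extend the study to the latest Multimodal Large Language Model (MLLM) architecture. They show that LLaVA-OV-Manager significantly improves the zero-shot performance of LLaVA-OV across different categories of capabilities, images, and resolutions on 20 downstream datasets, whether or not the multi-grid algorithm is enabled. In-depth analysis shows that both the manager and the multi-grid algorithm can be regarded as plugins that improve the visual representation by capturing more diverse visual details from two orthogonal perspectives (depth and width). Their synergy can mitigate the semantic ambiguity caused by the multi-grid algorithm and improve performance further. Code and models are available at https://github.com/LooperXX/ManagerTower.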
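
The abstract describes the manager as a plugin that adaptively aggregates representations from different levels (layers) of a pre-trained unimodal encoder. The sketch below is only a rough illustration of that general idea, not the authors' implementation: the function names, the dot-product scoring, and the softmax weighting are all assumptions made for this example, since the abstract does not say how the weights are computed.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_reps, query):
    """Return a weighted sum of per-layer vectors and the weights used.

    ASSUMPTION: dot-product scoring followed by softmax is chosen purely
    for illustration; the abstract does not specify the manager's weighting.
    `layer_reps` holds one feature vector per unimodal layer, and `query`
    stands in for the current cross-modal state (both hypothetical names).
    """
    scores = [sum(q * x for q, x in zip(query, rep)) for rep in layer_reps]
    weights = softmax(scores)
    dim = len(query)
    fused = [
        sum(w * rep[i] for w, rep in zip(weights, layer_reps))
        for i in range(dim)
    ]
    return fused, weights


# Three hypothetical unimodal layers, each yielding a 2-d feature vector.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# A zero query gives every layer the same score, so the weights are uniform.
fused_uniform, weights_uniform = aggregate_layers(layers, query=[0.0, 0.0])

# A query aligned with the first dimension favours layers 0 and 2.
fused_biased, weights_biased = aggregate_layers(layers, query=[10.0, 0.0])
```

The point of the sketch is that the mixing weights depend on the query rather than being fixed, so different cross-modal layers could draw on different unimodal levels. A fixed layer-by-layer pairing, which the abstract attributes to BridgeTower, would correspond to always putting all weight on a single layer.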

Related papers