Fugu-MT Paper Translation (Summary): Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

Paper summary: Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders

arxiv url: http://arxiv.org/abs/2603.06569v1
Date: Fri, 06 Mar 2026 18:58:04 GMT
Status: Translation complete
System update date: 2026-03-09 13:17:46.409934
Title: Penguin-VL: Exploring the Efficiency Limits of VLM with LLM-based Vision Encoders
Authors: Boqiang Zhang, Lei Ke, Ruihan Yang, Qi Gao, Tianyuan Qu, Rossell Chen, Dong Yu, Leoweiliang
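Abstract summary: Vision Language Model (VLM) development has relied heavily on scaling model size. This paper presents Penguin-VL, whose vision encoder is initialized from a text-only LLM. Experiments show that Penguin-Encoder is a superior alternative to traditional contrastive pretraining.
Reference score (independently computed attention measure): 40.81958598891815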
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Vision Language Model (VLM) development has largely relied on scaling model size, which hinders deployment on compute-constrained mobile and edge devices such as smartphones and robots. In this work, we explore the performance limits of compact (e.g., 2B and 8B) VLMs. We challenge the prevailing practice that state-of-the-art VLMs must rely on vision encoders initialized via massive contrastive pretraining (e.g., CLIP/SigLIP). We identify an objective mismatch: contrastive learning, optimized for discrimination, enforces coarse and category-level invariances that suppress fine-grained visual cues needed for dense captioning and complex VLM reasoning. To address this issue, we present Penguin-VL, whose vision encoder is initialized from a text-only LLM. Our experiments reveal that Penguin-Encoder serves as a superior alternative to traditional contrastive pretraining, unlocking a higher degree of visual fidelity and data efficiency for multimodal understanding. Across various image and video benchmarks, Penguin-VL achieves performance comparable to leading VLMs (e.g., Qwen3-VL) in mathematical reasoning and surpasses them in tasks such as document understanding, visual knowledge, and multi-perspective video understanding. Notably, these gains are achieved with a lightweight architecture, demonstrating that improved visual representation rather than model scaling is the primary driver of performance. Our ablations show that Penguin-Encoder consistently outperforms contrastive-pretrained encoders, preserving fine-grained spatial and temporal cues that are critical for dense perception and complex reasoning. This makes it a strong drop-in alternative for compute-efficient VLMs and enables high performance in resource-constrained settings. Code: https://github.com/tencent-ailab/Penguin-VL


Related papers