Fugu-MT Paper Translation (Abstract): OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

Paper overview: OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

arxiv url: http://arxiv.org/abs/2509.01644v1
Date: Mon, 01 Sep 2025 17:38:21 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.804034
Title: OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
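Title (reference translation): OpenVision 2: A family of generatively pretrained visual encoders for multimodal learning
Authors: Yanqing Liu, Xianhang Li, Letian Zhang, Zirui Wang, Zeyu Zheng, Yuyin Zhou, Cihang Xie
Abstract summary: OpenVision's architecture and loss design are simplified to improve training efficiency. OpenVision 2 matches the original model's performance on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. This training efficiency makes it possible to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters.
Reference score (independently computed attention score): 68.04264015433857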
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper provides a simplification on OpenVision's architecture and loss design for enhancing its training efficiency. Following the prior vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs like LLaVA, our changes are straightforward: we remove the text encoder (and therefore the contrastive loss), retaining only the captioning loss as a purely generative training signal. We name this new version OpenVision 2. The initial results are promising: despite this simplification, OpenVision 2 competitively matches the original model's performance on a broad set of multimodal benchmarks while substantially cutting both training time and memory consumption. For example, with ViT-L/14, it reduces training time by about 1.5x (from 83h to 57h), and memory usage by about 1.8x (from 24.5GB to 13.8GB, equivalently allowing the maximum batch size to grow from 2k to 8k). This superior training efficiency also allows us to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. We hold a strong belief that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.
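The abstract says only the captioning loss is retained as the training signal. As a rough illustration, a captioning loss is commonly computed as next-token cross-entropy over the caption tokens. The sketch below assumes that form; the function name `caption_loss` and the toy inputs are hypothetical and are not taken from the paper or its code.

```python
import math

def caption_loss(logits, target_ids):
    """Mean next-token cross-entropy over one caption (illustrative only).

    logits: list of per-position lists of vocabulary scores
    target_ids: list of ground-truth token ids, one per position
    """
    total = 0.0
    for scores, target in zip(logits, target_ids):
        # Numerically stable log-sum-exp over the vocabulary.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of the correct token.
        total += log_z - scores[target]
    return total / len(target_ids)

# Toy check: uniform scores over a 4-token vocabulary give a loss of log(4).
uniform_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(3)]
print(f"{caption_loss(uniform_logits, [0, 1, 2]):.4f}")  # 1.3863
```

In this simplified view, no text encoder or image-text similarity matrix is needed, which is consistent with the reduced memory use the abstract reports.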
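Abstract (reference translation): This paper simplifies OpenVision's architecture and loss design to improve training efficiency. Following the earlier vision-language pretraining works CapPa and AIMv2, as well as modern multimodal designs such as LLaVA, the changes are straightforward: the text encoder (and with it the contrastive loss) is removed, and only the captioning loss is kept as a purely generative training signal. The new version is named OpenVision 2. Despite this simplification, OpenVision 2 is competitive with the original model on a broad set of multimodal benchmarks while substantially cutting training time and memory consumption. For example, with ViT-L/14, training time drops by about 1.5x (from 83h to 57h) and memory usage by about 1.8x (from 24.5GB to 13.8GB), which allows the maximum batch size to grow from 2k to 8k. This training efficiency also makes it possible to scale far beyond the largest vision encoder used in OpenVision, reaching more than 1 billion parameters. The authors strongly believe that this lightweight, generative-only paradigm is compelling for future vision encoder development in multimodal foundation models.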
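The efficiency figures quoted in the abstract can be checked with simple division. The short sketch below only uses the numbers stated above; the variable names are illustrative, and "2k" and "8k" are read here as 2,000 and 8,000 (the ratio is the same if they mean 2,048 and 8,192).

```python
# Figures quoted in the abstract for ViT-L/14.
time_before_h, time_after_h = 83.0, 57.0
mem_before_gb, mem_after_gb = 24.5, 13.8
batch_before, batch_after = 2_000, 8_000  # assumed reading of "2k" and "8k"

time_factor = time_before_h / time_after_h
mem_factor = mem_before_gb / mem_after_gb
batch_factor = batch_after / batch_before

print(f"training time reduction: {time_factor:.2f}x")  # 1.46x, i.e. about 1.5x
print(f"memory reduction: {mem_factor:.2f}x")          # 1.78x, i.e. about 1.8x
print(f"max batch size growth: {batch_factor:.1f}x")   # 4.0x
```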

Related papers