Fugu-MT Paper Translation (Abstract): LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

Paper summary: LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training

arxiv url: http://arxiv.org/abs/2509.23661v2
Date: Thu, 09 Oct 2025 11:54:14 GMT
Status: Translation complete
Last updated in system: 2025-10-10 15:34:28.723099
Title: LLaVA-OneVision-1.5: Fully Open Framework for Democratized Multimodal Training
Title (reference translation): LLaVA-OneVision-1.5: A Fully Open Framework for Democratized Multimodal Training
Authors: Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, Huajie Tan, Chunyuan Li, Jing Yang, Jie Yu, Xiyao Wang, Bin Qin, Yumeng Wang, Zizhen Yan, Ziyong Feng, Ziwei Liu, Bo Li, Jiankang Deng
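Abstract summary: LLaVA-OneVision-1.5 is a new family of LMMs that achieves state-of-the-art performance at significantly reduced computational and financial cost.
Reference score (independently computed attention score): 92.9242035107991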
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present LLaVA-OneVision-1.5, a novel family of Large Multimodal Models (LMMs) that achieve state-of-the-art performance with significantly reduced computational and financial costs. Unlike existing work, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models entirely from scratch. The LLaVA-OneVision-1.5 release comprises three primary components: (1) Large-Scale Curated Datasets: We construct an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 22M instruction dataset, LLaVA-OneVision-1.5-Instruct. (2) Efficient Training Framework: We develop a complete end-to-end efficient training framework that leverages an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art Performance: Experimental results demonstrate that LLaVA-OneVision-1.5 yields highly competitive performance across a broad range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B surpasses Qwen2.5-VL-3B on all 27 benchmarks. We anticipate releasing LLaVA-OneVision-1.5-RL shortly and encourage the community to await further updates.
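Abstract (reference translation): This paper proposes LLaVA-OneVision-1.5, a new family of Large Multimodal Models (LMMs) that achieves state-of-the-art performance at significantly reduced computational and financial cost. Unlike existing work, LLaVA-OneVision-1.5 provides an open, efficient, and reproducible framework for building high-quality vision-language models from scratch. The LLaVA-OneVision-1.5 release consists of three main components. (1) Large-scale curated datasets: an 85M concept-balanced pretraining dataset, LLaVA-OneVision-1.5-Mid-Training, and a meticulously curated 22M instruction dataset, LLaVA-OneVision-1.5-Instruct. (2) Efficient training framework: an end-to-end efficient training framework that leverages an offline parallel data packing strategy to facilitate the training of LLaVA-OneVision-1.5 within a $16,000 budget. (3) State-of-the-art performance: experimental results show that LLaVA-OneVision-1.5 performs very competitively across a wide range of downstream tasks. Specifically, LLaVA-OneVision-1.5-8B outperforms Qwen2.5-VL-7B on 18 of 27 benchmarks, and LLaVA-OneVision-1.5-4B outperforms Qwen2.5-VL-3B on all 27 benchmarks. LLaVA-OneVision-1.5-RL is planned for release soon.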
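The abstract names an offline parallel data packing strategy but does not describe it. As a rough illustration of what sequence packing generally involves, the sketch below groups variable-length samples into fixed-capacity sequences ahead of training, so that fewer tokens are spent on padding. It assumes a first-fit-decreasing heuristic; the function name `pack_first_fit_decreasing`, the sample lengths, and the `max_len` value are all hypothetical, and this is not the authors' implementation.

```python
from typing import List


def pack_first_fit_decreasing(lengths: List[int], max_len: int) -> List[List[int]]:
    """Group sample indices into bins whose total token length is <= max_len.

    Hypothetical helper for illustration only; the paper's actual packing
    algorithm is not described in the abstract.
    """
    # Visit samples from longest to shortest (ties keep their input order).
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins: List[List[int]] = []  # sample indices assigned to each packed sequence
    free: List[int] = []        # remaining token capacity of each packed sequence
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sample {i} has {n} tokens, more than max_len={max_len}")
        # Place the sample in the first existing bin with enough room.
        for b, cap in enumerate(free):
            if n <= cap:
                bins[b].append(i)
                free[b] -= n
                break
        else:
            # No bin had room, so open a new one.
            bins.append([i])
            free.append(max_len - n)
    return bins


# Made-up token counts for six samples and a made-up context length.
lengths = [700, 300, 900, 100, 500, 200]
max_len = 1024
packed = pack_first_fit_decreasing(lengths, max_len)

# Fraction of packed capacity holding real tokens rather than padding.
utilization = sum(lengths) / (len(packed) * max_len)
print(packed, round(utilization, 3))
```

Because the assignment depends only on sample lengths, it can be computed once before training starts, which is one plausible reading of "offline". How the paper parallelizes this step is not stated in the abstract.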
