Fugu-MT Paper Translation (Abstract): Ovis2.5 Technical Report

Paper overview: Ovis2.5 Technical Report

arxiv url: http://arxiv.org/abs/2508.11737v1
Date: Fri, 15 Aug 2025 17:01:08 GMT
Status: Translation complete
Last updated in system: 2025-08-19 14:49:10.350509
Title: Ovis2.5 Technical Report
Title (reference translation): Ovis2.5 Technical Report
Authors: Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, Yuxuan Han, Haijun Li, Wanying Chen, Junke Tang, Chengkun Hou, Zhixing Du, Tianli Zhou, Wenjie Zhang, Huping Ding, Jiahe Li, Wen Li, Gui Hu, Yiliang Gu, Siran Yang, Jiamang Wang, Hailong Sun, Yibo Wang, Hui Sun, Jinlong Huang, Yuping He, Shengze Shi, Weihong Zhang, Guodong Zheng, Junpeng Jiang, Sensen Gao, Yi-Feng Wu, Sijia Chen, Yuhui Chen, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang
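Abstract summary: We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native resolution. We train the model to move beyond linear chain-of-thought and perform reflection.
Reference score (independently computed attention score): 43.715004002753716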
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present Ovis2.5, a successor to Ovis2 designed for native-resolution visual perception and strong multimodal reasoning. Ovis2.5 integrates a native-resolution vision transformer that processes images at their native, variable resolutions, avoiding the degradation from fixed-resolution tiling and preserving both fine detail and global layout -- crucial for visually dense content like complex charts. To strengthen reasoning, we train the model to move beyond linear chain-of-thought and perform reflection -- including self-checking and revision. This advanced capability is exposed as an optional "thinking mode" at inference time, allowing users to trade latency for enhanced accuracy on difficult inputs. The model is trained via a comprehensive five-phase curriculum that progressively builds its skills. The process begins with foundational visual and multimodal pretraining, advances through large-scale instruction tuning, and culminates in alignment and reasoning enhancement using DPO and GRPO. To scale these upgrades efficiently, we employ multimodal data packing and hybrid parallelism, yielding a significant end-to-end speedup. We release two open-source models: Ovis2.5-9B and Ovis2.5-2B. The latter continues the "small model, big performance" philosophy of Ovis2, making it ideal for resource-constrained, on-device scenarios. On the OpenCompass multimodal leaderboard, Ovis2.5-9B averages 78.3, marking a substantial improvement over its predecessor, Ovis2-8B, and achieving state-of-the-art results among open-source MLLMs in the sub-40B parameter range; Ovis2.5-2B scores 73.9, establishing SOTA for its size. Beyond aggregate scores, Ovis2.5 achieves leading results on STEM benchmarks, exhibits strong capabilities on grounding and video tasks, and achieves open-source SOTA at its scale for complex chart analysis.
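The abstract names multimodal data packing as one source of the training speedup but does not describe the scheme. As a rough illustration only, the sketch below packs variable-length samples into fixed-capacity training sequences with a generic first-fit-decreasing heuristic, which reduces the number of padded token slots. The function names `pack_samples` and `padding_fraction`, the token counts, and the 2048-token capacity are all hypothetical and are not taken from the paper.

```python
from typing import List


def pack_samples(lengths: List[int], capacity: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing of sample indices.

    Each returned inner list is one packed training sequence whose samples
    total at most `capacity` tokens. This is a generic heuristic used for
    illustration; it is an assumption, not the paper's packing method.
    """
    if any(n > capacity for n in lengths):
        raise ValueError("a sample is longer than the packing capacity")

    # Visit samples from longest to shortest so large ones are placed first.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)

    bins: List[List[int]] = []  # sample indices per packed sequence
    free: List[int] = []        # remaining token budget per packed sequence

    for i in order:
        for b, room in enumerate(free):
            if lengths[i] <= room:
                bins[b].append(i)
                free[b] -= lengths[i]
                break
        else:
            # No existing sequence has room, so open a new one.
            bins.append([i])
            free.append(capacity - lengths[i])
    return bins


def padding_fraction(lengths: List[int], n_sequences: int, capacity: int) -> float:
    """Fraction of token slots left as padding across `n_sequences` sequences."""
    if n_sequences == 0:
        return 0.0
    return 1.0 - sum(lengths) / (n_sequences * capacity)


# Hypothetical per-sample token counts (visual plus text tokens).
lengths = [900, 300, 1200, 150, 700, 800]
capacity = 2048

packed = pack_samples(lengths, capacity)
unpacked_waste = padding_fraction(lengths, len(lengths), capacity)
packed_waste = padding_fraction(lengths, len(packed), capacity)
print(packed, round(unpacked_waste, 3), round(packed_waste, 3))
```

With one sample per sequence, every sample is padded up to the full capacity. Packing places several samples in one sequence, so fewer sequences are needed for the same number of real tokens. A real training pipeline would also need attention masks that stop packed samples from attending to each other, which this sketch leaves out.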

Related papers