Fugu-MT Paper Translation (Abstract): Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

Paper overview: Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context

arxiv url: http://arxiv.org/abs/2605.13831v1
Date: Wed, 13 May 2026 17:52:53 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:28.219993
Title: Training Long-Context Vision-Language Models Effectively with Generalization Beyond 128K Context
Title (reference translation): Training Long-Context Vision-Language Models with Generalization Beyond 128K Context
Authors: Zhaowei Wang, Lishu Luo, Haodong Duan, Weiwei Liu, Sijin Wu, Ji Luo, Shen Yan, Shuai Peng, Sihang Yuan, Chaoyi Huang, Yi Lin, Yangqiu Song
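Abstract summary: This paper presents a systematic study of long-context continued pre-training for LVLMs. It first shows that long-document VQA is substantially more effective than OCR transcription. MMProLong is obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget.
Reference score (independently computed attention level): 64.09777482878083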
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management across long-document understanding, video analysis, and multi-turn tool use in agentic workflows. Yet practical training recipes remain insufficiently explored, particularly for designing and balancing long-context data mixtures. In this work, we present a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. We first show that long-document VQA is substantially more effective than OCR transcription. Building on this observation, our ablations further yield three key findings: i) for sequence-length distribution, balanced data outperforms target-length-focused data (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, favoring retrieval-heavy mixtures with modest reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need for short-data mixing. Based on these findings, we introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, our study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.
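Abstract (reference translation): Long-context modeling is becoming a core capability of modern large vision-language models (LVLMs), enabling sustained context management for long-document understanding, video analysis, and multi-turn tool use in agentic workflows. However, practical training recipes remain underexplored, particularly for designing and balancing long-context data mixtures. This work presents a systematic study of long-context continued pre-training for LVLMs, extending a 7B model from 32K to 128K context with extensive ablations on long-document data. It first shows that long-document VQA is substantially more effective than OCR transcription. Building on this observation, the ablations yield three further key findings: i) for the sequence-length distribution, balanced data outperforms data focused on the target length (e.g., 128K), suggesting that long-context ability requires generalizable key-information retrieval across various lengths and positions; ii) retrieval remains the primary bottleneck, which favors retrieval-heavy mixtures with a modest amount of reasoning data for task diversity; and iii) pure long-document VQA largely preserves short-context capabilities, suggesting that instruction-formatted long data reduces the need to mix in short data. Based on these findings, the authors introduce MMProLong, obtained by long-context continued pre-training from Qwen2.5-VL-7B with only a 5B-token budget. MMProLong improves long-document VQA scores by 7.1% and maintains strong performance at 256K and 512K contexts, beyond its 128K training window, without additional training. It further generalizes to webpage-based multimodal needle retrieval, long-context vision-text compression, and long-video understanding without task-specific supervision. Overall, the study establishes a practical LongPT recipe and an empirical foundation for advancing long-context vision-language models.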
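Finding i) contrasts a balanced sequence-length distribution with data concentrated at the target length. The abstract does not say how the balance is computed, so the following Python is only a minimal sketch of one possible reading: split examples into length buckets and give each bucket an equal share of a token budget. The bucket edges `EDGES`, the helper names `bucket_index` and `balanced_mixture`, and the equal-token-share rule are all assumptions made for illustration, not the authors' recipe.

```python
import random

# Hypothetical bucket upper edges in tokens; the abstract does not give the paper's bins.
EDGES = [32_000, 64_000, 128_000]


def bucket_index(length, edges=EDGES):
    """Return the index of the first bucket whose upper edge covers `length`."""
    for i, edge in enumerate(edges):
        if length <= edge:
            return i
    return None  # longer than the largest edge: dropped in this sketch


def balanced_mixture(lengths, token_budget, edges=EDGES, seed=0):
    """Pick example indices so each length bucket gets an equal token share.

    Assumption: "balanced" means equal tokens per bucket. Balancing by
    example count would be another reasonable reading.
    """
    rng = random.Random(seed)
    buckets = [[] for _ in edges]
    for idx, n_tokens in enumerate(lengths):
        b = bucket_index(n_tokens, edges)
        if b is not None:
            buckets[b].append(idx)

    per_bucket = token_budget // len(edges)
    chosen = []
    for members in buckets:
        rng.shuffle(members)
        used = 0
        for idx in members:
            # Skip any example that would overshoot this bucket's share.
            if used + lengths[idx] > per_bucket:
                continue
            chosen.append(idx)
            used += lengths[idx]
    return chosen


if __name__ == "__main__":
    # Toy corpus: token counts only, no real documents.
    toy_lengths = [10_000] * 50 + [50_000] * 50 + [100_000] * 50
    picked = balanced_mixture(toy_lengths, token_budget=600_000)
    print(len(picked), "examples selected")
```

A target-length-focused mixture would instead spend the whole budget on the last bucket. The sketch only makes the contrast concrete and says nothing about which choice performs better; that is the paper's empirical claim.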

Related papers