Fugu-MT Paper Translation (Abstract): Kwai Keye-VL 1.5 Technical Report

Paper overview: Kwai Keye-VL 1.5 Technical Report

arxiv url: http://arxiv.org/abs/2509.01563v1
Date: Mon, 01 Sep 2025 15:46:58 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.756195
Title: Kwai Keye-VL 1.5 Technical Report
Authors: Biao Yang, Bin Wen, Boyang Ding, Changyi Liu, Chenglong Chu, Chengru Song, Chongling Rao, Chuan Yi, Da Li, Dunju Zang, Fan Yang, Guorui Zhou, Guowang Zhang, Han Shen, Hao Peng, Haojie Ding, Hao Wang, Hengrui Ju, Jiaming Huang, Jiangxia Cao, Jiankang Chen, Jingyun Hua, Kaibing Chen, Kaiyu Jiang, Kaiyu Tang, Kun Gai, Muhao Wei, Qiang Wang, Ruitao Wang, Sen Na, Shengnan Zhang, Siyang Mao, Sui Huang, Tianke Zhang, Tingting Gao, Wei Chen, Wei Yuan, Xiangyu Wu, Xiao Hu, Xingyu Lu, Yi-Fan Zhang, Yiping Yang, Yulong Chen, Zeyi Lu, Zhenhua Wu, Zhixin Ling, Zhuoran Yang, Ziming Li, Di Xu, Haixuan Gao, Hang Li, Jing Wang, Lejian Ren, Qigen Hu, Qianqian Wang, Shiyao Wang, Xinchen Luo, Yan Li, Yuhang Hu, Zixing Zhang
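Abstract summary: This paper presents Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, it introduces a Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity. Second, it proposes a four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens. Third, it develops a comprehensive post-training pipeline focused on reasoning enhancement and human preference alignment.
Reference score (attention level computed independently by this site): 91.29118808371992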
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In recent years, the development of Large Language Models (LLMs) has significantly advanced, extending their capabilities to multimodal tasks through Multimodal Large Language Models (MLLMs). However, video understanding remains a challenging area due to the dynamic and information-dense nature of videos. Existing models struggle with the trade-off between spatial resolution and temporal coverage when processing video content. We present Keye-VL-1.5, which addresses fundamental challenges in video comprehension through three key innovations. First, we introduce a novel Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity, processing key frames with significant visual changes at higher resolution (Slow pathway) while handling relatively static frames with increased temporal coverage at lower resolution (Fast pathway). Second, we implement a progressive four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K tokens, enabling processing of longer videos and more complex visual content. Third, we develop a comprehensive post-training pipeline focusing on reasoning enhancement and human preference alignment, incorporating a 5-step chain-of-thought data construction process, iterative GSPO-based reinforcement learning with progressive prompt hinting for difficult cases, and alignment training. Through extensive evaluation on public benchmarks and rigorous internal human assessment, Keye-VL-1.5 demonstrates significant improvements over existing models, particularly excelling in video understanding tasks while maintaining competitive performance on general multimodal benchmarks.
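The Slow-Fast routing idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of cosine similarity over flat feature vectors, the comparison against the most recent Slow frame, and the 0.95 threshold are all assumptions chosen for illustration. The abstract only states that frames with significant visual change go to the higher-resolution Slow pathway and relatively static frames go to the lower-resolution Fast pathway.

```python
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def assign_pathways(
    frames: Sequence[Sequence[float]],
    threshold: float = 0.95,  # hypothetical value, not taken from the abstract
) -> List[str]:
    """Label each frame 'slow' (key frame) or 'fast' (near-static frame).

    Assumption: a frame is compared with the most recent 'slow' frame.
    If it is similar enough it is routed to the Fast pathway; otherwise
    it becomes a new Slow key frame.
    """
    if not frames:
        return []
    labels = ["slow"]  # the first frame has no reference, so it is a key frame
    last_slow = frames[0]
    for frame in frames[1:]:
        if cosine_similarity(frame, last_slow) >= threshold:
            labels.append("fast")
        else:
            labels.append("slow")
            last_slow = frame
    return labels


# Toy example: two near-identical frames, a scene change, then a near-copy.
toy_frames = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
pathways = assign_pathways(toy_frames)
print(pathways)
```

In a real pipeline the vectors would be per-frame image features, and the two labels would map to different resolutions and token budgets; the abstract does not give those details, so they are left out here.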

Related papers