Fugu-MT Paper Translation (Abstract): InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

Paper summary: InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models

arxiv url: http://arxiv.org/abs/2512.08829v1
Date: Tue, 09 Dec 2025 17:18:32 GMT
Status: Translation complete
System update date: 2025-12-10 22:28:08.064888
Title: InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Title (reference translation): InfiniteVL: Synergizing Linear and Sparse Attention for Highly-Efficient, Unlimited-Input Vision-Language Models
Authors: Hongyuan Tao, Bencheng Liao, Shaoyu Chen, Haoran Yin, Qian Zhang, Wenyu Liu, Xinggang Wang
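Abstract summary: We propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving a long-term memory cache.
Reference score (independently computed attention level): 49.08289742711585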
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Window attention and linear attention represent two principal strategies for mitigating the quadratic complexity and ever-growing KV cache in Vision-Language Models (VLMs). However, we observe that window-based VLMs suffer performance degradation when sequence length exceeds the window size, while linear attention underperforms on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. For achieving competitive multimodal performance under constrained resources, we design a three-stage training strategy comprising distillation pretraining, instruction tuning, and long-sequence SFT. Remarkably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while demonstrating effective long-term memory retention. Compared to similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.
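Abstract (reference translation): Window attention and linear attention are the two main strategies for reducing quadratic complexity and the ever-growing KV cache in Vision-Language Models (VLMs). Window-based VLMs, however, lose performance once the sequence length exceeds the window size, while linear attention falls short on information-intensive tasks such as OCR and document understanding. To overcome these limitations, we propose InfiniteVL, a linear-complexity VLM architecture that synergizes sliding window attention (SWA) with Gated DeltaNet. To achieve competitive multimodal performance under constrained resources, we design a three-stage training strategy consisting of distillation pretraining, instruction tuning, and long-sequence SFT. Notably, using less than 2% of the training data required by leading VLMs, InfiniteVL not only substantially outperforms previous linear-complexity VLMs but also matches the performance of leading Transformer-based VLMs, while showing effective long-term memory retention. Compared with similar-sized Transformer-based VLMs accelerated by FlashAttention-2, InfiniteVL achieves over 3.6× inference speedup while maintaining constant latency and memory footprint. In streaming video understanding scenarios, it sustains a stable 24 FPS real-time prefill speed while preserving a long-term memory cache. Code and models are available at https://github.com/hustvl/InfiniteVL.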
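
The abstract names two building blocks, sliding window attention and Gated DeltaNet, without giving layer-level details. The following is a minimal pure-Python sketch of the two ideas only, not the authors' implementation: the function names (`sliding_window_mask`, `gated_delta_step`, `read_state`) are hypothetical, the state update is the gated delta rule as it is commonly written for Gated DeltaNet, and how InfiniteVL actually arranges or mixes these layers is not stated in the abstract and is not modeled here.

```python
def sliding_window_mask(seq_len, window):
    """Causal sliding-window mask: query i may attend to key j
    only if i - window < j <= i, so each row has at most `window` True entries."""
    return [[i - window < j <= i for j in range(seq_len)]
            for i in range(seq_len)]


def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update of a fixed-size (d_v x d_k) state matrix S.

    Assumed form: S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T
    alpha in [0, 1] decays old memory; beta in [0, 1] is the write strength.
    """
    d_v, d_k = len(S), len(k)
    # S k: what the current state would return for key k.
    Sk = [sum(S[r][c] * k[c] for c in range(d_k)) for r in range(d_v)]
    return [[alpha * (S[r][c] - beta * Sk[r] * k[c]) + beta * v[r] * k[c]
             for c in range(d_k)]
            for r in range(d_v)]


def read_state(S, q):
    """Read from the state with query q (output = S q)."""
    return [sum(S[r][c] * q[c] for c in range(len(q))) for r in range(len(S))]


if __name__ == "__main__":
    # Each token sees at most `window` recent tokens, so the KV cache is bounded.
    for row in sliding_window_mask(seq_len=5, window=2):
        print(row)

    # The recurrent state keeps the same shape no matter how many tokens arrive.
    S = [[0.0, 0.0], [0.0, 0.0]]
    key = [1.0, 0.0]
    S = gated_delta_step(S, key, [2.0, 3.0], alpha=1.0, beta=1.0)
    print(read_state(S, key))
    # Writing a new value under the same key replaces the old one.
    S = gated_delta_step(S, key, [5.0, 7.0], alpha=1.0, beta=1.0)
    print(read_state(S, key))
```

In this sketch the window bounds the exact, local part of the context, while the fixed-size matrix `S` stands in for a compressed long-range memory; both cost a constant amount of memory per new token, which is the property the abstract attributes to the architecture.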


Related papers