Fugu-MT paper translation (abstract): FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

Paper abstract: FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing

arxiv url: http://arxiv.org/abs/2605.17447v1
Date: Sun, 17 May 2026 13:39:47 GMT
Status: Translation complete
Last updated in system: 2026-05-19 17:57:48.09233
Title: FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing
Title (reference translation): FastOCR: Dynamic Visual Fixation via KV Cache Pruning for Efficient Document Parsing
Authors: Zihan Tang, Leqi Shen, Hui Chen, Ao Wang, Ben Wan, Yan Feng, Ke Zhang, Sicheng Zhao, Tongxuan Liu, Guiguang Ding
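Abstract summary: We propose FastOCR, a training-free framework with two complementary modules. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step.
Reference score (independently computed attention level): 51.905216364362325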
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-Language Models (VLMs) have shown strong promise on Optical Character Recognition (OCR), yet the sheer number of visual tokens required to encode dense documents incurs prohibitive inference cost. Existing pruning methods rely on physical eviction, e.g., permanently discarding visual tokens during the prefill stage. While effective for natural images, this strategy fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element, and any irreversible loss leads to catastrophic accuracy degradation. We observe that, although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving an entire page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Specifically, Focal-Guided Pruning identifies a small set of focal layers and selects the most task-relevant visual tokens from them at each step, while Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. By dynamically adjusting which tokens are attended rather than evicting any from the cache, FastOCR avoids permanent information loss. Extensive experiments show that FastOCR serves as a plug-and-play acceleration module, generalizing consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0$\times$.
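Abstract (reference translation): Vision-Language Models (VLMs) show strong promise for Optical Character Recognition (OCR), but the large number of visual tokens needed to encode dense documents makes inference prohibitively expensive. Existing pruning methods rely on physical eviction, for example permanently discarding visual tokens during the prefill stage. This works for natural images but fundamentally breaks down on OCR, where virtually every visual token may correspond to a character or structural element and any irreversible loss causes catastrophic accuracy degradation. Although document images appear globally dense and seemingly unprunable, the model's attention to them is in fact temporally sparse: at each decoding step it concentrates on a small region that shifts gradually across steps, much as a human reader fixates on successive words rather than perceiving the whole page at once. Motivated by this Dynamic Visual Fixation phenomenon, we recast the intractable global pruning problem as a tractable local, dynamic one and propose FastOCR, a training-free framework with two complementary modules. Focal-Guided Pruning identifies a small set of focal layers and, at each step, selects the most task-relevant visual tokens from them; Cross-Step Fixation Reuse exploits the gradual shift of fixation to warm-start each step from the previous one. Because FastOCR dynamically adjusts which tokens are attended to rather than evicting any from the cache, it avoids permanent information loss. Extensive experiments show that FastOCR works as a plug-and-play acceleration module and generalizes consistently across five VLMs of varying sizes and architectures. On Qwen2.5-VL, FastOCR retains 98% of the unpruned model's accuracy while attending to only 5% of the visual tokens per decoding step, reducing attention latency by 3.0$\times$.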
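
The abstract describes two ideas: choosing a small set of visual tokens to attend to at each decoding step without evicting anything from the KV cache, and warm-starting each step's choice from the previous one. The toy sketch below illustrates only that general shape and is not the authors' implementation. The function names, the `keep_ratio` and `window` parameters, the use of a plain per-token score list in place of focal-layer attention, and the 1-D neighbourhood used for the warm start are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_fixation(scores, keep_ratio, previous=None, window=0):
    """Return sorted indices of the visual tokens to attend to at this step.

    scores     : one relevance score per cached visual token (assumed here to
                 stand in for attention read from a "focal" layer).
    keep_ratio : fraction of tokens to keep; at least one token is kept.
    previous   : indices chosen at the previous step. If given, only tokens
                 within `window` positions of them are considered, which is a
                 hypothetical stand-in for the cross-step warm start.
    """
    n = len(scores)
    k = max(1, math.ceil(keep_ratio * n))
    if previous:
        candidates = sorted(
            {j for i in previous
             for j in range(max(0, i - window), min(n, i + window + 1))}
        )
    else:
        candidates = list(range(n))
    ranked = sorted(candidates, key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def masked_attention(query, keys, values, active):
    """Scaled dot-product attention restricted to the `active` cache entries.

    The cache (`keys`, `values`) is only read, never shrunk, so tokens
    skipped at this step remain available at later steps.
    """
    d = len(query)
    logits = [
        sum(q * k for q, k in zip(query, keys[i])) / math.sqrt(d)
        for i in active
    ]
    weights = softmax(logits)
    dim = len(values[0])
    return [
        sum(w * values[i][c] for w, i in zip(weights, active))
        for c in range(dim)
    ]


# Tiny demo with four cached visual tokens.
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
values = [[1.0], [2.0], [3.0], [4.0]]

step1 = select_fixation([0.1, 0.7, 0.2, 0.05], keep_ratio=0.25)
out1 = masked_attention([1.0, 0.0], keys, values, step1)

# Next step: search only near the previous fixation instead of all tokens.
step2 = select_fixation([0.0, 0.1, 0.9, 0.95], keep_ratio=0.25,
                        previous=step1, window=1)
```

In this demo the second step picks token 2 even though token 3 scores higher, because the warm start restricts the search to neighbours of the previous fixation. That is the trade-off of the assumed window heuristic: less scoring work per step in exchange for possibly missing a far-away token. The cache still holds all four tokens after both steps.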
