Fugu-MT Paper Translation (Abstract): RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

Paper overview: RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference

arxiv url: http://arxiv.org/abs/2605.00392v1
Date: Fri, 01 May 2026 04:30:16 GMT
Status: Translation complete
System update date: 2026-05-04 17:43:28.843463
Title: RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
Title (reference translation): RTPrune: Reading-Twice Inspired Token Pruning for Efficient DeepSeek-OCR Inference
Authors: Ben Wan, Yan Feng, Zihan Tang, Weizhe Huang, Yuting Zeng, Jia Wang, Tongxuan Liu
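Abstract summary: This paper proposes a two-stage token pruning method tailored for DeepSeek-OCR. The first stage prioritizes high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation.
Reference score (independently computed attention score): 17.01369106080539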
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference, yet its visual tokens still carry redundant textual and structural information. Moreover, current token pruning methods for conventional vision-language models (VLMs) fail to preserve textual fidelity due to improper compression mechanisms. By analyzing the decoding process of DeepSeek-OCR, we find a distinct two-stage reading trajectory: the model initially prioritizes the majority of high-norm tokens, then redistributes its attention to the remaining ones. Motivated by this insight, we propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. In the first stage, we prioritize high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. We further introduce a dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks, enabling a better efficiency-accuracy trade-off. Extensive experiments demonstrate state-of-the-art performance: applied to DeepSeek-OCR-Large, RTPrune reaches 99.47% accuracy and 1.23$\times$ faster prefill on OmniDocBench with 84.25% token retention.
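Abstract (reference translation): DeepSeek-OCR leverages visual-text compression to reduce long-text processing costs and accelerate inference. Moreover, current token pruning methods for conventional vision-language models (VLMs) struggle to preserve textual fidelity because of improper compression mechanisms. Analyzing the decoding process of DeepSeek-OCR reveals a two-stage reading trajectory: the model first prioritizes most of the high-norm tokens, then turns its attention to the remaining ones. Motivated by this finding, the authors propose RTPrune, a two-stage token pruning method tailored for DeepSeek-OCR. The first stage prioritizes high-norm visual tokens that capture salient textual and structural information. In the second stage, the remaining tokens are paired and merged based on optimal transport theory to achieve efficient feature aggregation. A dynamic pruning ratio that adapts to token similarity and textual density for OCR tasks further improves the efficiency-accuracy trade-off. Applied to DeepSeek-OCR-Large, the method achieves 99.47% accuracy and 1.23$\times$ faster prefill on OmniDocBench with 84.25% token retention.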
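
The two stages described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not specify how tokens are paired, which cost is used, or how the dynamic pruning ratio is computed. The sketch assumes a fixed `keep_ratio`, a cosine-distance cost, an even/odd split of the remaining tokens into sources and targets, and an entropic (Sinkhorn) approximation of the optimal transport plan. The names `sinkhorn` and `two_stage_prune` and all parameters are hypothetical.

```python
import numpy as np


def sinkhorn(cost, reg=0.1, n_iters=200):
    """Entropic-regularized OT plan between two uniform distributions.

    cost has shape (n_src, n_tgt); the returned plan has the same shape.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on source tokens
    b = np.full(m, 1.0 / m)  # uniform mass on target tokens
    K = np.exp(-cost / reg)
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def two_stage_prune(tokens, keep_ratio=0.5, reg=0.1):
    """Illustrative two-stage reduction of a (n_tokens, dim) array.

    Stage 1 keeps the highest-norm tokens unchanged.
    Stage 2 merges the remaining tokens using an OT plan (assumed scheme).
    """
    n = tokens.shape[0]
    norms = np.linalg.norm(tokens, axis=1)
    n_keep = min(n, max(1, int(round(keep_ratio * n))))

    # Stage 1: select the top-norm tokens, preserving their original order.
    order = np.argsort(-norms)
    kept = tokens[np.sort(order[:n_keep])]
    rest = tokens[np.sort(order[n_keep:])]
    if rest.shape[0] < 2:
        return np.concatenate([kept, rest], axis=0)

    # Stage 2 (assumption): split the rest into sources and targets,
    # then move source features onto targets along the transport plan.
    src, tgt = rest[0::2], rest[1::2]
    eps = 1e-8
    src_n = src / (np.linalg.norm(src, axis=1, keepdims=True) + eps)
    tgt_n = tgt / (np.linalg.norm(tgt, axis=1, keepdims=True) + eps)
    cost = 1.0 - src_n @ tgt_n.T  # cosine distance, shape (n_src, n_tgt)

    plan = sinkhorn(cost, reg=reg)
    weights = plan / plan.sum(axis=0, keepdims=True)  # normalize per target
    pooled = weights.T @ src  # barycentric projection onto targets
    merged = 0.5 * (tgt + pooled)

    return np.concatenate([kept, merged], axis=0)
```

With 16 input tokens and `keep_ratio=0.5`, this sketch keeps 8 tokens unchanged and merges the other 8 into 4, returning 12 tokens. The ratio is fixed here for simplicity, whereas the paper adapts it to token similarity and textual density.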

