Fugu-MT Paper Translation (Summary): DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning

Paper summary: DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning

arxiv url: http://arxiv.org/abs/2604.22281v1
Date: Fri, 24 Apr 2026 06:51:58 GMT
Status: Translation complete
System update date: 2026-04-27 15:36:26.370501
Title: DocPrune: Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
Title (reference translation): DocPrune: Efficient Document Question Answering via Background-, Question-, and Comprehension-Aware Token Pruning
Authors: Joonmyung Choi, Sanghyeok Lee, Jongha Kim, Sehyung Kim, Dohwan Ko, Jihyung Kil, Hyunwoo J. Kim
Abstract summary: We propose DocPrune, a training-free and progressive document token pruning framework. Experiments on M3DocRAG show that DocPrune improves throughput by 3.0x in the encoder and 3.3x in the decoder.
Reference score (independently computed attention measure): 41.26256203983725
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in vision-language models have demonstrated remarkable performance across diverse multi-modal tasks, including document question answering that leverages structured visual cues from text, tables, and figures. However, unlike natural images, document images contain large backgrounds and only sparse supporting evidence, leading to the inefficient consumption of substantial computational resources, especially for long documents. We observe that existing token-reduction methods for natural images and videos fall short in utilizing the structural sparsity unique to documents. To address this, we propose DocPrune, a training-free and progressive document token pruning framework designed for efficient long-document understanding. The proposed method preserves only the essential tokens for the task while removing unnecessary ones, such as background or question-irrelevant tokens. Moreover, it automatically selects the appropriate layers to initiate token pruning based on the model's level of comprehension. Our experiments on the M3DocRAG show that DocPrune improves throughput by 3.0x and 3.3x in the encoder and decoder, respectively, while boosting the F1 score by +1.0, achieving both higher accuracy and efficiency without any additional training.
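Abstract (reference translation): Recent advances in vision-language models have shown remarkable performance across diverse multimodal tasks, including document question answering, which draws on structured visual cues from text, tables, and figures. Unlike natural images, however, document images contain large backgrounds and only sparse supporting evidence, which leads to inefficient consumption of substantial computational resources, especially for long documents. Existing token-reduction methods for natural images and videos fall short of exploiting the structural sparsity unique to documents. We therefore propose DocPrune, a training-free, progressive document token pruning framework designed for efficient understanding of long documents. The method keeps only the tokens essential to the task and removes unnecessary ones, such as background tokens and tokens irrelevant to the question. It also automatically selects the appropriate layers at which to start token pruning, based on the model's level of comprehension. Experiments on M3DocRAG show that DocPrune improves throughput by 3.0x in the encoder and 3.3x in the decoder while raising the F1 score by 1.0, achieving both higher accuracy and higher efficiency without any additional training.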
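
The abstract names two pruning criteria: background tokens and question-irrelevant tokens. The following is a minimal sketch of that general idea only, not the authors' algorithm. The function names, the variance-based background test, the cosine-similarity ranking, and the `background_threshold` and `keep_ratio` parameters are all assumptions made for illustration. The comprehension-aware choice of starting layer is not modeled here.

```python
import math


def patch_variance(patch):
    """Variance of the pixel intensities in one patch (a flat list of floats)."""
    mean = sum(patch) / len(patch)
    return sum((p - mean) ** 2 for p in patch) / len(patch)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prune_tokens(patches, token_embeddings, question_embedding,
                 background_threshold=1e-3, keep_ratio=0.5):
    """Return the sorted indices of the tokens that survive both stages.

    Hypothetical two-stage heuristic (an assumption, not the paper's method):
      1. drop patches whose pixel variance is at or below background_threshold;
      2. of the rest, keep the keep_ratio fraction most similar to the question.
    """
    # Stage 1: near-uniform patches are treated as blank page background.
    foreground = [i for i, p in enumerate(patches)
                  if patch_variance(p) > background_threshold]
    # Stage 2: rank the remaining tokens by similarity to the question.
    ranked = sorted(foreground,
                    key=lambda i: cosine(token_embeddings[i], question_embedding),
                    reverse=True)
    keep = math.ceil(len(ranked) * keep_ratio) if ranked else 0
    return sorted(ranked[:keep])


# Toy input: patches 0 and 3 are uniform (background), the others carry content.
patches = [
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.2, 0.9, 0.1, 0.8],
    [1.0, 1.0, 1.0, 1.0],
    [0.0, 0.5, 1.0, 0.5],
]
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [0.7, 0.7]]
question = [1.0, 0.0]

kept = prune_tokens(patches, embeddings, question)
print(kept)  # [1, 4]
```

In this toy input, patches 0 and 3 are removed as background even though their embeddings match the question exactly, which shows why the background test runs before the similarity ranking.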

Related papers