Fugu-MT Paper Translation (Abstract): What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

Paper abstract: What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

arxiv url: http://arxiv.org/abs/2605.12549v1
Date: Sun, 10 May 2026 07:04:07 GMT
Status: Translation complete
Last updated in system: 2026-05-14 23:30:27.568966
Title: What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
Title (reference translation): What happens before decoding? Prefill determines GUI grounding in VLMs
Authors: Jiaping Lin, Fei Shen, Junzhe Li, Ping Nie, Fei Yu, Ming Li, Haizhou Li
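Abstract summary: The paper examines what happens during GUI grounding in Vision-Language Models (VLMs) and identifies a previously overlooked bottleneck. The prefill stage determines the candidate UI elements, while the decoding stage refines the final coordinates. Re-Prefill is a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection.
Reference score (independently computed attention score): 33.91859613266694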
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection. Specifically, visual tokens that consistently receive high attention from the query position, i.e., the final token, across layers are extracted as a preliminary target hypothesis and appended to the input, together with the instruction hidden states, enabling the model to deeply re-think its decision before coordinate generation. Experiments across four VLMs and five benchmarks, including ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI, demonstrate consistent improvements without additional training, with gains of up to 4.3% on ScreenSpot-Pro. Code will be available at https://github.com/linjiaping1/Re-Prefill.
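Abstract (reference translation): Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass interprets the instruction and parses the visual layout independently, with no progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. The prefill stage determines the candidate UI elements, while the decoding stage refines the final coordinates. This asymmetry makes prefill the critical step, because errors in candidate selection cannot be effectively corrected during decoding. We therefore propose Re-Prefill, a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection. Specifically, visual tokens that consistently receive high attention from the query position (the final token) are extracted as a preliminary target hypothesis and appended to the input together with the instruction hidden states, allowing the model to reconsider its decision in depth before generating coordinates. Experiments on four VLMs and five benchmarks (ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI) show gains of up to 4.3% on ScreenSpot-Pro. Code will be available at https://github.com/linjiaping1/Re-Prefill.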
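
The token-selection step described in the abstract (visual tokens that consistently receive high attention from the final token across layers) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the exact selection rule, so the per-layer top-k vote, the `min_layer_frac` threshold, the function name, and the head-averaged attention input are all assumptions made for illustration.

```python
from collections import Counter


def consistent_high_attention_tokens(attn_per_layer, top_k=2, min_layer_frac=0.6):
    """Pick visual tokens that rank in the top-k attention targets of the
    query (final) token in at least `min_layer_frac` of the layers.

    attn_per_layer[l][i] is a hypothetical, head-averaged attention weight
    from the final token to visual token i at layer l. The voting rule is
    an assumption, not the rule used in the paper.
    """
    if not attn_per_layer:
        return []
    counts = Counter()
    for weights in attn_per_layer:
        # Rank visual-token indices by attention weight, highest first.
        ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        counts.update(ranked[:top_k])
    threshold = min_layer_frac * len(attn_per_layer)
    return sorted(i for i, c in counts.items() if c >= threshold)


# Toy data: 3 layers, 5 visual tokens (made-up weights).
toy_attention = [
    [0.05, 0.40, 0.35, 0.10, 0.10],
    [0.10, 0.45, 0.05, 0.30, 0.10],
    [0.05, 0.50, 0.30, 0.05, 0.10],
]
selected = consistent_high_attention_tokens(toy_attention, top_k=2, min_layer_frac=0.6)
print(selected)  # [1, 2]
```

In this toy input, token 1 is in the top two at every layer and token 2 at two of three layers, so both pass the 60% vote, while token 3 appears only once and is dropped. The sketch covers only the selection of a preliminary target hypothesis; appending the selected tokens and the instruction hidden states to the input for a second prefill pass is model-specific and is not shown.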

Related papers