FuguReport

Self-Prophetic Decoding to Unlock Visual Search in LVLMs

Authors Zhendong He, Qiyuan Dai, Guanbin Li, Liang Lin, Sibei Yang
Affiliations Sun Yat-Sen University / ShanghaiTech University
Categories Method / Decoding / Self-prophetic decoding framework, Application / Visual Search / Multi-step multi-modal inference, Evaluation / Model Capability Assessment / Addressing inference context interference
License CC BY 4.0

Abstract Overview

This paper studies visual search in large vision-language models (LVLMs) and argues that current post-trained search models suffer from two linked problems: visual-search training degrades the model's intrinsic single-step abilities, and long multi-step reasoning contexts interfere with inference. To address both, the authors introduce SeProD, a self-prophetic decoding framework that pairs a post-trained visual-search model with its pre-training counterpart. The pre-training counterpart acts as a prophet that proposes single-step prefixes, and the search model selectively accepts the proposed tokens through a probability-based decoding rule designed to preserve the search model's native output distribution. The method is training-free and plug-and-play, and it evaluates prefixes in parallel so that the additional guidance adds no inference overhead in the reported setup.

Novelty

The distinctive idea is self-regulation at inference time between a post-trained LVLM and its pre-training counterpart, rather than adding external tools or retraining the model. The paper also proposes a probability-based prophetic sampling interface that treats the pre-training counterpart's outputs as candidate tokens and accepts them only when they are sufficiently consistent with both models' distributions.
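The summary does not spell out the paper's exact acceptance rule, so the following is only a minimal sketch of one well-known rule with the same stated goal: the standard speculative-sampling test, which accepts a proposed token with probability min(1, p_search / p_prophet) and otherwise resamples from the normalized residual max(0, p_search - p_prophet). Under that rule the emitted token follows the search model's distribution. The function name `accept_or_resample` and the dictionary-based distributions are assumptions made for illustration, not names from the paper.

```python
import random

def accept_or_resample(token, p_search, p_prophet, rng):
    """Speculative-sampling-style test for one proposed token (illustrative).

    p_search / p_prophet: dicts mapping token -> probability under the
    search model and the prophet. Returns (emitted_token, accepted_flag).
    """
    q = p_prophet.get(token, 0.0)
    p = p_search.get(token, 0.0)
    # Accept the prophet's token with probability min(1, p / q).
    if q > 0.0 and rng.random() < min(1.0, p / q):
        return token, True
    # Rejected: sample from the residual max(0, p_search - p_prophet).
    residual = {t: max(0.0, pt - p_prophet.get(t, 0.0))
                for t, pt in p_search.items()}
    total = sum(residual.values())
    if total <= 0.0:
        # Degenerate case (identical distributions): sample from p_search.
        residual, total = dict(p_search), sum(p_search.values())
    r = rng.random() * total
    cumulative = 0.0
    for t, weight in residual.items():
        cumulative += weight
        if r < cumulative:
            return t, False
    return t, False  # guard against floating-point rounding

# Hypothetical toy distributions over three tokens.
rng = random.Random(0)
prophet = {"a": 0.7, "b": 0.2, "c": 0.1}
search = {"a": 0.4, "b": 0.5, "c": 0.1}
proposed = rng.choices(list(prophet), weights=list(prophet.values()), k=1)[0]
print(accept_or_resample(proposed, search, prophet, rng))
```

In this toy case the prophet over-proposes "a", so some "a" proposals are rejected and replaced by "b"; over many draws the emitted tokens match the search model's probabilities rather than the prophet's.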

Results

Across all 12 splits of 4 visual-search benchmark settings, SeProD improves Pixel Reasoner, DeepEyes, and Mini-o3 over their original versions, with especially strong gains on the harder VisualProbe and spatial reasoning subsets. The paper also reports better performance on general VQA benchmarks: applying SeProD to Mini-o3 raises scores on MME-RealWorld (65.5 to 67.7), ScienceQA (84.5 to 85.4), OCRBench (83.8 to 85.3), and CVBench (74.4 to 78.4). On VisualProbe with Mini-o3, accepted prophetic-prefix rates range from 74.2% to 80.7%, alongside reported inference speedups of 1.03x to 1.07x.
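For a quick comparison, the absolute gains implied by the general VQA scores quoted above can be computed directly. The snippet below uses only the four score pairs reported in this summary; the variable and function names are illustrative.

```python
# (Mini-o3, Mini-o3 + SeProD) scores as quoted in this summary.
scores = {
    "MME-RealWorld": (65.5, 67.7),
    "ScienceQA": (84.5, 85.4),
    "OCRBench": (83.8, 85.3),
    "CVBench": (74.4, 78.4),
}

def absolute_gains(pairs):
    """Return the score difference (after - before), rounded to one decimal."""
    return {name: round(after - before, 1)
            for name, (before, after) in pairs.items()}

for name, delta in absolute_gains(scores).items():
    print(f"{name}: +{delta:.1f}")
# MME-RealWorld: +2.2
# ScienceQA: +0.9
# OCRBench: +1.5
# CVBench: +4.0
```

On these numbers the largest gain is on CVBench (+4.0) and the smallest on ScienceQA (+0.9), each measured on that benchmark's reported scale.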

Key Points

  1. SeProD addresses capability incompatibility after visual-search post-training and interference from long multi-step contexts by coupling a search model with its pre-training counterpart.
  2. Its core mechanism is probability-based prophetic decoding, where candidate prefix tokens from the prophet are accepted only if they align with both the prophet and search model distributions.
  3. Experiments show consistent performance gains across multiple LVLM families and high-resolution visual-search benchmarks, while the reported implementation adds no computational overhead and even achieves slight speedups due to parallel evaluation.
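The prefix-level behavior described in the key points can be pictured with a small sketch. It assumes, as in generic speculative decoding, that the search model scores every proposed prefix token in one batched pass and that acceptance stops at the first rejected position; the summary does not confirm that SeProD works exactly this way. The function `accepted_prefix_length` and its inputs are hypothetical.

```python
def accepted_prefix_length(search_probs, prophet_probs, uniforms):
    """Count how many leading proposed tokens survive verification.

    search_probs[i] and prophet_probs[i] are the probabilities each model
    assigns to the i-th proposed token (assumed to come from one batched
    forward pass). uniforms[i] is a U(0, 1) draw for position i, passed in
    explicitly so the example is reproducible.
    """
    accepted = 0
    for p, q, u in zip(search_probs, prophet_probs, uniforms):
        # Stop at the first position that fails the ratio test.
        if q <= 0.0 or u >= min(1.0, p / q):
            break
        accepted += 1
    return accepted

# Hypothetical per-position probabilities for a 4-token proposed prefix.
search_probs = [0.9, 0.6, 0.05, 0.8]
prophet_probs = [0.8, 0.7, 0.5, 0.8]
uniforms = [0.5, 0.5, 0.5, 0.5]
print(accepted_prefix_length(search_probs, prophet_probs, uniforms))  # 2
```

Because all positions are scored together, a long accepted prefix replaces several sequential decoding steps with a single verification pass, which is one plausible reading of why high acceptance rates can coincide with slight speedups.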

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.