Self-Prophetic Decoding to Unlock Visual Search in LVLMs
Abstract Overview
This paper studies visual search in large vision-language models (LVLMs) and argues that current post-trained search models suffer from two linked problems: degradation of intrinsic single-step abilities after visual-search training, and interference from long multi-step reasoning contexts. To address these problems, the authors introduce SeProD, a self-prophetic decoding framework that pairs a post-trained visual-search model with its pre-training counterpart. The pre-training model acts as a prophet that proposes single-step prefixes, while the search model selectively accepts the proposed tokens through a probability-based decoding rule designed to preserve its native output distribution. The method is training-free and plug-and-play, and it evaluates prefixes in parallel so that the additional guidance adds no inference overhead in the reported setup.
Novelty
The distinctive idea is to use self-regulation between a post-trained LVLM and its pre-training counterpart during inference, rather than adding external tools or retraining the model. The paper also proposes a probability-based prophetic sampling interface that treats the pre-training model's outputs as candidate tokens and accepts them only when they are sufficiently consistent with both models' distributions.
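The summary does not spell out the exact acceptance rule, so the following is only a minimal sketch of what a probability-based acceptance test could look like. It assumes the standard speculative-sampling ratio test, accepting a proposed token with probability min(1, q/p), where p is the prophet's probability and q is the search model's probability. This is a swapped-in technique for illustration, not necessarily SeProD's rule. The function name `accept_prefix` and the dictionary-based distributions are hypothetical, and the resampling step that speculative sampling performs after a rejection is omitted.

```python
import random


def accept_prefix(prefix, prophet_probs, search_probs, rng):
    """Count how many leading prefix tokens the search model accepts.

    prefix        -- tokens proposed by the prophet (hypothetical interface)
    prophet_probs -- one {token: probability} dict per position (prophet)
    search_probs  -- one {token: probability} dict per position (search model)
    rng           -- a random.Random instance, passed in for reproducibility
    """
    accepted = 0
    for tok, p_dist, q_dist in zip(prefix, prophet_probs, search_probs):
        p = p_dist[tok]            # prophet proposed tok, so p > 0
        q = q_dist.get(tok, 0.0)   # search model may give tok no mass
        # Assumed ratio test: accept with probability min(1, q / p).
        if rng.random() < min(1.0, q / p):
            accepted += 1
        else:
            break                  # stop at the first rejected token
    return accepted


# Toy example: the models agree on the first token, disagree on the second.
prefix = ["zoom", "left"]
prophet = [{"zoom": 0.9, "stop": 0.1}, {"left": 0.8, "right": 0.2}]
search = [{"zoom": 0.9, "stop": 0.1}, {"right": 1.0}]
n_accepted = accept_prefix(prefix, prophet, search, random.Random(0))
print(n_accepted)
```

In this toy case the first token is always accepted, because the ratio is 1, and the second is always rejected, because the search model assigns it zero probability. All positions can be scored in a single forward pass of the search model, which is how a parallel prefix check can avoid extra sequential decoding steps.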
Results
Across all 12 splits of 4 visual-search benchmark settings, SeProD improves Pixel Reasoner, DeepEyes, and Mini-o3 over their original versions, with especially strong gains on harder VisualProbe and spatial reasoning subsets. The paper also reports better performance on general VQA benchmarks, including improvements from Mini-o3 to SeProD on MME-RealWorld (65.5 to 67.7), ScienceQA (84.5 to 85.4), OCRBench (83.8 to 85.3), and CVBench (74.4 to 78.4). On VisualProbe with Mini-o3, accepted prophetic-prefix rates range from 74.2% to 80.7%, alongside reported inference speedups of 1.03x to 1.07x.
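As a quick check on the general VQA numbers above, the absolute gains from Mini-o3 to SeProD can be computed directly. The scores are copied from the text; the variable names are illustrative only.

```python
# (Mini-o3 score, SeProD score) per benchmark, as quoted in the summary.
scores = {
    "MME-RealWorld": (65.5, 67.7),
    "ScienceQA": (84.5, 85.4),
    "OCRBench": (83.8, 85.3),
    "CVBench": (74.4, 78.4),
}

# Absolute gain in points, rounded to one decimal to hide float noise.
gains = {
    name: round(after - before, 1)
    for name, (before, after) in scores.items()
}

for name, gain in gains.items():
    print(f"{name}: +{gain} points")
```

The gains are +2.2, +0.9, +1.5, and +4.0 points respectively, so the largest reported improvement among these four benchmarks is on CVBench.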
Key Points
- SeProD addresses capability incompatibility after visual-search post-training and interference from long multi-step contexts by coupling a search model with its pre-training counterpart.
- Its core mechanism is probability-based prophetic decoding, where candidate prefix tokens from the prophet are accepted only if they align with both the prophet and search model distributions.
- Experiments show consistent performance gains across multiple LVLM families and high-resolution visual-search benchmarks. The reported implementation adds no computational overhead and even achieves slight speedups (1.03x to 1.07x on VisualProbe with Mini-o3), which the summary attributes to parallel prefix evaluation.