Lost in Decoding? Reproducing and Stress-Testing the Look-Ahead Prior in Generative Retrieval
Abstract Overview
This paper reproduces Planning Ahead in Generative Retrieval (PAG) using the authors' released checkpoint, identifier sets, and trie, and then stress-tests the method under query variation and cross-lingual query shift. The study focuses on how PAG's planning stage supplies a document-level look-ahead bonus during trie-constrained beam search, and introduces diagnostics for candidate-set drift, planner-token drift, plan swapping, and plan collapse. Under the reported inference setup, the reproduced results match the original effectiveness on MS MARCO Dev and TREC-DL 2019/2020 within 0.002 absolute difference, and they corroborate the expected beam-size versus latency trade-off. Beyond reproduction, the authors show that lexical query variations (misspellings, synonyms, paraphrases) can destabilize the planning signal and reduce the usefulness of planning-guided decoding. They also show that, under fixed-index cross-lingual retrieval, translating queries into English recovers more performance than lightweight planner-token alignment.
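To make the idea of a look-ahead bonus in trie-constrained beam search concrete, the following is a minimal toy sketch and not the authors' implementation. It assumes that each prefix is ranked by its accumulated token score plus the best planning score of any candidate document whose identifier starts with that prefix; this combination rule, the fallback value `default`, and all names (`build_trie`, `lookahead_bonus`, `guided_beam_search`, the toy scores) are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Prefix = Tuple[int, ...]

def build_trie(doc_ids: List[Prefix]) -> dict:
    """Nested-dict trie over identifier token sequences."""
    trie: dict = {}
    for doc_id in doc_ids:
        node = trie
        for tok in doc_id:
            node = node.setdefault(tok, {})
    return trie

def lookahead_bonus(prefix: Prefix, plan_scores: Dict[Prefix, float],
                    default: float = 0.0) -> float:
    """Assumed rule: best planning score among documents extending `prefix`."""
    matches = [s for d, s in plan_scores.items() if d[:len(prefix)] == prefix]
    return max(matches) if matches else default

def guided_beam_search(trie: dict, token_score: Callable[[Prefix], float],
                       plan_scores: Dict[Prefix, float],
                       length: int, beam_size: int) -> List[Tuple[Prefix, float]]:
    beams: List[Tuple[Prefix, float]] = [((), 0.0)]
    for _ in range(length):
        expanded = []
        for prefix, seq_score in beams:
            node = trie
            for tok in prefix:          # walk to the node for this prefix
                node = node[tok]
            for tok in node:            # expand only children present in the trie
                new_prefix = prefix + (tok,)
                expanded.append((new_prefix, seq_score + token_score(new_prefix)))
        # Rank by token score plus the (assumed) look-ahead bonus.
        expanded.sort(key=lambda b: b[1] + lookahead_bonus(b[0], plan_scores),
                      reverse=True)
        beams = expanded[:beam_size]
    return [(p, s + lookahead_bonus(p, plan_scores)) for p, s in beams]

# Toy data (hypothetical): three documents with two-token identifiers.
DOCS = [(1, 2), (1, 3), (2, 4)]
TOKEN_SCORES = {(1,): 1.0, (2,): 0.5, (1, 2): 0.25, (1, 3): 0.75, (2, 4): 0.25}
PLAN = {(2, 4): 5.0, (1, 2): 1.0}       # planning-stage scores for candidates

trie = build_trie(DOCS)
guided = guided_beam_search(trie, TOKEN_SCORES.get, PLAN, length=2, beam_size=1)
unguided = guided_beam_search(trie, TOKEN_SCORES.get, {}, length=2, beam_size=1)
```

In this toy run with a beam of size 1, the unguided search commits to the locally best first token and ends at document `(1, 3)`, while the guided search follows the planning signal to `(2, 4)`. The same mechanism suggests why a drifting or collapsed plan could mislead decoding: the bonus would then steer the beam toward the wrong prefixes.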
Novelty
The paper's main novelty is a systematic reproduction and robustness analysis of PAG that instruments the intermediate planning signal rather than evaluating only end-to-end ranking metrics. It introduces explicit plan-drift and plan-collapse diagnostics (CandOverlap@K, TokJaccard@ℓ, PlanSwapDrop, SeqGain) and extends evaluation to fixed-index cross-lingual query shift with query-side mitigation strategies that avoid re-indexing.
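The summary names the diagnostics but does not define them, so the sketch below shows only one plausible reading of the two overlap measures. It assumes that CandOverlap@K is the fraction of the top-K planning candidates shared between an original query and its perturbed variant, and that TokJaccard@ℓ is the Jaccard similarity of the top-ℓ planner tokens for the two queries. The function names, the normalization by K, and the empty-set convention are assumptions; PlanSwapDrop and SeqGain are left out because their definitions cannot be inferred from the text.

```python
from typing import Hashable, Sequence

def cand_overlap_at_k(orig_candidates: Sequence[Hashable],
                      pert_candidates: Sequence[Hashable], k: int) -> float:
    """Assumed definition: shared top-k candidates, normalized by k."""
    if k <= 0:
        raise ValueError("k must be positive")
    shared = set(orig_candidates[:k]) & set(pert_candidates[:k])
    return len(shared) / k

def tok_jaccard_at_l(orig_tokens: Sequence[Hashable],
                     pert_tokens: Sequence[Hashable], l: int) -> float:
    """Assumed definition: Jaccard similarity of the top-l planner tokens."""
    a, b = set(orig_tokens[:l]), set(pert_tokens[:l])
    if not a and not b:
        return 1.0  # convention chosen here: two empty plans count as identical
    return len(a & b) / len(a | b)

# Hypothetical example: candidates and planner tokens for a query and a misspelled variant.
overlap = cand_overlap_at_k(["d1", "d2", "d3", "d4"], ["d3", "d1", "d9", "d8"], k=4)
jaccard = tok_jaccard_at_l([5, 7, 9], [7, 9, 11], l=3)
```

Under these assumed definitions, both measures lie in [0, 1], with 1 meaning the perturbation left the plan unchanged.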
Results
Using the released artifacts, the authors reproduce PAG's headline effectiveness within 0.002 absolute difference, including MRR@10 of 0.386 on MS MARCO Dev and NDCG@10 of 0.703/0.701 on TREC-DL 2019/2020. Stress tests show that misspellings, synonym substitutions, and paraphrases cause substantially larger effectiveness degradation (e.g., 0.217 NDCG@10 drop for misspellings on DL19) than reordering (0.014), with corresponding drops in candidate-set and planner-token overlap. In cross-lingual settings with a fixed English index, query translation yields the strongest recovery (e.g., MRR@10 improving from 0.090 to 0.230 for Dutch), while planner-token alignment provides only partial gains.
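For readers less familiar with the reported metrics, the following is a small per-query sketch of reciprocal rank at 10 (averaged over queries to give MRR@10) and NDCG@10. It uses graded relevance directly as gain with a log2 rank discount, which is one common convention; the summary does not state which gain formulation or evaluation tool the authors used, so treat that choice and the function names as assumptions.

```python
import math
from typing import Dict, Sequence, Set

def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """1/rank of the first relevant document in the top k, or 0 if none."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked: Sequence[str], gains: Dict[str, float], k: int = 10) -> float:
    """NDCG with linear gain (assumed) and log2(rank + 1) discount."""
    dcg = sum(gains.get(doc, 0.0) / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Hypothetical single-query example.
rr = reciprocal_rank_at_k(["a", "b", "c"], {"b"})
perfect = ndcg_at_k(["a", "b"], {"a": 2.0, "b": 1.0})
swapped = ndcg_at_k(["b", "a"], {"a": 2.0, "b": 1.0})
```

A reported drop such as 0.217 NDCG@10 is then the difference between the query-averaged values of this metric on the original and the perturbed query sets.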
Key Points
- The released PAG artifacts are sufficient to reproduce the paper's main inference-time effectiveness results within 0.002 absolute difference, along with the qualitative beam-size versus latency trade-off. Ablations confirm that removing the look-ahead term reduces MRR@10 by 0.036, and that planning-only retrieval drops further, by 0.083.
- The planning signal is brittle under lexical surface-form variation. Candidate-set overlap (CandOverlap@100) drops to 0.31–0.50 for misspellings and synonyms, versus about 0.80 for reordering. Plan-collapse rates reach 9.6–11.6% on TREC-DL under the harder perturbations, coinciding with weakened guided decoding.
- Under a fixed English index, cross-lingual performance degrades substantially (e.g., naive Dutch MRR@10 of 0.090), and translating queries into English is considerably more effective than lightweight planner-token alignment for restoring planner overlap and retrieval quality (Dutch MRR@10 0.230 vs. 0.107).