FuguReport

Lost in Decoding? Reproducing and Stress-Testing the Look-Ahead Prior in Generative Retrieval

Authors Kidist Amde Mekonnen, Yongkang Li, Yubao Tang, Simon Lupart, Maarten de Rijke
Affiliations University of Amsterdam
Categories Method / Decoding / Look-ahead prior computation in generative retrieval, Evaluation / Model Robustness / Stress-testing prefix pruning in beam search, Task / Information Retrieval / Document ranking by generative methods
License CC BY 4.0

Abstract Overview

This paper reproduces Planning Ahead in Generative Retrieval (PAG) using the authors' released checkpoint, identifier sets, and trie, and then stress-tests the method under query variation and cross-lingual query shift. The study focuses on how PAG's planning stage supplies a document-level look-ahead bonus during trie-constrained beam search, and introduces diagnostics for candidate-set drift, planner-token drift, plan swapping, and plan collapse. Under the reported inference setup, the reproduced results match the original effectiveness on MS MARCO Dev and TREC-DL 2019/2020 to within 0.002 absolute difference and corroborate the expected beam-size versus latency trade-off. Beyond reproduction, the authors show that lexical query variations (misspellings, synonyms, paraphrases) can destabilize the planning signal and reduce the usefulness of planning-guided decoding, and that under fixed-index cross-lingual retrieval, translating queries into English recovers more performance than lightweight planner-token alignment.
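To make the idea of a look-ahead bonus in trie-constrained beam search concrete, here is a minimal toy sketch. It is not the authors' implementation. It assumes that the bonus for a prefix is the maximum planning score among documents whose identifier starts with that prefix, that all identifiers have the same length, and that identifiers are unique. The names `guided_beam_search`, `plan_scores`, and `token_score`, and all toy values, are hypothetical.

```python
from collections import defaultdict


def build_prefix_index(doc_ids):
    """Map every identifier prefix to the set of documents that share it."""
    index = defaultdict(set)
    for doc, tokens in doc_ids.items():
        for i in range(len(tokens) + 1):
            index[tuple(tokens[:i])].add(doc)
    return index


def guided_beam_search(doc_ids, plan_scores, token_score, beam_size):
    """Beam search over valid identifier prefixes with an assumed max-based bonus."""
    index = build_prefix_index(doc_ids)
    length = len(next(iter(doc_ids.values())))  # assumption: equal-length identifiers

    def guided(item):
        prefix, seq = item
        # Assumed look-ahead bonus: best planning score reachable from this prefix.
        bonus = max(plan_scores.get(d, 0.0) for d in index[prefix])
        return seq + bonus

    beams = [((), 0.0)]  # (prefix, accumulated sequential score)
    for step in range(length):
        candidates = {}
        for prefix, seq in beams:
            # Trie constraint: only tokens that continue toward a real document.
            for tok in {doc_ids[d][step] for d in index[prefix]}:
                candidates[prefix + (tok,)] = seq + token_score(step, tok)
        beams = sorted(candidates.items(), key=guided, reverse=True)[:beam_size]

    results = []
    for prefix, seq in beams:
        (doc,) = index[prefix]  # assumption: identifiers are unique
        results.append((doc, seq + plan_scores.get(doc, 0.0)))
    return results


# Toy data (hypothetical): three documents with two-token identifiers.
doc_ids = {"d1": (1, 2), "d2": (1, 3), "d3": (4, 5)}
scores = {(0, 1): 1.0, (0, 4): 0.2, (1, 2): 0.3, (1, 3): 0.4, (1, 5): 0.9}
token_score = lambda step, tok: scores[(step, tok)]
plan_scores = {"d1": 0.5, "d3": 2.0}

with_plan = guided_beam_search(doc_ids, plan_scores, token_score, beam_size=1)
without_plan = guided_beam_search(doc_ids, {}, token_score, beam_size=1)
```

With a beam of size 1, the unguided search commits to the locally best first token and can never reach `d3`, whereas the bonus keeps the prefix leading to `d3` alive. The same mechanism suggests why a drifting planning signal can hurt: if `plan_scores` favors the wrong documents, pruning is steered toward the wrong prefixes.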

Novelty

The paper's main novelty is a systematic reproduction and robustness analysis of PAG that instruments the intermediate planning signal rather than evaluating only end-to-end ranking metrics. It introduces explicit plan-drift and plan-collapse diagnostics (CandOverlap@K, TokJaccard@ℓ, PlanSwapDrop, SeqGain) and extends evaluation to fixed-index cross-lingual query shift with query-side mitigation strategies that avoid re-indexing.
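The summary names these diagnostics without giving formulas, so the sketch below only illustrates one plausible reading of the two overlap measures. It assumes CandOverlap@K is the fraction of the original query's top-K planner candidates that also appear in the perturbed query's top-K, and that TokJaccard@ℓ is the Jaccard similarity of the top-ℓ planner token sets. Both definitions and the function names are assumptions. PlanSwapDrop and SeqGain are left out because their definitions cannot be inferred from the text.

```python
def cand_overlap_at_k(orig_ranked, pert_ranked, k):
    """Share of the original top-k candidates that survive in the perturbed top-k."""
    top_orig = set(orig_ranked[:k])
    top_pert = set(pert_ranked[:k])
    if not top_orig:
        return 0.0
    return len(top_orig & top_pert) / len(top_orig)


def tok_jaccard_at_l(orig_tokens, pert_tokens, l):
    """Jaccard similarity between the top-l planner tokens of two queries."""
    a = set(orig_tokens[:l])
    b = set(pert_tokens[:l])
    union = a | b
    if not union:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(union)


# Toy rankings (hypothetical): original query versus a misspelled variant.
orig_docs = ["d1", "d2", "d3", "d4"]
pert_docs = ["d2", "d9", "d1", "d7"]
orig_toks = ["cat", "dog", "fish"]
pert_toks = ["cat", "bird", "fish"]

overlap = cand_overlap_at_k(orig_docs, pert_docs, k=4)
jaccard = tok_jaccard_at_l(orig_toks, pert_toks, l=3)
```

Under this reading, a value near 1 means the perturbed query leaves the plan essentially unchanged, and a low value signals candidate-set or planner-token drift.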

Results

Using the released artifacts, the authors reproduce PAG's headline effectiveness within 0.002 absolute difference, including MRR@10 of 0.386 on MS MARCO Dev and NDCG@10 of 0.703/0.701 on TREC-DL 2019/2020. Stress tests show that misspellings, synonym substitutions, and paraphrases cause substantially larger effectiveness degradation (e.g., 0.217 NDCG@10 drop for misspellings on DL19) than reordering (0.014), with corresponding drops in candidate-set and planner-token overlap. In cross-lingual settings with a fixed English index, query translation yields the strongest recovery (e.g., MRR@10 improving from 0.090 to 0.230 for Dutch), while planner-token alignment provides only partial gains.
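For readers unfamiliar with the two metrics quoted above, the following sketch shows the per-query computations in their common textbook form. It assumes linear gain with a log2 rank discount for NDCG, which may differ from the exact evaluation tool the authors used. The reported figures are averages of such per-query values over all queries. Function names and toy data are hypothetical.

```python
import math


def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, grades, k=10):
    """NDCG@k with linear gain (an assumption) and a log2 rank discount."""
    def dcg(gains):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

    actual = dcg([grades.get(doc, 0) for doc in ranked[:k]])
    ideal = dcg(sorted(grades.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


# Toy judgments (hypothetical): graded relevance labels for one query.
grades = {"x": 3, "y": 1}
perfect = ndcg_at_k(["x", "y"], grades)
swapped = ndcg_at_k(["y", "x"], grades)
first_hit_at_two = mrr_at_k(["a", "b", "c"], {"b"})
```

An absolute drop such as 0.217 NDCG@10 is therefore a difference between two such averaged scores on the same query set.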

Key Points

  1. The released PAG artifacts are sufficient to reproduce the paper's main inference-time effectiveness results within 0.002 absolute difference, along with the qualitative beam-size versus latency trade-off. Ablations confirm that removing the look-ahead term reduces MRR@10 by 0.036 and that planning-only retrieval drops further by 0.083.
  2. The planning signal is brittle under lexical surface-form variation: candidate-set overlap (CandOverlap@100) drops to 0.31–0.50 for misspellings and synonyms versus ~0.80 for reordering, with plan collapse rates reaching 9.6–11.6% on TREC-DL under harder perturbations, coinciding with weakened guided decoding.
  3. Under a fixed English index, cross-lingual performance degrades substantially (e.g., naive Dutch MRR@10 of 0.090), and translating queries into English is considerably more effective than lightweight planner-token alignment for restoring planner overlap and retrieval quality (Dutch MRR@10 0.230 vs. 0.107).

This page was created using generative AI such as GPT-5, Claude Opus 4, Gemini 3, Gemini 3.1 Flash Image, and their higher-end successor versions. No guarantee can be made regarding its contents.